Our Apache YARN cluster hosts the flights data, which covers 123 million flights over 22 years. Read the lecture notes on how to access the YARN cluster. Connect to the database using sparklyr and answer the following questions. You can base your answers on a specific year or on the whole data set.
Map the top 10 busiest airports. Size of dots should reflect the number of flights through that destination. Hint: You may find this tutorial on Making Maps in R helpful.
The top 10 busiest airports are all in the United States:
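The counts behind a map like this can be sketched with dplyr verbs, which sparklyr translates to Spark SQL. This is a minimal sketch, not the code used for the figure: the column name `dest` is assumed, `busiest_airports` is a hypothetical helper, and a small made-up tibble stands in for the Spark table so the example runs locally.

```r
library(dplyr)

# Made-up stand-in for the Spark table; on the cluster this would be a
# table reference obtained through the sparklyr connection instead.
flights_tbl <- tibble(
  dest = c("LAX", "LAX", "LAX", "ORD", "ORD", "ATL")
)

# Hypothetical helper: count flights per destination, keep the n busiest.
busiest_airports <- function(tbl, n = 10) {
  tbl %>%
    group_by(dest) %>%
    summarise(n_flights = n()) %>%
    arrange(desc(n_flights)) %>%
    head(n) %>%
    collect()  # pulls the small result into R; a no-op on a local tibble
}

top_airports <- busiest_airports(flights_tbl)
print(top_airports)
```

Only the aggregated rows are collected, so the dot sizes on the map can be driven by `n_flights` without moving the full data set out of Spark.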
Map the top 10 busiest direct routes. Size of lines should reflect the number of flights through that route.
Again, the busiest routes are within the United States:
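Route counts follow the same pattern, grouping on the origin and destination pair. The sketch below assumes columns named `origin` and `dest` and uses made-up rows. It treats LAX to SFO and SFO to LAX as two different routes, which is one possible reading of "direct route" and not necessarily the one used for the figure.

```r
library(dplyr)

# Made-up stand-in for the Spark flights table.
flights_tbl <- tibble(
  origin = c("LAX", "LAX", "LAX", "SFO", "SFO", "ORD"),
  dest   = c("SFO", "SFO", "SFO", "LAX", "LAX", "LGA")
)

# Count flights per directed origin-destination pair.
top_routes <- flights_tbl %>%
  group_by(origin, dest) %>%
  summarise(n_flights = n(), .groups = "drop") %>%
  arrange(desc(n_flights)) %>%
  head(10) %>%
  collect()

print(top_routes)
```

The `n_flights` column can then be mapped to line width when drawing each route.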
Visualize and explain some prominent features you observe. For example, what happened at points 1-5?
Point | Possible Explanation |
---|---|
1 | Two hijacked airplanes were flown into the Twin Towers (September 11, 2001), which tightened airport security and heightened fear of flying. |
2 | This seems to be close to Thanksgiving 2004, so people may not be flying as much on the holiday itself. There seem to be similar dips in other years. |
3 | This seems to be close to the 4th of July 2004, so again, maybe people aren’t flying as much when they could be BBQ-ing. |
4 | There was a recession. Jet fuel was expensive and airlines were trying to raise the price of flights. This caused a dip in flights for the rest of the year. |
5 | From some Googling, it appears LAX underwent some renovations beginning in 2000 which could attract more flyers. |
From the plot, we can see that there are consistently fewer flights during the winter, most likely due to the weather. There are usually more flights during the summer, most likely because people are able to take more vacations.
There are consistently fewer flights on Fridays and Saturdays. If we are speculating, this could be because people want to travel on a weekday in order to enjoy their vacations on the weekend, or it could be because businesspeople fly on workdays.
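A weekday breakdown like the one described can be sketched as follows. The column name `dayofweek` and its coding (1 = Monday through 7 = Sunday) are assumptions and should be checked against the data dictionary before reading anything into the labels. The rows are made up so the example runs without a cluster.

```r
library(dplyr)

# Assumed coding: 1 = Monday, ..., 7 = Sunday.
day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Made-up stand-in for the Spark flights table.
flights_tbl <- tibble(dayofweek = c(1, 1, 2, 5, 6, 7, 7, 7))

flights_by_weekday <- flights_tbl %>%
  group_by(dayofweek) %>%
  summarise(n_flights = n()) %>%
  arrange(dayofweek) %>%
  collect() %>%
  # Labelling happens locally, after the aggregated rows are collected.
  mutate(weekday = factor(day_labels[dayofweek], levels = day_labels))

print(flights_by_weekday)
```

If the labels are off by one, the "quiet" days shift accordingly, so it is worth confirming the coding before interpreting the pattern.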
Size of dots should reflect the number of flights from LAX to that destination.
Build a predictive model for the arrival delay (arrdelay) of flights flying from LAX. Use the same filtering criteria as in the lecture notes to construct training and validation sets. You are allowed to use a maximum of 5 predictors. The prediction performance of your model on the validation data set will be an important factor for grading this question.
## Call: ml_linear_regression.tbl_spark(., arrdelay ~ distance + depdelay + uniquecarrier)
##
## Deviance Residuals (approximate):
## Min 1Q Median 3Q Max
## -193.787 -7.074 -1.377 5.197 332.139
##
## Coefficients:
## (Intercept) distance depdelay uniquecarrier_WN
## -0.69481192 -0.00308328 1.01217764 -1.93405615
## uniquecarrier_OO uniquecarrier_AA uniquecarrier_UA uniquecarrier_DL
## 0.74734641 3.52663710 1.34094813 3.48834362
## uniquecarrier_AS uniquecarrier_MQ uniquecarrier_CO uniquecarrier_NW
## 0.07762555 0.13152935 4.92376907 1.26046455
## uniquecarrier_US uniquecarrier_HP uniquecarrier_XE uniquecarrier_F9
## 2.26862299 0.23483827 1.85339215 2.09218498
## uniquecarrier_TZ uniquecarrier_FL uniquecarrier_EV uniquecarrier_YV
## 5.18135737 2.98744942 0.89537424 0.21310320
## uniquecarrier_HA uniquecarrier_DH
## 0.18129594 11.80158875
##
## R-Squared: 0.8852
## Root Mean Squared Error: 13.33
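To make the fitted coefficients concrete, the sketch below plugs a few of them into the linear formula by hand and defines the RMSE used to score predictions. The helper names `predict_arrdelay` and `rmse` are hypothetical, only two of the carrier coefficients are copied over, and the example flight (337 miles, 10 minutes late at departure, carrier WN) is purely illustrative.

```r
# Coefficients copied from the model summary above (subset of carriers only).
coefs <- c(
  intercept        = -0.69481192,
  distance         = -0.00308328,
  depdelay         =  1.01217764,
  uniquecarrier_WN = -1.93405615,
  uniquecarrier_AA =  3.52663710
)

# Hypothetical helper: linear prediction for one flight.
# Fails for carriers whose coefficient was not copied into `coefs`.
predict_arrdelay <- function(distance, depdelay, carrier) {
  carrier_term <- coefs[[paste0("uniquecarrier_", carrier)]]
  coefs[["intercept"]] +
    coefs[["distance"]] * distance +
    coefs[["depdelay"]] * depdelay +
    carrier_term
}

# Root mean squared error between observed and predicted delays.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Illustrative flight: 337 miles, 10 minutes late leaving, carrier WN.
example_prediction <- predict_arrdelay(337, 10, "WN")
print(example_prediction)  # roughly 6.45 minutes
```

The depdelay coefficient is close to 1, so in this model most of the predicted arrival delay is simply the departure delay carried through, with distance and carrier acting as small adjustments.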
Visualize and explain any other information you want to explore.
Surprisingly, most delays were not caused by security or weather, but mainly by the National Aviation System (NAS) and by late-arriving aircraft.
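A comparison of delay causes can be sketched by summing the delay minutes attributed to each cause. The column names below (`carrierdelay`, `weatherdelay`, `nasdelay`, `securitydelay`, `lateaircraftdelay`) are assumed, and the numbers are made up purely so the code runs; they are not results from the cluster.

```r
library(dplyr)

# Made-up rows standing in for the Spark flights table (NA = cause not reported).
flights_tbl <- tibble(
  carrierdelay      = c(10,  0, NA),
  weatherdelay      = c( 0,  0, NA),
  nasdelay          = c( 5, 20, NA),
  securitydelay     = c( 0,  0, NA),
  lateaircraftdelay = c( 0, 30, NA)
)

# Total delay minutes per cause, ignoring missing values.
delay_totals <- flights_tbl %>%
  summarise(
    carrier       = sum(carrierdelay, na.rm = TRUE),
    weather       = sum(weatherdelay, na.rm = TRUE),
    nas           = sum(nasdelay, na.rm = TRUE),
    security      = sum(securitydelay, na.rm = TRUE),
    late_aircraft = sum(lateaircraftdelay, na.rm = TRUE)
  ) %>%
  collect()

# Named vector of totals, largest first.
sorted_totals <- sort(unlist(delay_totals), decreasing = TRUE)
print(sorted_totals)
```

Summing minutes weights long delays more heavily than counting delayed flights would, so the ranking of causes may differ depending on which measure is used.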