Our Apache YARN cluster hosts the flights data, which covers 123 million flights over 22 years. Read the lecture notes on how to access the YARN cluster. Connect to the database using sparklyr and answer the following questions. You can base your answers on a specific year or on the whole data set.

Busiest Airports

Map the top 10 busiest airports. Size of dots should reflect the number of flights through that destination. Hint: You may find this tutorial on Making Maps in R helpful.

The top 10 busiest airports are all in the United States:
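
A ranking like this can be sketched in base R on a small local table. The data frame below is made-up toy data, and the column names `origin` and `dest` are assumptions about the flights schema. Counting a flight at both its origin and its destination is one possible definition of "busiest", not necessarily the one used for the map.

```r
# Toy stand-in for the flights table; the real data lives on the cluster.
# Column names `origin` and `dest` are assumed, and the rows are invented.
flights <- data.frame(
  origin = c("LAX", "LAX", "LAX", "ORD", "ATL", "SFO"),
  dest   = c("ORD", "ATL", "SFO", "LAX", "LAX", "ORD"),
  stringsAsFactors = FALSE
)

# Count every flight once at its origin and once at its destination.
traffic <- table(c(flights$origin, flights$dest))

# Sort in decreasing order and keep the top n airports (n = 10 on the full data).
top_airports <- head(sort(traffic, decreasing = TRUE), 2)
print(top_airports)
```

The resulting counts could then be joined to airport coordinates and used as the dot size on the map.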

Busiest Routes

Map the top 10 busiest direct routes. Size of lines should reflect the number of flights through that route.

Again, the busiest routes are within the United States:
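
As a rough illustration of how route counts might be obtained, the sketch below tallies origin-destination pairs in base R. The rows are invented and the column names `origin` and `dest` are assumed. It treats A to B and B to A as separate routes; merging the two directions would be an equally reasonable choice.

```r
# Toy flights; column names `origin` and `dest` are assumed, rows are invented.
flights <- data.frame(
  origin = c("LAX", "LAX", "LAX", "SFO", "SFO", "LAX"),
  dest   = c("SFO", "SFO", "SFO", "LAX", "LAX", "ORD"),
  stringsAsFactors = FALSE
)

# Build a directed route label such as "LAX-SFO" for every flight.
route <- paste(flights$origin, flights$dest, sep = "-")

# Count flights per route and keep the busiest n (n = 10 on the full data).
route_counts <- sort(table(route), decreasing = TRUE)
top_routes <- head(route_counts, 2)
print(top_routes)
```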

LAX:

Reproduce plot.

Visualize and explain some prominent features you observe. For example, what happened at points 1-5?

Point Possible Explanation
1 Hijacked airplanes were flown into the Twin Towers (September 11, 2001), which led to tightened airport security and a heightened fear of flying.
2 This point is close to Thanksgiving 2004, so people may not have been flying as much on the holiday itself. Similar dips appear in other years.
3 This point is close to the 4th of July 2004, so again, people may not have been flying as much when they could have been BBQ-ing.
4 There was a recession. Jet fuel was expensive and airlines were trying to raise the price of flights, which likely caused a dip in flights for the rest of the year.
5 From some Googling, it appears that LAX underwent renovations beginning in 2000, which could have attracted more flyers.

Visualize and explain seasonal effects.

The plot shows consistently fewer flights during the winter, most likely because of the weather, and more flights during the summer, most likely because more people take vacations then.
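
One way to check a seasonal pattern like this is to count flights per calendar month across all years. The sketch below does so in base R on a handful of invented dates; on the real data the dates would be derived from whatever year, month, and day columns the flights table provides.

```r
# Invented flight dates spanning two years (toy data only).
dates <- as.Date(c("2004-01-10", "2004-07-04", "2004-07-15", "2005-01-20",
                   "2005-07-01", "2005-07-30", "2005-12-25"))

# Extract the month number (1 to 12) from each date.
month <- as.integer(format(dates, "%m"))

# Tabulate over all 12 months so that months without flights show up as 0.
flights_per_month <- table(factor(month, levels = 1:12))
print(flights_per_month)
```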

Visualize and explain weekly effects.

There are consistently fewer flights on Fridays and Saturdays. This is speculation, but it could be because people prefer to travel on a weekday so they can enjoy the weekend at their destination, or because business travelers fly on workdays.
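
A day-of-week breakdown can be sketched along the same lines. The example below uses invented dates and the `wday` field of `POSIXlt` (0 = Sunday through 6 = Saturday), which avoids locale-dependent weekday names. If the flights table already carries a day-of-week column, that column could be counted directly instead.

```r
# Invented flight dates from July 2004 (toy data only).
dates <- as.Date(c("2004-07-05", "2004-07-06", "2004-07-09",
                   "2004-07-10", "2004-07-12", "2004-07-12"))

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday).
wday <- as.POSIXlt(dates)$wday

# Label the weekdays explicitly and count flights for each one.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
flights_per_weekday <- table(factor(wday, levels = 0:6, labels = day_labels))
print(flights_per_weekday)
```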

Map top 10 destinations from LAX.

Size of dots should reflect the number of flights from LAX to that destination.

Arrival Delay Prediction

Build a predictive model for the arrival delay (arrdelay) of flights flying from LAX. Use the same filtering criteria as in the lecture notes to construct training and validation sets. You are allowed to use a maximum of 5 predictors. The prediction performance of your model on the validation data set will be an important factor for grading this question.

## Call: ml_linear_regression.tbl_spark(., arrdelay ~ distance + depdelay + uniquecarrier)  
## 
## Deviance Residuals (approximate):
##      Min       1Q   Median       3Q      Max 
## -193.787   -7.074   -1.377    5.197  332.139 
## 
## Coefficients:
##      (Intercept)         distance         depdelay uniquecarrier_WN 
##      -0.69481192      -0.00308328       1.01217764      -1.93405615 
## uniquecarrier_OO uniquecarrier_AA uniquecarrier_UA uniquecarrier_DL 
##       0.74734641       3.52663710       1.34094813       3.48834362 
## uniquecarrier_AS uniquecarrier_MQ uniquecarrier_CO uniquecarrier_NW 
##       0.07762555       0.13152935       4.92376907       1.26046455 
## uniquecarrier_US uniquecarrier_HP uniquecarrier_XE uniquecarrier_F9 
##       2.26862299       0.23483827       1.85339215       2.09218498 
## uniquecarrier_TZ uniquecarrier_FL uniquecarrier_EV uniquecarrier_YV 
##       5.18135737       2.98744942       0.89537424       0.21310320 
## uniquecarrier_HA uniquecarrier_DH 
##       0.18129594      11.80158875 
## 
## R-Squared: 0.8852
## Root Mean Squared Error: 13.33
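
Because grading depends on validation performance, it may help to compute the root mean squared error on held-out data explicitly. The sketch below shows the idea locally with base R's `lm` rather than Spark; the `rmse` helper, the toy numbers, and the single-predictor formula are illustrative assumptions, not the model fitted above.

```r
# Hypothetical helper: root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Invented training and validation sets (delays in minutes).
train <- data.frame(depdelay = c(0, 5, 10, 20, 40),
                    arrdelay = c(-2, 4, 9, 22, 41))
valid <- data.frame(depdelay = c(15, 30),
                    arrdelay = c(14, 33))

# Fit on the training rows only, then score on the validation rows.
fit <- lm(arrdelay ~ depdelay, data = train)
valid_rmse <- rmse(valid$arrdelay, predict(fit, newdata = valid))
print(valid_rmse)
```

Comparing this validation figure with the training RMSE gives a quick check for overfitting.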

Other Insights

Visualize and explain any other information you want to explore.

Surprisingly, most delays were not caused by security or weather, but mainly by the National Aviation System (NAS) and by late-arriving aircraft.
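
A comparison of delay causes like this can be made by summing the delay minutes attributed to each cause and expressing them as shares of the total. In the sketch below the numbers are invented, and the column names (`carrierdelay`, `weatherdelay`, `nasdelay`, `securitydelay`, `lateaircraftdelay`) are assumptions about the flights schema.

```r
# Invented delay minutes per flight, one column per cause (assumed column names).
delays <- data.frame(
  carrierdelay      = c(10, 0, 5),
  weatherdelay      = c(0, 0, 3),
  nasdelay          = c(20, 15, 0),
  securitydelay     = c(0, 1, 0),
  lateaircraftdelay = c(30, 0, 16)
)

# Total minutes per cause, ignoring missing values.
total_by_cause <- colSums(delays, na.rm = TRUE)

# Share of all delay minutes attributable to each cause, in percent.
share <- round(100 * total_by_cause / sum(total_by_cause), 1)
print(sort(share, decreasing = TRUE))
```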