Sean Hall and Julia Baratta's Data Science Project

The goal of this project is to evaluate whether we can predict the size of an airline's arrival delay from information a flyer knows before the plane takes off. We chose to evaluate arrival delay (the difference between scheduled arrival and actual arrival) because we believe it is more important to know when a plane will land than when it will take off. This information could come in handy for a variety of reasons, including making connections and getting to weddings, funerals, and interviews on time.

To achieve this goal, we imported, cleaned, analyzed, and merged a dataset about flight information from 2013 to 2023 with datasets about weather in major US cities over the same period. We called this final merged dataset 'MLready', and it can be found using this link. We then used MLready to create a random forest regression machine learning model that could predict flight delays with an R^2 value of 0.93 on average. Finally, we used our model to predict the flight delays for both of our flights home for winter break.

Here's the link to our website: Visit Sean and Julia's Website

For more information, here's the link to our github repo: Visit Sean and Julia's GitHub Repo

In this cell, we import the necessary libraries and initialize the starting year and ending year. We chose 2013-2023 to investigate a solid ten-year span of airline data without clouding our conclusions with data from too far in the past.
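As a rough sketch, the setup cell might look something like the following (the exact libraries and variable names, such as start_year and end_year, are our illustrative choices):

```python
# Setup sketch: libraries used throughout the notebook and the year range we analyze.
import time

import pandas as pd
import matplotlib.pyplot as plt

start_year = 2013   # first year of airline data to include
end_year = 2023     # last year (only 8 months of 2023 data are available)
```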

From the large csv files, we selected relevant columns to read in. We also limited our dataset to include only flights to and from 10 major airports. Airport codes can be found on the Bureau of Transportation Statistics website (via the 'Get Lookup Table' link to the right of the Origin Airport ID field): https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr
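For illustration, the column selection and airport filter might be set up like this; the column names follow the BTS on-time performance schema, and the airport IDs shown are a hypothetical top-10 list that should be checked against the lookup table linked above:

```python
# Columns to read from each BTS on-time performance CSV (illustrative subset).
use_cols = [
    "Year", "Month", "DayofMonth", "Reporting_Airline",
    "OriginAirportID", "DestAirportID",
    "DepDelayMinutes", "ArrDelayMinutes", "Cancelled",
]

# Hypothetical BTS airport IDs for the ten major airports we kept
# (verify against the BTS 'Get Lookup Table' linked above).
Top10AirportCodes = [
    10397,  # ATL
    11292,  # DEN
    11298,  # DFW
    12478,  # JFK
    12889,  # LAS
    12892,  # LAX
    13487,  # MSP
    13930,  # ORD
    14747,  # SEA
    14771,  # SFO
]
```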

Additionally, we chose to start a timer and initialize an empty list where the dataframes will be stored.

The main purpose of this cell is to filter the airline data to only include years 2013 to 2023 and flights flown through our chosen airports. Below, a for loop reads in the files from our computer and combines them into a single dataframe. Note: the year 2023 only has 8 months of data. This code will print an exception if a file was not downloaded properly. Additionally, only rows where the columns "DestAirportID" and "OriginAirportID" have a value in Top10AirportCodes are read into the large file. This cell also documents the elapsed time and lets us know when the program is complete. The dataset is then written to a file (called 10Airlines.csv) in the folder airline_data.
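A sketch of how that loop might look, building on the names defined above (the file naming convention is an assumption):

```python
# Read each monthly BTS file, keep only flights between our ten airports,
# and combine everything into one DataFrame.
start = time.time()   # timer from the previous cell
frames = []           # list that will hold one filtered DataFrame per file

for year in range(start_year, end_year + 1):
    months = range(1, 9) if year == 2023 else range(1, 13)  # 2023 has only 8 months
    for month in months:
        path = f"airline_data/On_Time_{year}_{month}.csv"   # hypothetical file naming
        try:
            chunk = pd.read_csv(path, usecols=use_cols, low_memory=False)
        except Exception as e:
            print(f"Problem reading {path}: {e}")            # flag badly downloaded files
            continue
        mask = (chunk["OriginAirportID"].isin(Top10AirportCodes)
                & chunk["DestAirportID"].isin(Top10AirportCodes))
        frames.append(chunk[mask])

df = pd.concat(frames, ignore_index=True)
df.to_csv("airline_data/10Airlines.csv", index=False)
print(f"Done in {time.time() - start:.1f} seconds")
```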

This cell reads in the DataFrame and the foreign keys, which map codes to strings, cleans the data, and fills in missing values where they make sense. Notably, we decided to drop the cancellation rows, as these would confuse the machine learning model for predicting delays.

2% of rows have NaN values. We looked for a pattern among these rows and found none. We decided that dropping these rows would not meaningfully affect our data, so we dropped them.
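A minimal sketch of this cleaning step, assuming the BTS column names used above:

```python
import pandas as pd

df = pd.read_csv("airline_data/10Airlines.csv")

# Cancelled flights have no arrival delay to predict, so drop them.
df = df[df["Cancelled"] == 0]

# Roughly 2% of the remaining rows contain NaNs; drop them after confirming
# they show no obvious pattern.
print(f"{df.isna().any(axis=1).mean():.1%} of rows contain at least one NaN")
df = df.dropna()
```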

Exploratory analysis: A brief look at correlations via a correlation matrix for numeric columns.
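For example, a quick correlation check could look like this (the target column name is an assumption):

```python
# Pairwise correlations among the numeric columns, sorted by their
# correlation with the arrival delay.
corr = df.select_dtypes(include="number").corr()
print(corr["ArrDelayMinutes"].sort_values(ascending=False))
```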

Then, we thought it was important to quantify what a good threshold for 'delay' would be. The question we hoped to answer here was: If a flight was delayed for 1 minute, is it really considered delayed? What about 2 minutes? 5? 10?

A 5-minute delay is a bit above the median, so this will be our threshold for a delay. We then create a new Boolean column for whether the flight was delayed or not. Then we rewrite the DataFrame into a csv for convenience. This csv is titled 'PreWeatherML' because the next step here is to merge this df with weather data in the 10 cities that we focused on.
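A sketch of that thresholding step, under the same column-name assumptions:

```python
# The median arrival delay sits a bit below 5 minutes, so flag anything above
# 5 minutes as delayed and save the result for the weather merge.
print("Median arrival delay:", df["ArrDelayMinutes"].median())
df["Delayed"] = df["ArrDelayMinutes"] > 5
df.to_csv("PreWeatherML.csv", index=False)
```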

Here's some additional exploration of the Airlines dataset, which highlights the average arrival delay per airline per year.
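One way to compute and plot that comparison (the airline and delay column names are assumptions):

```python
import matplotlib.pyplot as plt

# Mean arrival delay per airline per year, one line per airline.
avg_delay = (
    df.groupby(["Year", "Reporting_Airline"])["ArrDelayMinutes"]
      .mean()
      .unstack("Reporting_Airline")
)
ax = avg_delay.plot(figsize=(10, 5))
ax.set_ylabel("Mean arrival delay (minutes)")
ax.set_xlabel("Year")
plt.show()
```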

Lastly, we found some fun facts out of curiosity.

To prepare for merging, we first import the necessary libraries. We also added a line to ignore warnings to clean up the notebook.

Our second dataset consists of snow and rain fall in major US cities every day of the year from 2013 to 2023. Here, we have read in the file for New York City. Our goal is to use weather in the top 10 busiest airports in the US to generalize the effect that rain and snow have on airport delays in the US. In the graph below, we can see the yearly and seasonal variation in total precipitation. Orange spikes clearly indicate winters.

Also, here's a link to the September 2023 floods in NYC, which correspond to the large rain spike at the end of the data: https://en.wikipedia.org/wiki/September_2023_New_York_floods

Zoomed in, we can see the daily rain and snow in January of 2023. We can see here that there was no snow in January of 2023 in NYC.

Here we can see the correlation between delays leaving NYC and precipitation. Some spikes line up perfectly (Jan 2nd and 12th) and some spikes are shifted by one day. This indicates that weather can cause delays the same day or the day after. To incorporate this observation, we decided to combine the precipitation and snow for the last two days into separate variables for machine learning.

Additionally, we can see that a similar correlation exists between arrivals in NYC and precipitation.

In fact, delays leaving NYC and delays going into NYC are almost perfectly correlated. Clearly, delays are largely unaffected by whether a flight is arriving at or departing from the airport. Below are some additional relationships that can be derived from the weather dataset.

2016 & 2022 did not get a lot of rain.

Here's a link to a wiki page that talks about the 2016 drought in NY:

https://en.wikipedia.org/wiki/2016_New_York_drought

Here's a link to a drought declaration in the summer months of 2022:

https://www.dec.ny.gov/press/126319.html#:~:text=In%20August%202022%2C%20Governor%20Kathy,flows%20and%20low%20groundwater%20levels.

We can see here that snow levels have been decreasing on average. This is possibly a small-scale reflection of atmospheric warming. Here's a link to a blog post about how snow in New York is decreasing over the years:

https://medium.com/talk-new-york-city/is-snow-in-new-york-city-becoming-more-or-less-common-in-recent-years-0d4a6ed9bcdc

Now that we have inspected some trends in a subset of the weather data, we move forward with merging the weather and airline datasets. We chose this approach so that we can feed one dataframe into the ML model by selecting dataframe columns as variables. To prepare for merging, we read in the 10 weather datasets and renamed the columns to match the names in the airline dataset.

Additionally, we created two columns in the weather datasets that account for the weather for the last two days. Then, we checked for any NaN values.
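A sketch of this preparation, assuming NOAA-style daily summaries with DATE, PRCP, and SNOW columns and a hypothetical airport-to-file mapping:

```python
import pandas as pd

weather_files = {                      # hypothetical mapping: airport code -> weather CSV
    "JFK": "weather_data/nyc.csv",
    "ATL": "weather_data/atlanta.csv",
    # ... remaining eight airports ...
}

weather_dfs = []
for airport, path in weather_files.items():
    w = pd.read_csv(path, parse_dates=["DATE"]).sort_values("DATE")
    w["ORIGIN"] = airport              # renamed to match the airline data
    # Sum precipitation and snow over the current and previous day, capturing
    # both same-day and next-day weather effects.
    w["PRCP_2DAY"] = w["PRCP"].rolling(2, min_periods=1).sum()
    w["SNOW_2DAY"] = w["SNOW"].rolling(2, min_periods=1).sum()
    weather_dfs.append(w)

# Check each dataset for missing values.
for airport, w in zip(weather_files, weather_dfs):
    print(airport, "has", int(w.isna().sum().sum()), "NaN values")
```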

3 datasets have NaN values. 2 of them (SEW and DEW) are a small enough portion of the dataset that we're going to fill them with zero. The other (LAX) has almost all SNOW days filled with NaN values. Here's a link (https://www.laalmanac.com/weather/we17.php) to weather data reporting snow days in LA. Snow is infrequent enough that we will fill the NaN values with 0.

Next, we adjusted the columns of the 10Airlines (df) dataset to match those of the weather data. Primarily, this consisted of converting the month column using a month map (mm) and converting Day of Month and Year to the right format. Then, we created a date column that matches the weather data's 'DATE' column.
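A sketch of that date construction, with the month map and column names as assumptions:

```python
# Map month names to numbers (if needed) and build a DATE column that matches
# the weather data's DATE column.
mm = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6,
      "July": 7, "August": 8, "September": 9, "October": 10, "November": 11,
      "December": 12}

if df["Month"].dtype == object:
    df["Month"] = df["Month"].map(mm)

df["DATE"] = pd.to_datetime(dict(year=df["Year"].astype(int),
                                 month=df["Month"].astype(int),
                                 day=df["DayofMonth"].astype(int)))
```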

Then, we renamed the airline columns to match the weather data and concatenated every weather dataset into one large dataset titled 'concat_weather'. Finally, we can merge the two datasets. The first merge was done for the origin airport and joined on 'ORIGIN' and 'DATE'; this added weather data for the airport that the flight was departing from. Then, columns were renamed to indicate that they were associated with the origin to avoid confusion.

For the second merge, the 'ORIGIN' column of the 'concat_weather' dataset was renamed to 'DEST'. After the columns were prepped, the dataset was merged on 'DEST' and 'DATE'. This allowed the same weather data to act as data for both the origin and destination.
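A sketch of the two merges, continuing with the column names assumed above:

```python
# Stack all ten weather tables, then merge them onto the airline data twice:
# once keyed on the origin airport, once keyed on the destination airport.
concat_weather = pd.concat(weather_dfs, ignore_index=True)

# First merge: weather at the departure airport.
merged = df.merge(concat_weather, on=["ORIGIN", "DATE"], how="left")
merged = merged.rename(columns={"PRCP": "ORIGIN_PRCP", "SNOW": "ORIGIN_SNOW",
                                "PRCP_2DAY": "ORIGIN_PRCP_2DAY",
                                "SNOW_2DAY": "ORIGIN_SNOW_2DAY"})

# Second merge: the same weather table, re-keyed on the destination airport.
dest_weather = concat_weather.rename(columns={"ORIGIN": "DEST"})
merged = merged.merge(dest_weather, on=["DEST", "DATE"], how="left")
merged = merged.rename(columns={"PRCP": "DEST_PRCP", "SNOW": "DEST_SNOW",
                                "PRCP_2DAY": "DEST_PRCP_2DAY",
                                "SNOW_2DAY": "DEST_SNOW_2DAY"})
```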

The following cells check the correctness of the merge, cross-referencing the original data present in each individual weather dataset for both the origin airport and the destination airport. We do this at three random places within the entire dataset.

Finally, we set the index to datetime and wrote this merged DataFrame to one large final csv file, which acts as the input for our machine learning model.

To prepare for machine learning, we first import the necessary libraries, which include a OneHotEncoder, ColumnTransformer, and StandardScaler to dummy encode and scale the data within the pipeline. Then, we import the Random Forest Regressor to act as an extension of the Decision Tree learning method that we learned in class. Finally, we import mean_absolute_error and r2_score to evaluate the effectiveness of the model.
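Those imports look roughly like this:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
```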

Here, the cleaned and merged dataset 'MLReady.csv' gets read in. Relevant features are defined as well as the variable that we will be predicting. We are quantitatively predicting the delay time, so we chose a regressor.

It's important to note here that we do include DepDelayMinutes as a variable in the regression model because it is data that a flyer would know before the flight takes off. We envision this being useful because a flight is often delayed incrementally, so if a flyer first hears that their flight has been delayed 15 minutes, they might want to know the likelihood that the flight will be delayed further based on other factors about the flight.

We then define the categorical and numerical features of the model and create a list to store the evaluation metrics for each year.

We transformed the categorical variables using One Hot Encoding. We also defined the scaler for the pipeline (Z-standardizing) and the preprocessor to deal with the categorical features, and created the model.
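A sketch of the feature setup and pipeline; the exact feature lists are assumptions based on the columns described above:

```python
ml = pd.read_csv("MLReady.csv")

# Assumed feature lists and target.
categorical_features = ["ORIGIN", "DEST", "Reporting_Airline"]
numeric_features = ["DepDelayMinutes",
                    "ORIGIN_PRCP_2DAY", "ORIGIN_SNOW_2DAY",
                    "DEST_PRCP_2DAY", "DEST_SNOW_2DAY"]
features = categorical_features + numeric_features
target = "ArrDelayMinutes"

# One-hot encode the categorical features and Z-standardize the numeric ones.
preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ("num", StandardScaler(), numeric_features),
])

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("forest", RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)),
])

metrics = []   # evaluation metrics collected for each year
```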

In order to account for trends from the previous year and predict the delays of the next, we implemented a for loop that iterates over each year of the dataset, trains a model on that year, and then tests how accurately that model predicts the next year. This requires a rolling regression approach. The loop prints the model's error for the current iteration, and after the for loop terminates, prints the average.
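A sketch of that rolling loop (assuming MLReady keeps a Year column):

```python
# Train on each year and test on the following year.
for year in range(2013, 2023):
    train = ml[ml["Year"] == year]
    test = ml[ml["Year"] == year + 1]

    model.fit(train[features], train[target])
    preds = model.predict(test[features])

    mae = mean_absolute_error(test[target], preds)
    r2 = r2_score(test[target], preds)
    metrics.append({"train_year": year, "MAE": mae, "R2": r2})
    print(f"Trained on {year}, tested on {year + 1}: MAE={mae:.2f}, R^2={r2:.3f}")

results = pd.DataFrame(metrics)
print("Average MAE:", round(results["MAE"].mean(), 2),
      "| Average R^2:", round(results["R2"].mean(), 3))
```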

In the model, "n_estimators = 10" can be increased for more accurate results. However, due to the size of our data, even n_estimators = 10 takes 20+ minutes to run on an average computer, so we considered this sufficient.

Below is a visualization of the MAE and R^2 values of our model evaluated every year.

As we can see, the results are fairly accurate, with R^2 values averaging around 0.93.

To further test the model, we will evaluate two real flights (both of our returning flights for winter break). Because 2023 was an incomplete year, we will train on 2022. Below you will find the input for our flights back home. We will take this input and turn it into a separate DataFrame.
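As an illustration, a single flight can be framed as a one-row DataFrame and fed through the same pipeline; every value below is a placeholder rather than our actual itinerary:

```python
# Retrain on 2022 (the last complete year), then predict one upcoming flight.
train_2022 = ml[ml["Year"] == 2022]
model.fit(train_2022[features], train_2022[target])

flight = pd.DataFrame([{
    "ORIGIN": "JFK",             # placeholder departure airport
    "DEST": "ATL",               # placeholder destination airport
    "Reporting_Airline": "DL",   # placeholder carrier code
    "DepDelayMinutes": 0,        # departure delay known so far
    "ORIGIN_PRCP_2DAY": 0.0,     # forecast precipitation at the origin
    "ORIGIN_SNOW_2DAY": 0.0,
    "DEST_PRCP_2DAY": 0.3,       # forecast rain at the destination
    "DEST_SNOW_2DAY": 0.0,
}])

print("Predicted arrival delay (minutes):", model.predict(flight)[0])
```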

We will use the same model as before. Here we can see that Sean's flight was predicted to be delayed 3 minutes and Julia's was predicted to be delayed 11 minutes. We attribute this larger delay to the rain that was predicted to be in Washington DC on the day of travel (determined through a weather app).

Notably, because New Orleans and Philadelphia (Julia's destination) were not cities we considered, we replaced them with Atlanta and Washington DC, respectively (weather in ATL and DC was used as well to keep it accurate). This highlights one fault of our model: it only considers the 10 most popular airports.