The goal of this project is to evaluate whether we can predict the size of an airline's arrival delay from information a flyer knows before the plane takes off. We chose to evaluate arrival delay (the difference between scheduled arrival and actual arrival) because we believe it is more important to know when a plane will land than when it will take off. This information could come in handy for a variety of reasons, from making flight connections to arriving on time for weddings, funerals, and interviews.
To achieve this goal, we imported, cleaned, analyzed, and merged a dataset about flight information from 2013 to 2023 and datasets about weather in major US cities over the same period. We called this final merged dataset 'MLready'; it can be found using this link. We then used MLready to build a random forest regression machine learning model that predicts flight delays with an average R^2 of 0.93. Finally, we used our model to predict the flight delays for both of our flights home for winter break.
Here's the link to our website: Visit Sean and Julia's Website
For more information, here's the link to our github repo: Visit Sean and Julia's GitHub Repo
In this cell, we import the necessary libraries and initialize the starting and ending years. We chose 2013-2023 to investigate a solid ten-year span of airline data without clouding our conclusions with data from too far in the past.
From the large csv files, we selected relevant columns to read in. We also limited our dataset to flights to and from 10 major airports. Airport codes can be found on the Bureau of Transportation Statistics website (to the right of the Origin Airport ID row, titled 'Get Lookup Table'): https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr
Additionally, we chose to start a timer and initialize an empty list where the dataframes will be stored.
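To double-check which airports the IDs in Top10AirportCodes (defined below) refer to, the BTS lookup table can be queried directly. Here is a minimal sketch, assuming pandas has been imported as pd (as in the cell below) and that the L_AIRPORT_ID.csv lookup file (the same one we use for cleaning later) has been downloaded into the airline_data folder:
#hypothetical sanity check: maps each of our ten BTS airport IDs to its description
lookup = pd.read_csv('airline_data/L_AIRPORT_ID.csv')
id_to_name = dict(zip(lookup['Code'], lookup['Description']))
for code in [11298, 12892, 10397, 11292, 12266, 12264, 13204, 13930, 12478, 14747]:
    print(code, id_to_name.get(code, 'unknown code'))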
'''housekeeping cell'''
import pandas as pd
import time
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
start_year = 2013
end_year = 2023
#relevant columns
selected_columns = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DOT_ID_Reporting_Airline',
'OriginAirportID', 'OriginCityName',
'DestAirportID','DestCityName', 'CRSDepTime', 'DepTime', 'DepDelayMinutes',
'TaxiOut', 'TaxiIn', 'CRSArrTime', 'ArrTime', 'ArrDelayMinutes',
'Cancelled', 'CancellationCode', 'Diverted', 'CRSElapsedTime',
'ActualElapsedTime', 'AirTime', 'Distance', 'CarrierDelay', 'WeatherDelay',
'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
#BTS IDs for our ten airports (DFW, LAX, ATL, DEN, IAH, IAD, MCO, ORD, JFK, SEA, in order)
Top10AirportCodes = [11298, 12892, 10397, 11292, 12266, 12264, 13204, 13930, 12478, 14747]
start_time = time.time()
data_frames = []
The main purpose of this cell is to filter the airline data to the years 2013 to 2023 and to flights between our chosen airports. Below, a for loop reads in the files from disk and combines them into a single dataframe. Note: the year 2023 only has 8 months of data. This code prints an error message if a file was not downloaded properly. Additionally, only rows where both "DestAirportID" and "OriginAirportID" have a value in Top10AirportCodes are kept. This cell also tracks elapsed time and lets us know when the program is complete. The dataset is then written to a file (called 10Airlines.csv) in the airline_data folder.
#includes 2013-2023
for year in range(start_year, end_year + 1):
#this is specific to the time at which we uploaded the data
#change '8' if necessary
max_month = 12 if year != 2023 else 8
for month in range(1, max_month + 1):
file_path = f"airline_data/{year}/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_{year}_{month}/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_{year}_{month}.csv"
try:
df = pd.read_csv(file_path, usecols=selected_columns)
df = df[df['OriginAirportID'].isin(Top10AirportCodes) & df['DestAirportID'].isin(Top10AirportCodes)]
data_frames.append(df)
#prints error message if something goes wrong
except FileNotFoundError:
print(f"File not found for {year}-{month}")
final_df = pd.concat(data_frames, ignore_index=True)
#keeps track of time spent, not necessary
end_time = time.time()
execution_time = end_time - start_time
print(f"Execution time: {execution_time} seconds")
output_dir = 'airline_data'
os.makedirs(output_dir, exist_ok=True)
#writes large csv file with the data that we want
final_df.to_csv(os.path.join(output_dir, '10Airlines.csv'), index=False)
This cell reads in the DataFrame and the foreign keys (which map codes to strings), cleans the data, and fills in missing values where they make sense. Notably, we decided to drop the rows for cancelled flights, since they would confuse a machine learning model for predicting delays.
df = pd.read_csv('airline_data/10Airlines.csv')
#Cleaning CancellationCodes
df["CancellationReason"] = df["CancellationCode"].replace({'A':"Carrier",'B':"Weather",'C':"National Air System",'D':"Security",np.nan:'No Delay'})
df.drop(columns=['CancellationCode'],inplace=True)
#Cleaning Airline name
DOT_ID_Airline_df = pd.read_csv('airline_data/L_AIRLINE_ID.csv')
ID_map_dict = dict(zip(DOT_ID_Airline_df['Code'], DOT_ID_Airline_df['Description'].str.split(':').str[0]))
df['Airline'] = df['DOT_ID_Reporting_Airline'].map(ID_map_dict)
df.drop(columns=['DOT_ID_Reporting_Airline'],inplace=True)
#Fix this outlier
df['Airline'] = df['Airline'].replace('ExpressJet Airlines LLC d/b/a aha!', 'ExpressJet Airlines')
#Cleaning Delays (Some airlines did NaN on delays when they didn't exist. These are equivalent to a 0 time delay.)
fill_columns = ['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay']
df[fill_columns] = df[fill_columns].fillna(0)
#Cleaning DayOfWeek
DayOfWeek_mapping = {1:'Mon',2:'Tues',3:'Wed',4:'Thurs',5:'Fri',6:'Sat',7:'Sun'}
df['DayOfWeek'] = df['DayOfWeek'].map(DayOfWeek_mapping)
#Cleaning OriginAirportID
ForeignKey = pd.read_csv('airline_data/L_AIRPORT_ID.csv')
ForeignKeyDict = dict(zip(ForeignKey['Code'], ForeignKey['Description']))
df['OriginAirport'] = df['OriginAirportID'].map(ForeignKeyDict)
df.drop(columns=['OriginAirportID'],inplace=True)
#Cleaning Months
ForeignKey = pd.read_csv('airline_data/L_MONTHS.csv')
ForeignKeyDict = dict(zip(ForeignKey['Code'], ForeignKey['Description']))
df['Month'] = df['Month'].map(ForeignKeyDict)
#Cleaning DestAirportID
ForeignKey = pd.read_csv('airline_data/L_AIRPORT_ID.csv')
ForeignKeyDict = dict(zip(ForeignKey['Code'], ForeignKey['Description']))
df['DestAirport'] = df['DestAirportID'].map(ForeignKeyDict)
df.drop(columns=['DestAirportID'],inplace=True)
#Dropping Canceled Rows for Regression Model
df = df[df['Cancelled'] == 0]
About 2% of rows have NaN values. We looked for a pattern among these rows and found none, so we decided the NaN values would not bias our data and dropped the rows.
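Here is a sketch of the kind of spot-check we ran (not the exact code), using the df from the cleaning cell above, before the rows are dropped:
#isolates the rows containing NaNs and compares their composition to the full data;
#if the NaNs clustered in one airline, airport, or year, these distributions
#would diverge noticeably from the overall ones
nan_rows = df[df.isna().any(axis=1)]
print(f"share of rows with NaNs: {len(nan_rows) / len(df):.2%}")
for col in ['Airline', 'OriginAirport', 'Year']:
    print(nan_rows[col].value_counts(normalize=True).head(3))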
df.dropna(inplace=True)
Exploratory analysis: A brief look at correlations via a correlation matrix for numeric columns.
#can read in if need be
df = pd.read_csv('PreWeatherML.csv')
target_columns = ['ArrDelayMinutes']
numeric_columns = df.select_dtypes(include=['number']).columns
correlation_matrix = df[numeric_columns].corr()
correlations_with_target = correlation_matrix[target_columns]
correlated_columns = correlations_with_target.drop(target_columns).sort_values(by='ArrDelayMinutes', ascending=False)
print(f"\nColumns most correlated with 'ArrDelayMinutes':")
print(correlated_columns)
Columns most correlated with 'ArrDelayMinutes':
                   ArrDelayMinutes
DepDelayMinutes           0.968142
CarrierDelay              0.674785
LateAircraftDelay         0.625217
NASDelay                  0.416574
WeatherDelay              0.289817
TaxiOut                   0.198743
DepTime                   0.111587
TaxiIn                    0.101011
CRSDepTime                0.082757
CRSArrTime                0.067497
SecurityDelay             0.032922
ActualElapsedTime         0.029687
Year                      0.017994
DayofMonth                0.002138
AirTime                  -0.003832
ArrTime                  -0.008112
CRSElapsedTime           -0.015637
Distance                 -0.019258
Cancelled                      NaN
Diverted                       NaN
Then, we thought it was important to quantify what a good threshold for 'delay' was. The question we hoped to answer here was: if a flight is delayed by 1 minute, is it really considered delayed? What about 2 minutes? 5? 10?
print("Decile 1:", df['ArrDelayMinutes'].quantile(0.10))
print("Decile 2:", df['ArrDelayMinutes'].quantile(0.20))
print("Decile 3:", df['ArrDelayMinutes'].quantile(0.30))
print("Decile 4:", df['ArrDelayMinutes'].quantile(0.40))
print("Decile 5:", df['ArrDelayMinutes'].quantile(0.50))
print("Decile 6:", df['ArrDelayMinutes'].quantile(0.60))
print("Decile 7:", df['ArrDelayMinutes'].quantile(0.70))
print("Decile 8:", df['ArrDelayMinutes'].quantile(0.80))
print("Decile 9:", df['ArrDelayMinutes'].quantile(0.90))
print("Decile 10:", df['ArrDelayMinutes'].quantile(1.0))
Decile 1: 0.0
Decile 2: 0.0
Decile 3: 0.0
Decile 4: 0.0
Decile 5: 0.0
Decile 6: 0.0
Decile 7: 5.0
Decile 8: 15.0
Decile 9: 39.0
Decile 10: 2837.0
A 5-minute delay sits a bit above the median (the median delay is 0, and 5 minutes falls near the 70th percentile), so this will be our threshold for a delay. We then create a new Boolean column for whether the flight was delayed. Finally, we rewrite the DataFrame to a csv for convenience. This csv is titled 'PreWeatherML' because the next step is to merge this df with weather data for the 10 cities that we focused on.
df['Delayed'] = df['ArrDelayMinutes'] >= 5
df.to_csv('PreWeatherML.csv', index=False, mode='w')
Here is some additional exploration of the airlines dataset, highlighting the average arrival delay per airline per year.
#df = pd.read_csv('MLready.csv')
grouped_df = df.groupby(['Year', 'Airline'])['ArrDelayMinutes'].mean().reset_index()
sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.barplot(x='Year', y='ArrDelayMinutes', hue='Airline', data=grouped_df)
plt.title('Average Delays per Airline per Year')
plt.xlabel('Year')
plt.ylabel('Average Arrival Delay Minutes per Flight')
plt.show()
Lastly, we found some fun facts out of curiosity.
#df = pd.read_csv('PreWeatherML.csv')
#Airline with the Most Delays:
most_delayed_airline = df.groupby('Airline')['ArrDelayMinutes'].mean().idxmax()
print(f"The airline with the most delays is {most_delayed_airline}.")
#Month with the Most Delays:
month_most_delays = df.groupby('Month')['ArrDelayMinutes'].mean().idxmax()
print(f"The month with the most delays is {month_most_delays}.")
#Year with the Most Delays:
year_most_delays = df.groupby('Year')['ArrDelayMinutes'].mean().idxmax()
print(f"The year with the most delays is {year_most_delays}.")
#Airport with the Longest Average Taxi Out Time:
airport_longest_taxi_out = df.groupby('OriginAirport')['TaxiOut'].mean().idxmax()
print(f"The airport with the longest average taxi out time is {airport_longest_taxi_out}.")
#Average Delay by City:
avg_arrival_delay_by_city = df.groupby('DestCityName')['ArrDelayMinutes'].mean().idxmax()
print(f"The city with the highest average arrival delay is {avg_arrival_delay_by_city}.")
#Busiest Airport (Most Flights):
busiest_airport = df['OriginAirport'].value_counts().idxmax()
print(f"The busiest airport (with the most flights) is {busiest_airport}.")
#Month with the Highest Average Airtime:
month_highest_airtime = df.groupby('Month')['AirTime'].mean().idxmax()
print(f"The month with the highest average airtime is {month_highest_airtime}.")
#Most Common Delay Type:
most_common_delay_type = df[['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']].idxmax(axis=1).mode().values[0]
print(f"The most common delay type is {most_common_delay_type}.")
#Day of the Week with the Fewest Delays:
day_fewest_delays = df.groupby('DayOfWeek')['ArrDelayMinutes'].mean().idxmin()
print(f"The day of the week with the fewest delays is {day_fewest_delays}.")
The airline with the most delays is JetBlue Airways.
The month with the most delays is June.
The year with the most delays is 2023.
The airport with the longest average taxi out time is New York, NY: John F. Kennedy International.
The city with the highest average arrival delay is New York, NY.
The busiest airport (with the most flights) is Los Angeles, CA: Los Angeles International.
The month with the highest average airtime is December.
The most common delay type is CarrierDelay.
The day of the week with the fewest delays is Tues.
To prepare for merging, we first import the necessary libraries. We also added a line to suppress warnings to clean up the notebook.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import warnings
warnings.filterwarnings("ignore")
Our second dataset consists of daily snow and rain fall in major US cities from 2013 to 2023. Here, we have read in the file for New York City. Our goal is to use weather at the top 10 busiest airports in the US to generalize the effect that rain and snow have on airport delays in the US. In the graph below, we can see the yearly and seasonal variation in total precipitation. Orange spikes clearly indicate winters.
Also, here's a link to the September 2023 floods in NYC, which correspond to the large rain spike at the end of the data: https://en.wikipedia.org/wiki/September_2023_New_York_floods
'''imports a csv file containing daily precipitation and snow in NYC for the last 30 years;
rain and snow are measured in tenths of a mm'''
#this is a sample import and was done for all 10 cities that we focused on.
weather = pd.read_csv("/Users/juliabaratta/Desktop/weather_data/USW00094789.csv")
'''builds a list of all columns, then removes the ones we want to keep'''
columns = weather.columns.tolist()
for i in ['STATION', 'DATE', 'NAME', 'PRCP', 'SNOW']:
    columns.remove(i)
'''drops every remaining column on the list'''
weather_dropped = weather.drop(columns = columns)
'''keeps only the last ten-ish years' worth of days'''
rows = weather_dropped.tail(3967)
'''writes a file with the relevant columns and rows to a folder'''
file_path = "/Users/juliabaratta/Desktop/weather_data/newyork.csv"
rows.to_csv(file_path, index = False)
rows = pd.read_csv("/Users/juliabaratta/Desktop/weather_data/newyork.csv")
#creates a column called YEAR based on a slice of the DATE column
rows['YEAR'] = rows["DATE"].str[:4]
#sets index and plot
rows = rows.set_index('YEAR')
rows.plot()
Zoomed in, we can see the daily rain and snow in January 2023. Notably, there was no snow in NYC that month.
'''Inspects a slice of the data to get an idea of weather trends'''
#selects only Jan2023
w_Jan2023 = rows[3652:3683]
#creates a DAY column
w_Jan2023['DAY'] = w_Jan2023["DATE"].str[8:]
w_Jan2023.set_index("DAY", inplace = True)
w_Jan2023.plot()
Here we can see the correlation between delays leaving NYC and precipitation. Some spikes line up perfectly (Jan 2nd and 12th) and some are shifted by one day, which indicates that weather can cause delays the same day or the day after. To incorporate this observation, we decided to combine the precipitation and snow for the last two days into a separate variable for machine learning.
#reads in airline data for Jan of 2023
a_Jan2023 = pd.read_csv("/Users/juliabaratta/Desktop/airline_data/Jan2023.csv")
#selects only flights going to NYC
a_Jan2023NY_d = a_Jan2023[a_Jan2023['DEST_CITY_NAME'] == 'New York, NY']
file_path = "/Users/juliabaratta/Desktop/a_Jan2023NY_d.csv"
a_Jan2023NY_d.to_csv(file_path, index = False)
#selects only flights leaving NYC
a_Jan2023NY_o = a_Jan2023[a_Jan2023['ORIGIN_CITY_NAME'] == 'New York, NY']
file_path = "/Users/juliabaratta/Desktop/a_Jan2023NY_o.csv"
a_Jan2023NY_o.to_csv(file_path, index = False)
a_Jan2023NY_d = pd.read_csv("/Users/juliabaratta/Desktop/a_Jan2023NY_d.csv")
a_Jan2023NY_o = pd.read_csv("/Users/juliabaratta/Desktop/a_Jan2023NY_o.csv")
#creates a groupby object for the sum of all delayed minutes on each day of January
d_by_dowJan2023_o = (a_Jan2023NY_o.groupby('DAY_OF_MONTH')['DEP_DELAY_NEW'].sum())
d_by_dowJan2023_d = (a_Jan2023NY_d.groupby('DAY_OF_MONTH')['DEP_DELAY_NEW'].sum())
#standardizes
d_by_dowJan2023_std_o = (
(d_by_dowJan2023_o - d_by_dowJan2023_o.mean()) /
d_by_dowJan2023_o.std())
d_by_dowJan2023_std_d = (
(d_by_dowJan2023_d - d_by_dowJan2023_d.mean()) /
d_by_dowJan2023_d.std())
w_Jan2023_std = (
(w_Jan2023 - w_Jan2023.mean()) /
w_Jan2023.std())
#converts to df
w_Jan2023_std = pd.DataFrame(w_Jan2023_std)
#plots delays and precipitation
plt.plot(d_by_dowJan2023_std_o, label = 'delays leaving NYC')
plt.plot(w_Jan2023_std.index, w_Jan2023_std['PRCP'], label = 'precip')
plt.xlabel('Day of Month')
plt.legend()
plt.gca().xaxis.set_major_locator(MaxNLocator(nbins=15))
Additionally, we can see that a similar correlation exists between arrivals in NYC and precipitation.
In fact, delays leaving NYC and delays going into NYC are almost perfectly correlated; whether NYC is the origin or the destination makes little difference. Below are some additional relationships that can be derived from the weather dataset.
plt.plot(d_by_dowJan2023_std_d, label = 'delays going to NYC')
plt.plot(w_Jan2023_std.index, w_Jan2023_std['PRCP'], label = 'precip')
plt.xlabel('Day of Month')
plt.legend()
plt.gca().xaxis.set_major_locator(MaxNLocator(nbins=15))
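As a quick check of the "almost perfectly correlated" claim above, the two standardized delay series can be correlated directly. This is a minimal sketch using the Series from the previous cells; both are indexed by DAY_OF_MONTH, so they align automatically:
#Pearson correlation between standardized delay totals for flights
#leaving NYC and flights arriving in NYC in January 2023
corr = d_by_dowJan2023_std_o.corr(d_by_dowJan2023_std_d)
print(f"origin vs. destination delay correlation: {corr:.3f}")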
2016 & 2022 did not get a lot of rain.
Here's a link to a wiki page that talks about the 2016 drought in NY:
https://en.wikipedia.org/wiki/2016_New_York_drought
Here's a link to a drought declaration in the summer months of 2022:
#averages yearly precipitation in NYC
yearly_precip = rows.groupby("YEAR")["PRCP"].mean()
yearly_precip.plot()
We can see here that snow levels have been decreasing on average. This is possibly a small-scale reflection of atmospheric warming. Here's a link to a blog post about how snow in New York is decreasing over the years:
#averages yearly snow in NYC
yearly_snow = rows.groupby("YEAR")["SNOW"].mean()
yearly_snow.plot()
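To put a number on this downward trend, a least-squares line can be fit through the yearly means. This is a sketch, assuming the yearly_snow Series from the cell above and numpy imported as np (as in the first housekeeping cell):
#least-squares slope of mean daily snowfall per year (tenths of a mm per year);
#a negative slope indicates snowfall decreasing over 2013-2023
years = yearly_snow.index.astype(int)
slope, intercept = np.polyfit(years, yearly_snow.values, 1)
print(f"snowfall trend: {slope:.2f} tenths of a mm per year")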
Now that we have inspected some trends in a subset of the weather data, we move on to merging the weather and airline datasets. We chose this approach so that we can feed a single dataframe into the ML model, selecting its columns as variables. To prepare for merging, we read in the 10 weather datasets and renamed their columns to match the names in the airline dataset.
Additionally, we created two columns in the weather datasets that account for the weather over the last two days. Then, we checked for any NaN values.
#read in the '10AirlinesTransformed.csv' dataset (it provides the datetime_column used below)
#df = pd.read_csv('10AirlinesTransformed.csv')
#parses the datetime column and derives the month name from it
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df['Month'] = df['datetime_column'].dt.strftime('%B')
#reads in all the weather data for 2013 to 2023
NYW = pd.read_csv('weather_data/newyork.csv')
SEW = pd.read_csv('weather_data/seattle.csv')
CHW = pd.read_csv('weather_data/chicago.csv')
ORW = pd.read_csv('weather_data/orlando.csv')
DCW = pd.read_csv('weather_data/washingtondc.csv')
HOW = pd.read_csv('weather_data/houston.csv')
DEW = pd.read_csv('weather_data/denver.csv')
ATW = pd.read_csv('weather_data/atlanta.csv')
LAW = pd.read_csv('weather_data/losangeles.csv')
DAW = pd.read_csv('weather_data/dalas.csv')
#names the weather city so they match
NYW.NAME = 'New York, NY'
SEW.NAME = 'Seattle, WA'
CHW.NAME = 'Chicago, IL'
ORW.NAME = 'Orlando, FL'
DCW.NAME = 'Washington, DC'
HOW.NAME = 'Houston, TX'
DEW.NAME = 'Denver, CO'
ATW.NAME = 'Atlanta, GA'
LAW.NAME = 'Los Angeles, CA'
DAW.NAME = 'Dallas/Fort Worth, TX'
#makes a rain/snow in the last 2 days variable for weather
l = [NYW,SEW,CHW,ORW,DCW,HOW,DEW,ATW,LAW,DAW]
for el in l:
el['OriginSnow_Last2'] = el['SNOW'] + el['SNOW'].shift(1)
el.loc[el.index[0], 'OriginSnow_Last2'] = el.loc[el.index[0], 'SNOW']
for el in l:
el['OriginPrcp_Last2'] = el['PRCP'] + el['PRCP'].shift(1)
el.loc[el.index[0], 'OriginPrcp_Last2'] = el.loc[el.index[0], 'PRCP']
#iterates through the list of weather data to check for NaN values
for el in l:
print(el.NAME.iloc[0])
print(str(el.isna().sum().sum()))
New York, NY
0
Seattle, WA
3
Chicago, IL
0
Orlando, FL
0
Washington, DC
0
Houston, TX
0
Denver, CO
10
Atlanta, GA
0
Los Angeles, CA
7611
Dallas/Fort Worth, TX
0
Three datasets have NaN values. Two of them (SEW and DEW) have few enough that we fill them with zero. The other (LAW, for Los Angeles) has NaN for almost all SNOW values. Here's a link (https://www.laalmanac.com/weather/we17.php) to weather data reporting snow days in LA. Snow is infrequent enough there that we fill the NaN values with 0 as well.
SEW = SEW.fillna(0)
DEW = DEW.fillna(0)
LAW = LAW.fillna(0)
Next, we adjusted the columns of the 10Airlines (df) dataset to match those of the weather data. Primarily this consisted of converting month names to numbers using a month map (mm) and converting DayofMonth and Year to the right format. Then, we created a date column that matches the weather column 'DATE'.
#changes months to numerical
df['Month'] = df['Month'].astype(str)
mm = {
'January': '01',
'February': '02',
'March': '03',
'April': '04',
'May': '05',
'June': '06',
'July': '07',
'August': '08',
'September': '09',
'October': '10',
'November': '11',
'December': '12'}
df['Month'] = df['Month'].map(mm)
#converts day to an object of the right format
if df['DayofMonth'].dtype == 'int64':
df['DayofMonth'] = df['DayofMonth'].apply(lambda x: f'{x:02}')
#converts year to an object of the same format
df['Year'] = df['Year'].astype(str)
#creates a date column to match the weather data
df['DATE'] = df.Year + '-' + df.Month + '-' + df.DayofMonth
Then, we renamed the airline columns to match the weather data and concatenated every weather dataset into one large dataset titled 'concat_weather'. Finally, we can merge the two datasets. The first merge was done on 'ORIGIN' and 'DATE', adding weather data for the airport each flight departed from. Columns were then renamed to mark them as origin weather, to avoid confusion.
For the second merge, the 'ORIGIN' column of the 'concat_weather' dataset was renamed to 'DEST'. After the columns were prepped, the datasets were merged on 'DEST' and 'DATE'. This allowed the same weather data to serve as weather for both the origin and the destination.
#renames columns of df to match the weather data
df = df.rename(columns={'OriginCityName': 'ORIGIN', 'DestCityName':'DEST'})
#one big weather dataset
concat_weather = pd.concat([NYW,SEW,CHW,ORW,DCW,HOW,DEW,ATW,LAW,DAW])
#creates a new column to help with merging
concat_weather['ORIGIN'] = concat_weather['NAME']
concat_weather = concat_weather.drop(columns=['STATION','NAME'])
''' first merge is for weather for the origin airport'''
#merges datasets and renames/drops columns
df_merged = df.merge(concat_weather, on=['ORIGIN', 'DATE'])
df_merged['OriginSnow'] = df_merged['SNOW']
df_merged['OriginPrcp'] =df_merged['PRCP']
df_merged = df_merged.drop(columns = ['PRCP', 'SNOW'])
'''second merge is for weather for the destination airport'''
#edits concat_weather for second merge
concat_weather['DEST'] = concat_weather['ORIGIN']
concat_weather['DestSnow_Last2'] = concat_weather['OriginSnow_Last2']
concat_weather['DestPrcp_Last2'] = concat_weather['OriginPrcp_Last2']
concat_weather = concat_weather.drop(columns=['ORIGIN'])
concat_weather = concat_weather.drop(columns=['OriginSnow_Last2','OriginPrcp_Last2' ])
concat_weather
#merges 2nd time and renames/drops columns
df_merged1 = df_merged.merge(concat_weather, on=['DEST', 'DATE'])
df_merged1['DestSnow'] = df_merged1['SNOW']
df_merged1['DestPrcp'] =df_merged1['PRCP']
df_merged1 = df_merged1.drop(columns = ['PRCP', 'SNOW'])
The following cells check the correctness of the merge, cross-referencing the original data in each individual weather dataset for both the origin airport and the destination airport. We do this at three random places within the entire dataset.
print(df_merged1.iloc[3925350])
print(DEW[DEW['DATE'] == '2023-08-27'])
print(NYW[NYW['DATE'] == '2023-08-27'])
Year                                                  2023
DayofMonth                                              27
DayOfWeek                                              Sun
ORIGIN                                          Denver, CO
DEST                                          New York, NY
CRSDepTime                                            2338
DepTime                                             2351.0
DepDelayMinutes                                       13.0
TaxiOut                                               16.0
TaxiIn                                                 7.0
CRSArrTime                                             520
ArrTime                                              516.0
ArrDelayMinutes                                        0.0
Cancelled                                              0.0
Diverted                                               0.0
CRSElapsedTime                                       222.0
ActualElapsedTime                                    205.0
AirTime                                              182.0
Distance                                            1626.0
CarrierDelay                                           0.0
WeatherDelay                                           0.0
NASDelay                                               0.0
SecurityDelay                                          0.0
LateAircraftDelay                                      0.0
CancellationReason                                No Delay
Airline                                    JetBlue Airways
OriginAirport             Denver, CO: Denver International
DestAirport    New York, NY: John F. Kennedy International
NumericMonth                                             8
datetime_column                        2023-08-27 23:38:00
Delayed                                              False
Month                                                   08
DATE                                            2023-08-27
OriginSnow_Last2                                       0.0
OriginPrcp_Last2                                     269.0
OriginSnow                                             0.0
OriginPrcp                                             0.0
DestSnow_Last2                                         0.0
DestPrcp_Last2                                        28.0
DestSnow                                               0.0
DestPrcp                                               0.0
Name: 3925350, dtype: object
          STATION        DATE        NAME  PRCP  SNOW  OriginSnow_Last2  \
3890  USW00023062  2023-08-27  Denver, CO   0.0   0.0               0.0
      OriginPrcp_Last2
3890             269.0
          STATION        DATE          NAME  PRCP  SNOW  OriginSnow_Last2  \
3890  USW00094789  2023-08-27  New York, NY   0.0   0.0               0.0
      OriginPrcp_Last2
3890              28.0
print(df_merged1.iloc[39256])
print(ATW[ATW['DATE'] == '2013-02-25'])
print(CHW[CHW['DATE'] == '2013-02-25'])
Year                                                            2013
DayofMonth                                                        25
DayOfWeek                                                        Mon
ORIGIN                                                   Atlanta, GA
DEST                                                     Chicago, IL
CRSDepTime                                                      1204
DepTime                                                       1205.0
DepDelayMinutes                                                  1.0
TaxiOut                                                         12.0
TaxiIn                                                           4.0
CRSArrTime                                                      1304
ArrTime                                                       1250.0
ArrDelayMinutes                                                  0.0
Cancelled                                                        0.0
Diverted                                                         0.0
CRSElapsedTime                                                 120.0
ActualElapsedTime                                              105.0
AirTime                                                         89.0
Distance                                                       606.0
CarrierDelay                                                     0.0
WeatherDelay                                                     0.0
NASDelay                                                         0.0
SecurityDelay                                                    0.0
LateAircraftDelay                                                0.0
CancellationReason                                          No Delay
Airline                                           Mesa Airlines Inc.
OriginAirport      Atlanta, GA: Hartsfield-Jackson Atlanta Intern...
DestAirport               Chicago, IL: Chicago O'Hare International
NumericMonth                                                       2
datetime_column                                  2013-02-25 12:04:00
Delayed                                                        False
Month                                                             02
DATE                                                      2013-02-25
OriginSnow_Last2                                                 0.0
OriginPrcp_Last2                                                10.0
OriginSnow                                                       0.0
OriginPrcp                                                      10.0
DestSnow_Last2                                                   0.0
DestPrcp_Last2                                                   0.0
DestSnow                                                         0.0
DestPrcp                                                         0.0
Name: 39256, dtype: object
        STATION        DATE         NAME  PRCP  SNOW  OriginSnow_Last2  \
55  USW00013874  2013-02-25  Atlanta, GA  10.0   0.0               0.0
    OriginPrcp_Last2
55              10.0
        STATION        DATE         NAME  PRCP  SNOW  OriginSnow_Last2  \
55  USW00094846  2013-02-25  Chicago, IL   0.0   0.0               0.0
    OriginPrcp_Last2
55               0.0
print(df_merged1.iloc[516])
print(LAW[LAW['DATE'] == '2013-01-17'])
print(DEW[DEW['DATE'] == '2013-01-17'])
Year                                                   2013
DayofMonth                                               17
DayOfWeek                                             Thurs
ORIGIN                                      Los Angeles, CA
DEST                                             Denver, CO
CRSDepTime                                             1752
DepTime                                              1745.0
DepDelayMinutes                                         0.0
TaxiOut                                                12.0
TaxiIn                                                  7.0
CRSArrTime                                             2105
ArrTime                                              2056.0
ArrDelayMinutes                                         0.0
Cancelled                                               0.0
Diverted                                                0.0
CRSElapsedTime                                        133.0
ActualElapsedTime                                     131.0
AirTime                                               112.0
Distance                                              862.0
CarrierDelay                                            0.0
WeatherDelay                                            0.0
NASDelay                                                0.0
SecurityDelay                                           0.0
LateAircraftDelay                                       0.0
CancellationReason                                 No Delay
Airline                               United Air Lines Inc.
OriginAirport   Los Angeles, CA: Los Angeles International
DestAirport                Denver, CO: Denver International
NumericMonth                                              1
datetime_column                         2013-01-17 17:52:00
Delayed                                               False
Month                                                    01
DATE                                             2013-01-17
OriginSnow_Last2                                        0.0
OriginPrcp_Last2                                        0.0
OriginSnow                                              0.0
OriginPrcp                                              0.0
DestSnow_Last2                                          0.0
DestPrcp_Last2                                          0.0
DestSnow                                                0.0
DestPrcp                                                0.0
Name: 516, dtype: object
        STATION        DATE             NAME  PRCP  SNOW  OriginSnow_Last2  \
16  USW00023174  2013-01-17  Los Angeles, CA   0.0   0.0               0.0
    OriginPrcp_Last2
16               0.0
        STATION        DATE        NAME  PRCP  SNOW  OriginSnow_Last2  \
16  USW00023062  2013-01-17  Denver, CO   0.0   0.0               0.0
    OriginPrcp_Last2
16               0.0
Finally, we set the index to the datetime column and write the merged data to one large final csv file, which acts as the input for our machine learning model.
#sets index to datetime column and prints out file to be inputted in ML
df_merged1 = df_merged1.set_index('datetime_column').sort_index()
file_path = "/Users/juliabaratta/Desktop/MLready.csv"
df_merged1.to_csv(file_path, index = False)
To prepare for machine learning, we first import the necessary libraries, including a OneHotEncoder, ColumnTransformer, and StandardScaler to dummy-encode and scale the data within the pipeline. Then, we import the RandomForestRegressor to act as an extension of the decision-tree learning method that we learned in class. Finally, we import mean_absolute_error and r2_score to evaluate the effectiveness of the model.
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
Here, the cleaned and merged dataset 'MLReady.csv' gets read in. Relevant features are defined, as well as the variable we will be predicting. Since we are quantitatively predicting the delay time in minutes, we chose a regressor.
It's important to note here that we do include DepDelayMinutes as a variable in the regression model because it is data that a flyer would know before the flight takes off. We envision this being useful because a flight is often delayed incrementally: if a flyer first hears that their flight has been delayed 15 minutes, they might want to know the likelihood that the flight will be delayed further based on other factors about the flight.
df = pd.read_csv('MLReady.csv')
selected_columns = ['Year', 'DayofMonth', 'Month', 'CRSDepTime', 'DepDelayMinutes',
'CRSArrTime', 'Distance', 'OriginSnow_Last2', 'OriginPrcp_Last2', 'OriginSnow',
'OriginPrcp', 'DestSnow_Last2', 'DestPrcp_Last2', 'DestSnow', 'DestPrcp',
'OriginAirport', 'DestAirport', 'DayOfWeek', 'Airline', 'ArrDelayMinutes']
#Excludes other columns
df = df[selected_columns]
We then define the categorical and numerical features of the model and create lists to store each year's evaluation metrics.
categorical_features = ['OriginAirport', 'DestAirport', 'DayOfWeek', 'Airline']
numerical_features = df.drop(columns=['ArrDelayMinutes']).columns.difference(categorical_features)
mae_list = []
r2_list = []
We transformed the categorical variables using one-hot encoding. We also defined the scaler for the pipeline (Z-standardizing) and the preprocessor that handles both the numerical and categorical features, then created the model.
In order to account for trends in the previous year and predict the delays of the next, we implemented a for loop that iterates over each year of the dataset, trains a model on that year, and then tests that model's accuracy on the following year; in effect, a rolling (walk-forward) validation. The loop prints the error for each iteration and, once it terminates, the averages.
for year in range(2013, 2023):
#creates a training dataset that is one year
train_data = df[df['Year'] == year]
#creates a testing dataset that is the next year
test_data = df[df['Year'] == (year + 1)]
#makes y and X for training
X_train = train_data.drop(columns=['ArrDelayMinutes'])
y_train = train_data['ArrDelayMinutes']
#makes y and X for testing
X_test = test_data.drop(columns=['ArrDelayMinutes'])
y_test = test_data['ArrDelayMinutes']
#creates instances of both a numerical and categorical transformer
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)
#creates a preprocessor to deal with transforming both the num and cat columns
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])
#creates a pipeline for the model
model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=10, random_state=42))])
#fits model to the years training data
model.fit(X_train, y_train)
#uses this fit to predict on the test data
y_pred = model.predict(X_test)
#evaluates the success of that prediction
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae_list.append(mae)
r2_list.append(r2)
print(f'Training on {year}, Testing on {year + 1} - Mean Absolute Error: {mae}, R-squared: {r2}')
#average efficacy of the model
print(f'Average Mean Absolute Error: {sum(mae_list) / len(mae_list)}')
print(f'Average R-squared: {sum(r2_list) / len(r2_list)}')
Training on 2013, Testing on 2014 - Mean Absolute Error: 6.487981144083558, R-squared: 0.9132239974039758
Training on 2014, Testing on 2015 - Mean Absolute Error: 6.812233708127388, R-squared: 0.914890110438901
Training on 2015, Testing on 2016 - Mean Absolute Error: 5.903338926832098, R-squared: 0.9279673615797543
Training on 2016, Testing on 2017 - Mean Absolute Error: 6.058833778954674, R-squared: 0.9236424608313948
Training on 2017, Testing on 2018 - Mean Absolute Error: 6.42770809330232, R-squared: 0.9233149913249296
Training on 2018, Testing on 2019 - Mean Absolute Error: 6.63172692814846, R-squared: 0.9363628946806113
Training on 2019, Testing on 2020 - Mean Absolute Error: 4.971198419998488, R-squared: 0.916248103806017
Training on 2020, Testing on 2021 - Mean Absolute Error: 5.313820978632428, R-squared: 0.9451646019900752
Training on 2021, Testing on 2022 - Mean Absolute Error: 6.2656328518457025, R-squared: 0.9516300949355766
Training on 2022, Testing on 2023 - Mean Absolute Error: 7.222455108552838, R-squared: 0.9558847170121954
Average Mean Absolute Error: 6.209492993847795
Average R-squared: 0.930832933400343
In the model, "n_estimators = 10" can be increased for a more accurate depiction. However, due to the size of our data, n=10 will take 20+ minutes to load on an average computer so we figured this to be sufficient.
Below is a visualization of the MAE and R^2 values of our model evaluated every year.
# Years for training and testing
years_train = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
mae_values = [6.487981144083558, 6.812233708127388, 5.903338926832098, 6.058833778954674, 6.42770809330232, 6.63172692814846, 4.971198419998488, 5.313820978632428, 6.2656328518457025, 7.222455108552838]
r_squared_values = [0.9132239974039758, 0.914890110438901, 0.9279673615797543, 0.9236424608313948, 0.9233149913249296, 0.9363628946806113, 0.916248103806017, 0.9451646019900752, 0.9516300949355766, 0.9558847170121954]
# Plot MAE
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(years_train, mae_values, marker='o', linestyle='-', color='b')
plt.title('Mean Absolute Error (MAE)')
plt.xlabel('Training Year')
plt.ylabel('MAE')
plt.ylim(0, 10) # Set y-axis limits
# Plot R-squared
plt.subplot(1, 2, 2)
plt.plot(years_train, r_squared_values, marker='o', linestyle='-', color='r')
plt.title('R-squared')
plt.xlabel('Training Year')
plt.ylabel('R-squared')
plt.ylim(0, 1) # Set y-axis limits
plt.tight_layout()
plt.show()
As we can see, the results are consistently accurate: the MAE stays between roughly 5 and 7 minutes, and R-squared stays above 0.91 for every train/test pair.
To further test the model, we will evaluate two real flights (both of our flights home for winter break). Because 2023 is an incomplete year, we will train on 2022. Below you will find the inputs for our flights back home, which we turn into separate DataFrames.
sean_flight_data = {
'Year': 2023,
'DayofMonth': 13,
'Month': 12,
'CRSDepTime': 1759,
'DepDelayMinutes': 0,
'CRSArrTime': 2031,
'Distance': 1946,
'OriginSnow_Last2': 0,
'OriginPrcp_Last2': 0,
'OriginSnow': 0,
'OriginPrcp': 0,
'DestSnow_Last2': 0,
'DestPrcp_Last2': 0,
'DestSnow': 0,
'DestPrcp': 0,
'OriginAirport': 'Atlanta, GA: Hartsfield-Jackson Atlanta International',
'DestAirport': 'Los Angeles, CA: Los Angeles International',
'DayOfWeek': 'Wed',
'Airline': 'Delta Air Lines Inc.'
}
sean_flight_df = pd.DataFrame([sean_flight_data])
julia_flight_data = {
'Year': 2023,
'DayofMonth': 17,
'Month': 12,
'CRSDepTime': 932,
'DepDelayMinutes': 0,
'CRSArrTime': 1315,
'Distance': 541.79,
'OriginSnow_Last2': 0,
'OriginPrcp_Last2': 0,
'OriginSnow': 0,
'OriginPrcp': 0,
'DestSnow_Last2': 0,
'DestPrcp_Last2': 1955,
'DestSnow': 0,
'DestPrcp': 1955,
'OriginAirport': 'Atlanta, GA: Hartsfield-Jackson Atlanta International', #full BTS name, so it matches a category the encoder saw in training
'DestAirport': 'Washington, DC: Washington Dulles International',
'DayOfWeek': 'Sun',
'Airline': 'Frontier Airlines Inc.'
}
julia_flight_df = pd.DataFrame([julia_flight_data])
We will use the same model as before. Here we can see that Sean's flight was predicted to be delayed about 3 minutes and Julia's about 11 minutes. We attribute Julia's larger delay to the rain that was forecast for Washington, DC on the day of travel (determined through a weather app).
Notably: because New Orleans and Philadelphia (Julia's actual origin and destination) were not cities we considered, we replaced them with Atlanta and Washington, DC respectively (ATL and DC weather were used as well, to keep the inputs consistent). This highlights one limitation of our model: it only covers the 10 most popular airports.
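This substitution only works if the strings match the training categories exactly: with handle_unknown='ignore', the OneHotEncoder encodes any label it never saw during fitting as all zeros, so the model receives no airport signal at all for it. Here is a minimal standalone sketch with hypothetical airport labels:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit([['ATL'], ['JFK'], ['LAX']])
#'MSY' (New Orleans) was never seen during fitting, so it encodes to all zeros
print(enc.transform([['LAX'], ['MSY']]))
#[[0. 0. 1.]
# [0. 0. 0.]]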
#df = pd.read_csv('MLReady.csv')
#relevant columns
selected_columns = ['Year', 'DayofMonth', 'Month', 'CRSDepTime', 'DepDelayMinutes',
'CRSArrTime', 'Distance', 'OriginSnow_Last2', 'OriginPrcp_Last2', 'OriginSnow',
'OriginPrcp', 'DestSnow_Last2', 'DestPrcp_Last2', 'DestSnow', 'DestPrcp',
'OriginAirport', 'DestAirport', 'DayOfWeek', 'Airline', 'ArrDelayMinutes']
df = df[selected_columns]
#transforming the columns
categorical_features = ['OriginAirport', 'DestAirport', 'DayOfWeek', 'Airline']
numerical_features = df.drop(columns=['ArrDelayMinutes']).columns.difference(categorical_features)
#training on 2022
train_data = df[df['Year'] == 2022]
X_train = train_data.drop(columns=['ArrDelayMinutes'])
y_train = train_data['ArrDelayMinutes']
#preprocessing
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
])
#initiating the model
model = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=10, random_state=42))])
#fitting the model on the training data (2022)
model.fit(X_train, y_train)
#printing the result
julia_delay_prediction = model.predict(julia_flight_df)
print(f"Julia's Flight Delay Prediction: {julia_delay_prediction[0]} minutes")
sean_delay_prediction = model.predict(sean_flight_df)
print(f"Sean's Flight Delay Prediction: {sean_delay_prediction[0]} minutes")
Julia's Flight Delay Prediction: 10.6 minutes
Sean's Flight Delay Prediction: 3.1 minutes
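Following the incremental-delay scenario described earlier, the same model can be re-queried with an announced departure delay. This is a sketch, assuming the sean_flight_df and fitted model from the cells above; the 15-minute figure is hypothetical:
#re-predicts Sean's flight as if the airline had already announced
#a 15-minute departure delay
sean_delayed = sean_flight_df.copy()
sean_delayed['DepDelayMinutes'] = 15
print(f"Sean's predicted arrival delay, given a 15-minute departure delay: {model.predict(sean_delayed)[0]:.1f} minutes")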