Anyone can just google the nearest airport of any area. But can you do that for more than 40,000 locations? Here's where Python can help. Sample data covers all airports and zipcodes from the United States.
Python is useful not just for building web applications and automating workflows, but also for gaining business insights.
I work in analytics for Sales. Part of the job is to help sales teams find strategic locations to focus their selling energies on.
One criterion of our search is accessibility: can a sales rep travel to a target location with ease? To find out, I calculated the nearest airport to each US zipcode.
The idea is: the location is more accessible if there's an airport nearby.
Overview
I created a Python script that calculates the nearest airports of all 40,943 US zipcodes using airport and zipcode data that are available for public use.
I used the Haversine formula to calculate the distance between each zipcode and each airport, then picked the shortest one.
Data Sources
- World Airports - I got my data from OurAirports.com
- Zipcodes - This dataset is the most recent one I found: AggData.com
Process
Here's the Git repository: Github
But to understand my logic, read on...
- Required Python Packages
- Step 1. Clean Data Sources
- Step 2. Calculate Nearest Airport by Zipcode
- Step 3. Loop calculation for all Zipcodes
- Step 4. Run script
- Finale: Code base (Pastebin) - Go here if you just want to play with the code on your own!
- Sample Data - Find out the nearest airport of Beverly Hills, LA!
Required Python Packages
import os
import numpy as np
import pandas as pd
from math import cos, asin, sqrt
import csv
from pathlib import Path
from timeit import default_timer as timer
from datetime import datetime
I mainly use numpy and pandas to clean and filter the data sources. The math package handles the actual calculation; os, csv and pathlib are used to write the output, and the rest are for logging.
Step 1: Clean Data Sources
For data cleaning, I used pandas to do the following:
- Create two dataframes, one for each data source
- Remove unnecessary columns
- Only include large, medium and small airports that are in the United States (Assigned the filtered result to another dataframe)
- Create a new column (iso_state) based on an existing column (iso_region).
df_airports = pd.read_csv(FILE_AIRPORTS, encoding="ISO-8859-1")
columns_to_drop = ['elevation_ft', 'scheduled_service', 'gps_code',
                   'home_link', 'wikipedia_link', 'keywords', 'score',
                   'last_updated']
df_airports.drop(columns_to_drop, axis=1, inplace=True)
airport_types = ['large_airport', 'medium_airport', 'small_airport']
df_airports_filter = df_airports[(df_airports['iso_country'] == 'US') &
                                 (df_airports['type'].isin(airport_types))]
df_airports_filter = df_airports_filter.copy()
df_airports_filter.loc[:, 'iso_state'] = df_airports_filter['iso_region'].str.split('-').str[1]
df_zipcodes = pd.read_csv(FILE_ZIPCODES, encoding="ISO-8859-1")
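To make the filter and the derived state column concrete, here is a minimal sketch on a tiny in-memory dataframe. The four rows and their identifiers are made up for illustration; only the column names are taken from the code above.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the airports file.
sample = pd.DataFrame({
    "ident": ["AP1", "AP2", "AP3", "AP4"],
    "type": ["small_airport", "medium_airport", "large_airport", "heliport"],
    "iso_country": ["US", "US", "GB", "US"],
    "iso_region": ["US-CA", "US-CA", "GB-ENG", "US-TX"],
})

airport_types = ["large_airport", "medium_airport", "small_airport"]
is_us = sample["iso_country"] == "US"
is_airport = sample["type"].isin(airport_types)

# .copy() gives an independent dataframe, so adding a column is safe.
sample_filter = sample[is_us & is_airport].copy()

# "US-CA" -> ["US", "CA"] -> "CA"
sample_filter["iso_state"] = sample_filter["iso_region"].str.split("-").str[1]

print(sample_filter[["ident", "iso_state"]])
```

The non-US row and the heliport are dropped, and the remaining rows get the two-letter state code.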
I also created helper methods to retrieve data that I will need later:
def getAllStates():
    return df_airports_filter['iso_state'].unique()

def getAirports(state):
    df = df_airports_filter[df_airports_filter['iso_state']==state]
    return df.to_dict('records')

def getZipcodes(state):
    df = df_zipcodes[(df_zipcodes['State Abbreviation']==state)]
    return df.to_dict('records')

def getInfo(state):
    print(len(getAirports(state)), "airports in", state)
    print(len(getZipcodes(state)), "zipcodes in", state)
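Each call to getAirports(state) filters the whole dataframe again. One possible alternative, sketched below on made-up rows, is to group the dataframe once and keep a dictionary of records per state. The name airports_by_state and the sample values are assumptions for illustration and are not part of the original script.

```python
import pandas as pd

# Made-up rows standing in for df_airports_filter.
df = pd.DataFrame({
    "ident": ["AP1", "AP2", "AP3"],
    "iso_state": ["CA", "CA", "TX"],
    "latitude_deg": [34.0, 37.6, 29.9],
    "longitude_deg": [-118.4, -122.4, -95.3],
})

# Group once; each value has the same shape as the list getAirports(state) returns.
airports_by_state = {
    state: group.to_dict("records")
    for state, group in df.groupby("iso_state")
}

print(len(airports_by_state["CA"]), "airports in CA")
print(airports_by_state["TX"][0]["ident"])
```

Whether this is worth it depends on how many times the lookup runs; for a single state the original helpers are perfectly adequate.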
Step 2: Calculate Nearest Airport by Zipcode
The Haversine formula is one way of calculating the great-circle distance between two points given as latitude and longitude: here, the coordinates of the zipcode and of the airport.
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # Pi/180, converts degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))  # 2*R*asin(sqrt(a)), with R = 6371 km
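For reference, the expression in distance is the usual haversine formula with the squared sines rewritten through the identity sin²(x/2) = (1 − cos x)/2. Writing φ for latitude and λ for longitude (both in radians), and taking R = 6371 km, which is the Earth radius implied by the constant 12742 = 2R:

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2-\varphi_1}{2}\right)
   + \cos\varphi_1\,\cos\varphi_2\,\sin^2\!\left(\frac{\lambda_2-\lambda_1}{2}\right) \\
  &= \frac{1-\cos(\varphi_2-\varphi_1)}{2}
   + \cos\varphi_1\,\cos\varphi_2\,\frac{1-\cos(\lambda_2-\lambda_1)}{2}, \\
d &= 2R\,\arcsin\!\left(\sqrt{a}\right), \qquad R = 6371\ \text{km}.
\end{aligned}
```

The second line of a is exactly what the code computes, term by term.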
For each zipcode, the script calculates its distance to every airport in the same state, out of 14,693 airports in the US. To get the nearest airport, here's the method:
def closest(data, zipcode):
    dl = []
    for p in data:
        ap = {
            'zipcode': zipcode['Zip Code'],
            'country': zipcode['Country'],
            'state': zipcode['State Abbreviation'],
            'state_full': zipcode['State'],
            'county': zipcode['County'],
            'latitude-zip': zipcode['Latitude'],
            'longitude-zip': zipcode['Longitude'],
            'nearest-airport': p['ident'],
            'latitude-air': p['latitude_deg'],
            'longitude-air': p['longitude_deg'],
            'distance': distance(zipcode['Latitude'],zipcode['Longitude'],p['latitude_deg'],p['longitude_deg'])
        }
        dl.append(ap)
    dl_sorted = sorted(dl, key=lambda k: k['distance'])
    writeZipsToCSV(dl_sorted,zipcode['State Abbreviation'],zipcode['Zip Code'])
    return dl_sorted[0]
The closest method returns the calculation with the shortest distance (return dl_sorted[0]). Before returning, it also calls writeZipsToCSV(dl_sorted, zipcode['State Abbreviation'], zipcode['Zip Code']).
To validate my assumption, I also opted to write all the calculated airport distances of each zipcode to a CSV file:
def writeZipsToCSV(dl_sorted, state, zipcode):
    # One folder per state, one file per zipcode
    output_folder = "Output/" + state + "/"
    os.makedirs(output_folder, exist_ok=True)
    # newline="" keeps the csv module from writing blank rows on Windows
    with open(output_folder + str(zipcode) + "_all airports.csv", "w", newline="") as csv_file:
        dict_writer = csv.DictWriter(csv_file, dl_sorted[0].keys())
        dict_writer.writeheader()
        dict_writer.writerows(dl_sorted)
    # No explicit close() needed: the with block closes the file
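Sorting every airport for every zipcode is useful for the validation files, but when only the nearest airport is needed, the same formula can be evaluated for all airports at once with NumPy. This is a sketch and not the script's method: haversine_km and nearest_airport_index are hypothetical helpers, the 90210, KSMO and KBUR coordinates are the ones from the sample data in this post, and FAR1 is a made-up far-away point.

```python
import numpy as np

EARTH_DIAMETER_KM = 12742  # 2 * 6371 km, same constant as in distance()

def haversine_km(lat, lon, lats, lons):
    """Distance in km from one point to arrays of points (inputs in degrees)."""
    lat, lon = np.radians(lat), np.radians(lon)
    lats, lons = np.radians(lats), np.radians(lons)
    a = (np.sin((lats - lat) / 2) ** 2
         + np.cos(lat) * np.cos(lats) * np.sin((lons - lon) / 2) ** 2)
    return EARTH_DIAMETER_KM * np.arcsin(np.sqrt(a))

def nearest_airport_index(lat, lon, lats, lons):
    """Hypothetical helper: index and distance (km) of the closest airport."""
    d = haversine_km(lat, lon, np.asarray(lats), np.asarray(lons))
    i = int(np.argmin(d))
    return i, float(d[i])

idents = ["KBUR", "KSMO", "FAR1"]  # FAR1 is a made-up point
lats = [34.20069885, 34.01580048, 40.0]
lons = [-118.3590012, -118.4509964, -100.0]

i, km = nearest_airport_index(34.0901, -118.4065, lats, lons)
print(idents[i], round(km, 2))
```

np.argmin finds the minimum in a single pass, so no sorted list is built; the trade-off is that the full per-zipcode ranking is no longer available for validation.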
Step 3: Loop calculation for all zipcodes in a state
For each state, the script does the following:
- Retrieve its zipcodes from the declared dataframe: zipcodes = getZipcodes(state)
- Retrieve the nearest airport for each zipcode as a dictionary: dicts.append(closest(getAirports(state), zc))
- Write these dictionaries to a csv file.
entries = 0
i = datetime.now()
timestamp = i.strftime('%Y-%m%d-')

def calculateNearestAirport(state):
    global entries
    zipcodes = []  # defined up front so the finally block can always use it
    try:
        zipcodes = getZipcodes(state)
        dicts = []
        print("Calculating for", state, "with", len(zipcodes), "zipcodes...")
        for zc in zipcodes:
            dicts.append(closest(getAirports(state), zc))
        # newline="" keeps the csv module from writing blank rows on Windows
        with open("Output/" + timestamp + state + "_nearest_airport.csv", "w", newline="") as csv_file:
            dict_writer = csv.DictWriter(csv_file, dicts[0].keys())
            dict_writer.writeheader()
            dict_writer.writerows(dicts)
    finally:
        entries = entries + len(zipcodes)
        print("Done calculating for", len(zipcodes), "zipcodes of", state)
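Since pandas is already imported, the list of result dictionaries could also be written with DataFrame.to_csv instead of csv.DictWriter. A minimal sketch, assuming made-up result rows (only the 90210 row uses values from this post) and a temporary folder in place of Output/:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Rows in the shape returned by closest(); the second row is made up.
dicts = [
    {"zipcode": 90210, "state": "CA", "nearest-airport": "KSMO", "distance": 9.22},
    {"zipcode": 99999, "state": "CA", "nearest-airport": "AP1", "distance": 5.0},
]

with tempfile.TemporaryDirectory() as tmp:  # stand-in for "Output/"
    output_file = Path(tmp) / "CA_nearest_airport.csv"

    # index=False keeps the dataframe's row numbers out of the file.
    pd.DataFrame(dicts).to_csv(output_file, index=False)

    # Read the file back to confirm what was written.
    written = pd.read_csv(output_file)

print(written)
```

The column order follows the key order of the dictionaries, which matches what DictWriter does with dicts[0].keys().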
Step 4: Run script in terminal
The first line defines which states are included in the calculation. For this example, the script only includes the first entry, which is California.
I also included the timer() function to measure performance.
states_scope = getAllStates()[0:1]
perf_time = []
try:
    start = timer()
    print("Calculating for the following states: ")
    for x in states_scope:
        print(x)
    for state in states_scope:
        start_state = timer()
        calculateNearestAirport(state)
        end_state = timer()
        diff = end_state - start_state
        time_state = {
            'state': state,
            'duration': round(diff / 60, 3)  # minutes
        }
        perf_time.append(time_state)
finally:
    end = timer()
    print(round((end - start) / 60, 3), "minutes")
    print(len(perf_time), "states")
    print(entries, "zipcodes")
    for k in perf_time:
        print(k)
Here's what it looks like in the terminal:
It took the script around 1 minute to calculate for 1 state with 2590 zipcodes. Not bad, compared to googling those zipcodes one by one!
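The start/stop bookkeeping around timer() could also be wrapped in a small context manager. This is only a sketch: timed is a hypothetical helper, and the sleep call stands in for calculateNearestAirport(state).

```python
import time
from contextlib import contextmanager
from timeit import default_timer as timer

@contextmanager
def timed(label, results):
    """Append {'state': label, 'duration': minutes} to results on exit."""
    start = timer()
    try:
        yield
    finally:
        # Recorded even if the body raises an exception.
        minutes = round((timer() - start) / 60, 3)
        results.append({"state": label, "duration": minutes})

perf_time = []
for state in ["CA"]:
    with timed(state, perf_time):
        time.sleep(0.01)  # placeholder for calculateNearestAirport(state)

for entry in perf_time:
    print(entry)
```

The entries have the same keys as time_state in the script, so the final reporting loop would not need to change.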
Finale: Code base
Remember to update the inputs to your folder destination:
Sample Data: Beverly Hills, LA
Now, let's apply our script to actual business data.
Out of 558 airports in California, what's the nearest airport to Beverly Hills, LA?
zipcode | country | state | state_full | county | latitude-zip | longitude-zip |
90210 | US | CA | California | Los Angeles | 34.0901 | -118.4065 |
Based on the script, the nearest airport to 90210 is: Santa Monica Municipal Airport (KSMO)
nearest-airport | latitude-air | longitude-air | distance (km) |
KSMO | 34.01580048 | -118.4509964 | 9.222835746 |
Let's validate the model by plotting in Google Maps:
The black line indicates a distance of 9.20 km from 90210 to the airport, which is close to the calculated 9.22 km!
Note that the formula doesn't consider the actual roads in the location. Haversine simply calculates the distance from point A to point B.
Now, here's the second nearest airport: Bob Hope Airport (KBUR)
nearest-airport | latitude-air | longitude-air | distance (km) |
KBUR | 34.20069885 | -118.3590012 | 13.05176636 |
- Distance based on Haversine: 13.05 km
- Distance based on Google Maps: 13.03 km
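Both Haversine figures can be reproduced by calling the distance function from Step 2 on the coordinates listed in the tables. The snippet repeats the function so that it runs on its own; the names zip_90210, airports and results are just for this illustration.

```python
from math import asin, cos, sqrt

def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = (0.5 - cos((lat2 - lat1) * p) / 2
         + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a))  # 2 * R * asin(sqrt(a)), R = 6371 km

# Coordinates copied from the tables in this section.
zip_90210 = (34.0901, -118.4065)
airports = {
    "KSMO": (34.01580048, -118.4509964),
    "KBUR": (34.20069885, -118.3590012),
}

results = {
    ident: distance(*zip_90210, *coords)
    for ident, coords in airports.items()
}
for ident, km in results.items():
    print(ident, round(km, 2))
```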