
How to find the closest match based on 2 keys from one dataframe to another?

I have 2 dataframes I'm working with. One has a bunch of locations and their coordinates (longitude, latitude). The other is a weather data set with data from weather stations all over the world and their respective coordinates. I am trying to link the nearest weather station to each location in my data set. The weather station names and my location names do not match.

I am left trying to link them by closest match in coordinates and have no idea where to begin.

I was thinking of using something like

np.abs((location['latitude']-weather['latitude'])+(location['longitude']-weather['longitude']))

Examples of each

location...

Location   Latitude   Longitude Component
     A  39.463744  -76.119411    Active
     B  39.029252  -76.964251    Active
     C  33.626946  -85.969576    Active
     D  49.286337   10.567013    Active
     E  37.071777  -76.360785    Active

weather...

     Station Code             Station Name  Latitude  Longitude
     US1FLSL0019    PORT ST. LUCIE 4.0 NE   27.3237   -80.3111
     US1TXTV0133            LAKEWAY 2.8 W   30.3597   -98.0252
     USC00178998                  WALTHAM   44.6917   -68.3475
     USC00178998                  WALTHAM   44.6917   -68.3475
     USC00178998                  WALTHAM   44.6917   -68.3475

The output would be a new column on the location dataframe with the name of the closest station.

However, I am not sure how to loop through both dataframes to accomplish this. Any help would be greatly appreciated.

Thanks, Scott

sokeefe1014 asked Apr 25 '16 14:04


1 Answer

Let's say you have a distance function dist that you want to minimize:

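import numpy as np
def dist(lat1, long1, lat2, long2):
    # Take the absolute value of each difference so they cannot cancel out
    return np.abs(lat1 - lat2) + np.abs(long1 - long2)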
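Comparing raw coordinate differences treats a degree of longitude the same as a degree of latitude, which becomes less accurate away from the equator. If that matters for your data, a great-circle (haversine) distance may be a better fit. The following is only a minimal sketch that takes the same arguments as dist; the name haversine_km and the mean Earth radius of 6371 km are assumptions for illustration, not part of the original answer.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, long1, lat2, long2):
    """Great-circle distance in km between points given in decimal degrees."""
    # Convert all four inputs from degrees to radians
    lat1, long1, lat2, long2 = map(np.radians, (lat1, long1, lat2, long2))
    dlat = lat2 - lat1
    dlong = long2 - long1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlong / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Because it has the same signature, it could be swapped in wherever dist is called.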

For a given location, you can find the nearest station as follows:

lat = 39.463744
long = -76.119411
weather.apply(
    lambda row: dist(lat, long, row['Latitude'], row['Longitude']), 
    axis=1)

This will calculate the distance to all weather stations. Using idxmin you can find the closest station name:

distances = weather.apply(
    lambda row: dist(lat, long, row['Latitude'], row['Longitude']), 
    axis=1)
weather.loc[distances.idxmin(), 'Station Name']
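To make the idxmin step concrete, here is a small sketch with made-up distances and shortened station names (both hypothetical). idxmin returns the index label of the smallest value, and when several rows tie, as the repeated WALTHAM rows in the question would, it returns the first one.

```python
import pandas as pd

# Hypothetical distances; rows 1 and 2 tie, like the repeated WALTHAM rows
distances = pd.Series([16.3, 13.0, 13.0, 31.0])
names = pd.Series(["PORT ST. LUCIE", "WALTHAM", "WALTHAM", "LAKEWAY"])

closest = distances.idxmin()  # index label of the first minimum
print(closest, names.loc[closest])
```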

Let's put all this in a function:

def find_station(lat, long):
    distances = weather.apply(
        lambda row: dist(lat, long, row['Latitude'], row['Longitude']), 
        axis=1)
    return weather.loc[distances.idxmin(), 'Station Name']

You can now get all the nearest stations by applying it to the locations dataframe:

locations.apply(
    lambda row: find_station(row['Latitude'], row['Longitude']), 
    axis=1)

Output:

0                  WALTHAM
1                  WALTHAM
2    PORT ST. LUCIE 4.0 NE
3                  WALTHAM
4    PORT ST. LUCIE 4.0 NE
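Nested apply calls evaluate the distance one pair at a time, which can get slow with many stations. As a possible alternative, the sketch below computes all pairwise distances at once with NumPy broadcasting and stores the result in a new column, as the question asks. It rebuilds the sample rows from the question; the column name "Nearest Station" is an assumption, and the metric is the sum of absolute coordinate differences.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
locations = pd.DataFrame({
    "Location": ["A", "B", "C", "D", "E"],
    "Latitude": [39.463744, 39.029252, 33.626946, 49.286337, 37.071777],
    "Longitude": [-76.119411, -76.964251, -85.969576, 10.567013, -76.360785],
})
weather = pd.DataFrame({
    "Station Name": ["PORT ST. LUCIE 4.0 NE", "LAKEWAY 2.8 W",
                     "WALTHAM", "WALTHAM", "WALTHAM"],
    "Latitude": [27.3237, 30.3597, 44.6917, 44.6917, 44.6917],
    "Longitude": [-80.3111, -98.0252, -68.3475, -68.3475, -68.3475],
})

# (n_locations, 1) minus (n_stations,) broadcasts to (n_locations, n_stations)
dlat = locations["Latitude"].to_numpy()[:, None] - weather["Latitude"].to_numpy()
dlong = locations["Longitude"].to_numpy()[:, None] - weather["Longitude"].to_numpy()
distances = np.abs(dlat) + np.abs(dlong)

# Position of the closest station for each location (first one on ties)
nearest = distances.argmin(axis=1)
locations["Nearest Station"] = weather["Station Name"].to_numpy()[nearest]
print(locations[["Location", "Nearest Station"]])
```

The distance matrix has one entry per location/station pair, so for a very large station list it may be worth processing the locations in chunks.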
IanS answered Nov 12 '22 04:11