I have a DataFrame that contains places with their latitude and longitude, for example cities.
import pandas as pd

df = pd.DataFrame([{'city': "Berlin", 'lat': 52.5243700, 'lng': 13.4105300},
                   {'city': "Potsdam", 'lat': 52.3988600, 'lng': 13.0656600},
                   {'city': "Hamburg", 'lat': 53.5753200, 'lng': 10.0153400}])
Now I'm trying to get all cities within a given radius of each other city: say, all cities within 500 km of Berlin, within 500 km of Hamburg, and so on. I would do this by duplicating the original DataFrame and joining the two copies with a distance function.
The intermediate result would be somewhat like this:
Berlin --> Potsdam
Berlin --> Hamburg
Potsdam --> Berlin
Potsdam --> Hamburg
Hamburg --> Potsdam
Hamburg --> Berlin
The final result after grouping (reducing) should look like this. Remark: it would be nice if the list of values included all columns of each city.
Berlin --> [Potsdam, Hamburg]
Potsdam --> [Berlin, Hamburg]
Hamburg --> [Berlin, Potsdam]
Or just the count of cities within 500 km of each city.
Berlin --> 2
Potsdam --> 2
Hamburg --> 2
Since I'm quite new to Python, I would appreciate any starting point. I'm familiar with the haversine distance, but I'm not sure whether there are useful distance/spatial methods in SciPy or pandas.
Up to now I have tried following this post.
Update: The original idea behind this question comes from the Two Sigma Connect Rental Listing Kaggle competition. The idea is to find the listings within 100 m of another listing, which a) indicates density and therefore a popular area, and b) if the addresses are compared, lets you find out whether there is a crossing and therefore a noisy area. Therefore you now need the full item-to-item relation, since you need to compare not only the distance but also the address and other metadata. PS: I'm not uploading a solution to Kaggle. I just want to learn.
Let's say you already have a pandas DataFrame with a few columns and you would like to merge a Series into it as additional columns. This is possible with the pandas.DataFrame.merge() method.
Both join and merge can be used to combine two DataFrames. The join method combines them on the basis of their indexes, whereas the merge method is more versatile and lets you specify columns besides the index to join on for both DataFrames.
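To make the difference concrete, here is a minimal sketch. The two small frames `left` and `right` are made up for illustration and reuse coordinates from the question; they are not part of the original post.

```python
import pandas as pd

# Two hypothetical frames that share a 'city' column.
left = pd.DataFrame({'city': ['Berlin', 'Hamburg'], 'lat': [52.52437, 53.57532]})
right = pd.DataFrame({'city': ['Berlin', 'Hamburg'], 'lng': [13.41053, 10.01534]})

# merge: join on an ordinary column, which stays a column in the result.
merged = left.merge(right, on='city')

# join: aligns on the index, so 'city' is moved into the index first.
joined = left.set_index('city').join(right.set_index('city'))

print(merged)
print(joined)
```

Both results hold the same data; `merged` keeps `city` as a regular column, while `joined` carries it as the index.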
The values property returns a NumPy representation of the DataFrame. Only the values are returned; the axis labels are removed. A DataFrame where all columns are the same type (e.g., int64) results in an array of that type.
You can use:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
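As a quick sanity check of the function, here is a small sketch that calls it with the Berlin and Potsdam coordinates from the question. Note the argument order: longitude comes before latitude. The variable name `berlin_potsdam` is only for illustration.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # 6371 km: mean Earth radius

# Berlin -> Potsdam, arguments given as (lon, lat, lon, lat)
berlin_potsdam = haversine(13.41053, 52.52437, 13.06566, 52.39886)
print(round(berlin_potsdam, 1))  # roughly 27.2 km
```

The distance is symmetric, so swapping the two points gives the same value.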
First you need a cross join with merge, then remove the rows with the same values in city_x and city_y by boolean indexing:
df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
print (df)
city_x lat_x lng_x tmp city_y lat_y lng_y
1 Berlin 52.52437 13.41053 1 Potsdam 52.39886 13.06566
2 Berlin 52.52437 13.41053 1 Hamburg 53.57532 10.01534
3 Potsdam 52.39886 13.06566 1 Berlin 52.52437 13.41053
5 Potsdam 52.39886 13.06566 1 Hamburg 53.57532 10.01534
6 Hamburg 53.57532 10.01534 1 Berlin 52.52437 13.41053
7 Hamburg 53.57532 10.01534 1 Potsdam 52.39886 13.06566
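As a side note, the helper column is not strictly required on newer pandas. The sketch below assumes pandas 1.2 or later, where merge accepts how='cross'; the name `pairs` is chosen here for illustration and is not from the original answer.

```python
import pandas as pd

df = pd.DataFrame([
    {'city': "Berlin", 'lat': 52.52437, 'lng': 13.41053},
    {'city': "Potsdam", 'lat': 52.39886, 'lng': 13.06566},
    {'city': "Hamburg", 'lat': 53.57532, 'lng': 10.01534},
])

# how='cross' builds the Cartesian product directly (assumes pandas >= 1.2),
# so no temporary 'tmp' key column is needed.
pairs = df.merge(df, how='cross', suffixes=('_x', '_y'))

# Drop the rows that pair a city with itself.
pairs = pairs[pairs.city_x != pairs.city_y]
print(pairs[['city_x', 'city_y']])
```

The result has the same six city pairs as the tmp-based merge, just without the tmp column.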
Then apply haversine function:
df['dist'] = df.apply(lambda row: haversine(row['lng_x'],
row['lat_x'],
row['lng_y'],
row['lat_y']), axis=1)
Filter distance:
df = df[df.dist < 500]
print (df)
city_x lat_x lng_x tmp city_y lat_y lng_y dist
1 Berlin 52.52437 13.41053 1 Potsdam 52.39886 13.06566 27.215704
2 Berlin 52.52437 13.41053 1 Hamburg 53.57532 10.01534 255.223782
3 Potsdam 52.39886 13.06566 1 Berlin 52.52437 13.41053 27.215704
5 Potsdam 52.39886 13.06566 1 Hamburg 53.57532 10.01534 242.464120
6 Hamburg 53.57532 10.01534 1 Berlin 52.52437 13.41053 255.223782
7 Hamburg 53.57532 10.01534 1 Potsdam 52.39886 13.06566 242.464120
Finally, create a list or get the size with groupby:
df1 = df.groupby('city_x')['city_y'].apply(list)
print (df1)
city_x
Berlin [Potsdam, Hamburg]
Hamburg [Berlin, Potsdam]
Potsdam [Berlin, Hamburg]
Name: city_y, dtype: object
df2 = df.groupby('city_x')['city_y'].size()
print (df2)
city_x
Berlin 2
Hamburg 2
Potsdam 2
dtype: int64
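The question also asks for the list to include all columns of each neighbouring city. One possible way, sketched here under the assumption that a plain dict of records per city is an acceptable result shape, is to convert each group to records. The names `pairs`, `near` and `neighbours` are illustrative, and the distance is computed inline with NumPy to keep the snippet self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'city': "Berlin", 'lat': 52.52437, 'lng': 13.41053},
    {'city': "Potsdam", 'lat': 52.39886, 'lng': 13.06566},
    {'city': "Hamburg", 'lat': 53.57532, 'lng': 10.01534},
])

# Cross join via a temporary key, then drop self-pairs.
tmp = df.assign(tmp=1)
pairs = tmp.merge(tmp, on='tmp').drop(columns='tmp')
pairs = pairs[pairs.city_x != pairs.city_y].copy()

# Vectorized haversine distance in km (Earth radius 6371 km).
lat1, lng1, lat2, lng2 = (np.radians(pairs[c]) for c in ['lat_x', 'lng_x', 'lat_y', 'lng_y'])
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
pairs['dist'] = 2 * 6371 * np.arcsin(np.sqrt(a))

near = pairs[pairs.dist < 500]

# One list of full neighbour records (all *_y columns plus the distance) per city.
neighbours = {
    city: grp[['city_y', 'lat_y', 'lng_y', 'dist']].to_dict('records')
    for city, grp in near.groupby('city_x')
}
print(neighbours['Berlin'])
```

Each entry of `neighbours` is a list of dicts, so the address or other metadata columns of a real dataset could be carried along in the same way.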
It is also possible to use a vectorized NumPy haversine solution:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
#print (df)
df['dist'] = haversine_np(df['lng_x'],df['lat_x'],df['lng_y'],df['lat_y'])
city_x lat_x lng_x tmp city_y lat_y lng_y dist
1 Berlin 52.52437 13.41053 1 Potsdam 52.39886 13.06566 27.198616
2 Berlin 52.52437 13.41053 1 Hamburg 53.57532 10.01534 255.063541
3 Potsdam 52.39886 13.06566 1 Berlin 52.52437 13.41053 27.198616
5 Potsdam 52.39886 13.06566 1 Hamburg 53.57532 10.01534 242.311890
6 Hamburg 53.57532 10.01534 1 Berlin 52.52437 13.41053 255.063541
7 Hamburg 53.57532 10.01534 1 Potsdam 52.39886 13.06566 242.311890
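The cross join grows quadratically with the number of rows, which may become a problem for something like the rental listings mentioned in the question. A hedged sketch of one alternative follows: it assumes SciPy is available and that a spherical Earth with radius 6371 km is accurate enough. The helper name `neighbours_within` is hypothetical. The idea is to map each point onto the unit sphere in 3D and query a k-d tree with the chord length that corresponds to the desired great-circle radius.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0  # assumed spherical Earth

def neighbours_within(lat_deg, lng_deg, radius_km):
    """For each point, return the indices of the other points within radius_km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lng = np.radians(np.asarray(lng_deg, dtype=float))
    # Points on the unit sphere in 3D Cartesian coordinates.
    xyz = np.column_stack([np.cos(lat) * np.cos(lng),
                           np.cos(lat) * np.sin(lng),
                           np.sin(lat)])
    # Straight-line (chord) length on the unit sphere for the given arc length.
    chord = 2 * np.sin(radius_km / (2 * EARTH_RADIUS_KM))
    tree = cKDTree(xyz)
    hits = tree.query_ball_point(xyz, r=chord)
    # Remove each point from its own result list.
    return [[j for j in row if j != i] for i, row in enumerate(hits)]

lat = [52.52437, 52.39886, 53.57532]  # Berlin, Potsdam, Hamburg
lng = [13.41053, 13.06566, 10.01534]
print(neighbours_within(lat, lng, radius_km=100))
```

With a 100 km radius only Berlin and Potsdam find each other. The returned indices can be used with df.iloc to look up the address and other columns of each neighbour.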
UPDATE: I would suggest building a distance DataFrame first:
from scipy.spatial.distance import squareform, pdist
from itertools import combinations
# see definition of "haversine_np()" below
x = pd.DataFrame({'dist':pdist(df[['lat','lng']], haversine_np)},
index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))
This efficiently produces a pairwise distance DataFrame (without duplicates):
In [106]: x
Out[106]:
dist
Berlin Potsdam 27.198616
Hamburg 255.063541
Potsdam Hamburg 242.311890
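Because each unordered pair appears only once in this frame, counting neighbours per city means counting a pair for both of its members. A minimal sketch of that step follows, assuming a 500 km radius as in the question; the names `within` and `counts` and the column labels `a` and `b` are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

def haversine_np(p1, p2):
    """Great-circle distance in km between two (lat, lng) points in decimal degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1], p2[0], p2[1]])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame([
    {'city': "Berlin", 'lat': 52.52437, 'lng': 13.41053},
    {'city': "Potsdam", 'lat': 52.39886, 'lng': 13.06566},
    {'city': "Hamburg", 'lat': 53.57532, 'lng': 10.01534},
])

# Condensed pairwise distances, indexed by (city, city) pairs.
x = pd.DataFrame({'dist': pdist(df[['lat', 'lng']].to_numpy(), haversine_np)},
                 index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))

# Keep the pairs inside the radius and count each pair once per member city.
within = x[x['dist'] < 500].reset_index()
within.columns = ['a', 'b', 'dist']
counts = pd.concat([within['a'], within['b']]).value_counts()
print(counts)
```

A city with no neighbour inside the radius does not appear in `counts` at all; reindexing with df['city'] and filling with 0 would add it back.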
Old answer:
Here is a slightly optimized version, which uses the scipy.spatial.distance.pdist method:
import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform, pdist

# slightly modified version of http://stackoverflow.com/a/29546836/2901002
def haversine_np(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    p1 and p2 are (lat, lon) pairs.
    """
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1],
                                         p2[0], p2[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
x = pd.DataFrame(squareform(pdist(df[['lat','lng']], haversine_np)),
columns=df.city.unique(),
index=df.city.unique())
This gives us:
In [78]: x
Out[78]:
Berlin Potsdam Hamburg
Berlin 0.000000 27.198616 255.063541
Potsdam 27.198616 0.000000 242.311890
Hamburg 255.063541 242.311890 0.000000
Let's count the number of cities where the distance is greater than 30 km:
In [81]: x.groupby(level=0, as_index=False) \
...: .apply(lambda c: c[c>30].notnull().sum(1)) \
...: .reset_index(level=0, drop=True)
Out[81]:
Berlin 1
Hamburg 2
Potsdam 1
dtype: int64
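The same count can probably be written more directly with a boolean comparison on the square matrix. The sketch below is one way to do it, not the original author's method: it builds the matrix with NumPy broadcasting instead of pdist (an assumption made to keep the snippet free of SciPy) and uses the same 6367 km radius as above. The names `within` and `far` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    {'city': "Berlin", 'lat': 52.52437, 'lng': 13.41053},
    {'city': "Potsdam", 'lat': 52.39886, 'lng': 13.06566},
    {'city': "Hamburg", 'lat': 53.57532, 'lng': 10.01534},
])

lat = np.radians(df['lat'].to_numpy())
lng = np.radians(df['lng'].to_numpy())

# Broadcasting a column vector against a row vector yields all pairwise differences.
dlat = lat[:, None] - lat[None, :]
dlng = lng[:, None] - lng[None, :]
a = np.sin(dlat / 2) ** 2 + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2) ** 2
x = pd.DataFrame(2 * 6367 * np.arcsin(np.sqrt(a)), index=df['city'], columns=df['city'])

# Cities farther away than 30 km (the zero diagonal never qualifies).
far = (x > 30).sum(axis=1)

# Cities within 500 km, minus one for the city itself on the diagonal.
within = (x < 500).sum(axis=1) - 1

print(far)
print(within)
```

Here `far` reproduces the counts shown above, and `within` answers the original question of how many cities lie within 500 km of each city.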