I have a df:
import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty
df
city_name state_name county_name
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
2 WASHINGTON DC DIST OF COLUMBIA
3 WASHINGTON DC DIST OF COLUMBIA
4 WASHINGTON DC DIST OF COLUMBIA
5 WASHINGTON DC DIST OF COLUMBIA
6 WASHINGTON DC DIST OF COLUMBIA
7 WASHINGTON DC DIST OF COLUMBIA
8 WASHINGTON DC DIST OF COLUMBIA
9 WASHINGTON DC DIST OF COLUMBIA
I want to get the latitude and longitude coordinates for any one of the columns in the data frame above. The documentation (http://geopy.readthedocs.org/en/latest/#data) is pretty straightforward for individual locations:
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim()
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
However, I want to apply the function to each row of the df and create a new column. I've tried the following:
df['city_coord'] = geolocator.geocode(lambda row: 'state_name' (row))
but I think I'm missing something in my code because I get the following:
city_name state_name county_name coordinates
0 WASHINGTON DC DIST OF COLUMBIA None
1 WASHINGTON DC DIST OF COLUMBIA None
2 WASHINGTON DC DIST OF COLUMBIA None
3 WASHINGTON DC DIST OF COLUMBIA None
4 WASHINGTON DC DIST OF COLUMBIA None
5 WASHINGTON DC DIST OF COLUMBIA None
6 WASHINGTON DC DIST OF COLUMBIA None
7 WASHINGTON DC DIST OF COLUMBIA None
8 WASHINGTON DC DIST OF COLUMBIA None
9 WASHINGTON DC DIST OF COLUMBIA None
I would like something like this, hopefully using a lambda function:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
1 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
2 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
3 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
4 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
5 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
6 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
7 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
8 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
9 WASHINGTON DC DIST OF COLUMBIA 38.8949549, -77.0366456
10 GLYNCO GA GLYNN 31.2224512, -81.5101023
I appreciate any help. After I get the coordinates I'd like to map them. Any recommended resources for mapping coordinates are greatly appreciated too. Thanks!
You can call apply and pass the function you want to execute on every row, like the following:
In [9]:
geolocator = Nominatim()
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
city_name state_name county_name \
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
city_coord
0 (District of Columbia, United States of Americ...
1 (District of Columbia, United States of Americ...
You can then access the latitude and longitude attributes:
In [16]:
df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Or do it in a one-liner by calling apply twice:
In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df
Out[17]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) did nothing useful: the lambda itself is passed to geocode and is never applied to your rows, which is why you end up with a column full of None values.
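If you'd like to keep the lambda form from the question, a row-wise version would look something like the sketch below. This is only a sketch, not part of the answer above: it assumes you want to build the query from both the city_name and state_name columns, and it guards against lookups that return None:
geolocator = Nominatim()  # newer geopy versions may require Nominatim(user_agent='my-app')
# axis=1 passes each row to the lambda, so you can combine columns into one query string
df['city_coord'] = df.apply(
    lambda row: geolocator.geocode('{}, {}'.format(row['city_name'], row['state_name'])),
    axis=1
).apply(lambda loc: (loc.latitude, loc.longitude) if loc else None)
This still makes one geocoding request per row, so the deduplication approaches below are preferable for larger frames.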
EDIT
@leb makes an interesting point here: if you have many duplicate values, it will be more performant to geocode each unique value once and then map the results back:
In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d
Out[38]:
{'DC': (38.8937154, -76.9877934586326)}
In [40]:
df['city_coord'] = df['state_name'].map(d)
df
Out[40]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
So the above gets all the unique values using unique, constructs a dict from them, and then calls map to perform the lookup and add the coords. This will be more efficient than trying to geocode row-wise.
Upvote and accept @EdChum's answer; I just wanted to add to it. His method works perfectly, but from personal experience I'd like to share a few things:
When dealing with geocoding, if you have multiple city/state combinations that repeat, it's much faster to send only one of them to be geocoded and then replicate the result to the other rows.
This is very helpful for large data and can be done in two ways:
drop_duplicates the city/state combinations, geocode each unique pair once, then copy the coordinates back to the duplicates.
groupby the city/state combination, apply geocoding only to the first row of each group by calling head(1), then duplicate the result to the remaining rows (a rough sketch of the drop_duplicates variant follows this list).
The reason is that each time you call Nominatim there is a small latency, even if you are querying the same city/state repeatedly. This latency gets worse as your data grows, causing long delays in responses and possible timeouts.
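Here is that sketch (the "city, state" query string is my assumption, not part of the original answer): geocode each unique pair once, then merge the coordinates back onto the full frame:
geolocator = Nominatim()
# one row per unique city/state pair
unique_places = df[['city_name', 'state_name']].drop_duplicates()
# geocode each unique pair only once
unique_places['city_coord'] = unique_places.apply(
    lambda row: geolocator.geocode('{}, {}'.format(row['city_name'], row['state_name'])),
    axis=1
).apply(lambda loc: (loc.latitude, loc.longitude) if loc else None)
# broadcast the coordinates back onto every duplicate row
df = df.merge(unique_places, on=['city_name', 'state_name'], how='left')
The groupby/head(1) route works the same way in spirit: geocode one representative row per group, then copy its coordinates to the rest of the group.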
Again, this is all from personally dealing with it. Just keep it in mind for future use if it doesn't benefit you now.