
Python Pandas 'apply' returns series; can't convert to dataframe

OK, I'm at half-wit's end. I'm geocoding a dataframe with geopy. I've written a simple function to take an input - country name - and return the latitude and longitude. I use apply to run the function and it returns a Pandas series object. I can't seem to convert it to a dataframe. I'm sure I'm missing something obvious, but I'm new to python and still RTFMing. BTW, the geocoder function works great.

# Import libraries 
import os 
import pandas as pd 
import numpy as np
from geopy.geocoders import Nominatim

def locate(x):
    geolocator = Nominatim()
    # print(x) # debug
    try:
        #Get geocode
        location = geolocator.geocode(x, timeout=8, exactly_one=True)
        lat = location.latitude
        lon = location.longitude
    except:
        #didn't work for some reason that I really don't care about
        lat = np.nan
        lon = np.nan
    # print(lat, lon)  # debug
    return lat,  lon # Note: also tried return { 'LAT': lat, 'LON': lon }

df_geo_in = df_addr.drop_duplicates(['COUNTRY']).reset_index()    #works perfectly
df_geo_in['LAT'], df_geo_in['LON']  = df_geo_in.applymap(locate) 
# error: returns more than 2 values - default index + column with results

I also tried

df_geo_in['LAT','LON'] = df_geo_in.applymap(locate)

I get a single dataframe with no index and a single column with the series in it.

I've tried a number of other methods, including 'applymap':

source_cols = ['LAT','LON'] 
new_cols = [str(x) for x in source_cols]

df_geo_in = df_addr.drop_duplicates(['COUNTRY']).set_index(['COUNTRY']) 
df_geo_in[new_cols] = df_geo_in.applymap(locate)

which returned an error after a long time:

ValueError: Columns must be same length as key

I've also tried manually converting the series to a dataframe using the df.from_dict(df_geo_in) method without success.

The goal is to geocode 166 unique countries, then join it back to the 188K addresses in df_addr. I'm trying to be pandas-y in my code and not write loops if possible. But I haven't found the magic to convert series into dataframes and this is the first time I've tried to use apply.

Thanks in advance - ancient C programmer

asked Mar 31 '15 02:03 by Harvey




1 Answer

I'm assuming that df_geo_in has a COUNTRY column holding the names to geocode, so I believe the following should work.

Change:

return lat,  lon

to

return pd.Series([lat, lon])

Then apply the function to that column (not to the whole df) and you should be able to assign like so:

df_geo_in[['LAT', 'LON']] = df_geo_in['COUNTRY'].apply(locate)
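As a rough, offline illustration of the Series-returning approach: the sketch below swaps the geopy call for a small hypothetical lookup table (FAKE_COORDS, with made-up coordinates) so it runs without network access. It also labels the returned Series with LAT and LON, which is an assumption added here so that the resulting columns carry those names directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the geocoder so the sketch runs offline;
# in the real code this lookup would be the geopy call inside locate().
FAKE_COORDS = {"France": (46.0, 2.0), "Japan": (36.0, 138.0)}

def locate(country):
    """Return latitude and longitude as a labeled Series (NaN if unknown)."""
    lat, lon = FAKE_COORDS.get(country, (np.nan, np.nan))
    return pd.Series([lat, lon], index=["LAT", "LON"])

df_geo_in = pd.DataFrame({"COUNTRY": ["France", "Japan", "Atlantis"]})

# Applying a Series-returning function to a single column yields a
# DataFrame with one column per label in the returned Series.
coords = df_geo_in["COUNTRY"].apply(locate)

# join() aligns on the shared index and appends the LAT and LON columns.
df_geo_in = df_geo_in.join(coords)
print(df_geo_in)
```

If locate returns an unlabeled pd.Series([lat, lon]) instead, the intermediate frame's columns are named 0 and 1, so they would need renaming before a join.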

What you tried to do was assign the result of applymap to 2 new columns, which is incorrect here: applymap calls the function on every element of the df and returns a df of the same shape as the input, so unless the lhs has that same shape the assignment won't give the desired result.
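To make the shape difference concrete, here is a small sketch with a hypothetical stub in place of the geocoder and a made-up two-column frame. It assumes element-wise application is available either as applymap (older pandas) or as DataFrame.map (the name used from pandas 2.1 onward), and picks whichever exists.

```python
import pandas as pd

def locate(value):
    # Hypothetical stub: returns a (lat, lon) tuple without calling a geocoder.
    return (0.0, 0.0)

df = pd.DataFrame({"COUNTRY": ["France", "Japan"], "CITY": ["Paris", "Tokyo"]})

# Element-wise application: applymap in older pandas, DataFrame.map in
# pandas 2.1 and later, so use whichever this installation provides.
elementwise = df.map if hasattr(df, "map") else df.applymap
per_cell = elementwise(locate)

# Applying to one column visits only that column's values.
per_country = df["COUNTRY"].apply(locate)

print(per_cell.shape)     # same shape as df: every cell was passed to locate
print(per_country.shape)  # one result per row of the COUNTRY column
```

With the real geocoder, the element-wise version would also send every non-country cell to the service, which may be part of why the attempt in the question ran for a long time before failing.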

Your latter method is also incorrect because you drop the duplicate countries and then expect this to assign a geolocation back to every row, but the de-duplicated df has a different shape from the original.

It is probably quicker for large dfs to geocode a de-duplicated lookup df and then merge this back to your larger df like so:

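# One row per distinct country, keeping only the key column
geo_lookup = df_addr[['COUNTRY']].drop_duplicates()
# locate must return pd.Series([lat, lon]) for this two-column assignment
geo_lookup[['LAT', 'LON']] = geo_lookup['COUNTRY'].apply(locate)
# Left merge back onto the full address table
df_addr = df_addr.merge(geo_lookup, on='COUNTRY', how='left')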

This will create a df of non-duplicated countries with their geolocations, and then we perform a left merge back to the master df.
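The de-duplicate, geocode, merge flow can be sketched end to end on toy data. Everything below is illustrative: FAKE_COORDS and the sample addresses are invented, and locate is a hypothetical offline stand-in for the geopy call. As a variation on the approach above, this version leaves locate returning a plain tuple (as in the question) and builds the LAT and LON columns from the list of tuples.

```python
import numpy as np
import pandas as pd

# Hypothetical offline stand-in for the geopy lookup.
FAKE_COORDS = {"France": (46.0, 2.0), "Japan": (36.0, 138.0)}

def locate(country):
    """Return a (lat, lon) tuple, or NaNs when the country is unknown."""
    return FAKE_COORDS.get(country, (np.nan, np.nan))

# Toy stand-in for the large address table.
df_addr = pd.DataFrame({
    "ADDRESS": ["1 Rue A", "2 Rue B", "3 Chome C", "4 Nowhere"],
    "COUNTRY": ["France", "France", "Japan", "Atlantis"],
})

# Geocode each distinct country exactly once.
geo_lookup = df_addr[["COUNTRY"]].drop_duplicates().reset_index(drop=True)
coords = pd.DataFrame(
    geo_lookup["COUNTRY"].map(locate).tolist(),  # list of (lat, lon) tuples
    columns=["LAT", "LON"],
    index=geo_lookup.index,
)
geo_lookup = geo_lookup.join(coords)

# Left merge keeps every address row; validate raises if the lookup
# unexpectedly contains the same country more than once.
df_out = df_addr.merge(geo_lookup, on="COUNTRY", how="left",
                       validate="many_to_one")
print(df_out)
```

Selecting only the COUNTRY column before de-duplicating keeps the lookup narrow, so the merge adds just LAT and LON rather than suffixed copies of the other address columns.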

answered Sep 21 '22 11:09 by EdChum