Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Iterate over rows and expand pandas dataframe

I have pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so each value in the list becomes single value in column. An example says it all:

dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
 u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})

    location     name
0   Amsterdam   Tom
1   [Berlin, Paris] Jim
2   [Antwerp, Barcelona, Pisa]  Claus

I want to turn into:

dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})

    location     name
0   Amsterdam   Tom
1   Berlin   Jim
2   Paris   Jim
3   Antwerp Claus
4   Barcelona   Claus
5   Pisa    Claus

I first tried using apply but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...

def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)
like image 208
bowlby Avatar asked Sep 26 '14 20:09

bowlby


1 Answers

Not as much interesting/fancy pandas usage, but this works:

import numpy as np
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})

It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all location entries are already iterables, you can remove the atleast_1d call, which gives about another 20% speedup.

like image 72
MorganM Avatar answered Oct 05 '22 15:10

MorganM