 

Fillna (forward fill) on a large dataframe efficiently with groupby?

Tags:

pandas

What is the most efficient way to forward fill information in a large dataframe?

I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.

Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?

id       start_date   end_date    is_current  location  dimensions...
xyz987   2016-03-11   2016-04-02  Expired       CA      lots_of_stuff
xyz987   2016-04-03   2016-04-21  Expired       NaN     lots_of_stuff
xyz987   2016-04-22          NaN  Current       CA      lots_of_stuff

That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on one row but blank on the next; the location has not actually changed, but the blank makes the row register as unique when I drop duplicates.

I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?

cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)

There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a

df.fillna(method='ffill', inplace=True)

but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
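To see why the plain `df.fillna(method='ffill')` is not enough, here is a hypothetical two-id frame (ids and values invented for illustration): a global forward fill leaks the previous id's value into the next id's leading NaN, while a per-id fill does not.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: xyz987's first location is missing.
df = pd.DataFrame({
    'id':       ['abc123', 'xyz987', 'xyz987'],
    'location': ['NY',     np.nan,   'CA'],
})

# Plain ffill ignores ids: 'NY' leaks into xyz987's first row.
leaked = df.ffill()
# leaked['location'] -> ['NY', 'NY', 'CA']

# Per-id ffill leaves it NaN, since xyz987 has no earlier value.
grouped = df.groupby('id')['location'].ffill()
# grouped -> ['NY', NaN, 'CA']
```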

asked Apr 26 '16 at 17:04 by trench


3 Answers

It is likely most efficient to call fillna directly on the groupby object:

df = df.groupby(['id']).fillna(method='ffill')

The method is covered in the pandas groupby documentation.
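A minimal sketch of this approach on the question's sample rows (dimension columns trimmed). Note that newer pandas deprecates `fillna(method='ffill')` in favour of `.ffill()`, which is the same operation, and that the groupby result omits the `id` key column, so assigning it straight back to `df` would drop `id`:

```python
import numpy as np
import pandas as pd

# The question's sample rows, trimmed to a few columns.
df = pd.DataFrame({
    'id':         ['xyz987'] * 3,
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22'],
    'is_current': ['Expired', 'Expired', 'Current'],
    'location':   ['CA', np.nan, 'CA'],
})

# Forward fill within each id; .ffill() is the modern spelling.
filled = df.groupby('id').ffill()

# The groupby output excludes the 'id' key column, so write the
# filled columns back rather than replacing df wholesale.
df[filled.columns] = filled
```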

answered Nov 20 '22 at 23:11 by bbaker


How about forward filling each group?

df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
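The apply route and the direct groupby ffill give the same per-id fill; a quick check on a toy frame (ids invented). The difference is cost: apply runs Python code once per group, which matters with 75,000 ids.

```python
import numpy as np
import pandas as pd

# Toy frame with two ids; each route fills within an id only.
df = pd.DataFrame({
    'id':       ['a', 'a', 'b', 'b'],
    'location': ['CA', np.nan, np.nan, 'NY'],
})

# The apply route runs Python-level code for every group.
via_apply = df.groupby('id', as_index=False).apply(lambda g: g.ffill())

# The direct route stays inside pandas and is usually faster.
direct = df.groupby('id')['location'].ffill()
# 'b' keeps its leading NaN: there is no earlier value in its group.
```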
answered Nov 20 '22 at 21:11 by Alexander


From jreback on GitHub: "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this." (https://github.com/pandas-dev/pandas/issues/11296)

According to jreback's comment, ffill() is not optimized when run through a groupby, but cumsum() is, so the fill can be rebuilt from vectorized pieces. Note that the mask must be grouped by the original id values (grouping the already-transformed frame by its 'id' column would lump everything into one group), and that the multiply trick only behaves for numeric columns:

df = df.sort_values('id')
mask = (1 - df.isnull().astype(int)).groupby(df['id']).cumsum().applymap(lambda x: None if x == 0 else 1)
df = df.ffill() * mask
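The same masking idea can be written per column with where(), which also sidesteps multiplying string columns; a sketch on a toy frame (ids and values invented):

```python
import numpy as np
import pandas as pd

# Toy frame sorted by id; 'b' starts with a missing location.
df = pd.DataFrame({
    'id':       ['a', 'a', 'b', 'b'],
    'location': ['CA', np.nan, np.nan, 'NY'],
}).sort_values('id')

# Mask: within each id, positions before the first non-null stay 0.
# Group by the id values themselves, not the transformed frame.
seen = df['location'].notna().astype(int).groupby(df['id']).cumsum()

# Global ffill, then blank out positions with no in-group
# predecessor ('CA' would otherwise leak into 'b').
filled = df['location'].ffill().where(seen > 0)
```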
answered Nov 20 '22 at 22:11 by xmduhan