 

Fillna (forward fill) on a large dataframe efficiently with groupby?

Tags:

pandas

What is the most efficient way to forward fill information in a large dataframe?

I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.

Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?

id       start_date   end_date    is_current  location  dimensions...
xyz987   2016-03-11   2016-04-02  Expired       CA      lots_of_stuff
xyz987   2016-04-03   2016-04-21  Expired       NaN     lots_of_stuff
xyz987   2016-04-22          NaN  Current       CA      lots_of_stuff

That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in on one row but blank on the next; the location has not actually changed, but the blank makes the row register as unique when I drop duplicates.

I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?

cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)

There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a

df.fillna(method='ffill', inplace=True)

but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
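To see why the plain `df.fillna(method='ffill')` is not enough, here is a hypothetical two-id frame (ids and values invented for illustration): a global forward fill leaks the previous id's value into the next id's leading NaN, while a per-id fill does not.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: xyz987's first location is missing.
df = pd.DataFrame({
    'id':       ['abc123', 'xyz987', 'xyz987'],
    'location': ['NY',     np.nan,   'CA'],
})

# Plain ffill ignores ids: 'NY' leaks into xyz987's first row.
leaked = df.ffill()
# leaked['location'] -> ['NY', 'NY', 'CA']

# Per-id ffill leaves it NaN, since xyz987 has no earlier value.
grouped = df.groupby('id')['location'].ffill()
# grouped -> ['NY', NaN, 'CA']
```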

asked Apr 26 '16 at 17:04 by trench


3 Answers

It is likely most efficient to call fillna directly on the groupby object:

df = df.groupby(['id']).fillna(method='ffill')

The method is covered in the pandas groupby documentation.
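A minimal sketch of this approach on the question's sample rows (dimension columns trimmed). Note that newer pandas deprecates `fillna(method='ffill')` in favour of `.ffill()`, which is the same operation, and that the groupby result omits the `id` key column, so assigning it straight back to `df` would drop `id`:

```python
import numpy as np
import pandas as pd

# The question's sample rows, trimmed to a few columns.
df = pd.DataFrame({
    'id':         ['xyz987'] * 3,
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22'],
    'is_current': ['Expired', 'Expired', 'Current'],
    'location':   ['CA', np.nan, 'CA'],
})

# Forward fill within each id; .ffill() is the modern spelling.
filled = df.groupby('id').ffill()

# The groupby output excludes the 'id' key column, so write the
# filled columns back rather than replacing df wholesale.
df[filled.columns] = filled
```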

answered Nov 20 '22 at 23:11 by bbaker


How about forward filling each group?

df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
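The apply route and the direct groupby ffill give the same per-id fill; a quick check on a toy frame (ids invented). The difference is cost: apply runs Python code once per group, which matters with 75,000 ids.

```python
import numpy as np
import pandas as pd

# Toy frame with two ids; each route fills within an id only.
df = pd.DataFrame({
    'id':       ['a', 'a', 'b', 'b'],
    'location': ['CA', np.nan, np.nan, 'NY'],
})

# The apply route runs Python-level code for every group.
via_apply = df.groupby('id', as_index=False).apply(lambda g: g.ffill())

# The direct route stays inside pandas and is usually faster.
direct = df.groupby('id')['location'].ffill()
# 'b' keeps its leading NaN: there is no earlier value in its group.
```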
answered Nov 20 '22 at 21:11 by Alexander


From jreback on GitHub: "this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this." (https://github.com/pandas-dev/pandas/issues/11296)

According to jreback's comment, ffill() is not optimized when run through a groupby, but cumsum() is, so the fill can be rebuilt from vectorized pieces. Note that the mask must be grouped by the original id values (grouping the already-transformed frame by its 'id' column would lump everything into one group), and that the multiply trick only behaves for numeric columns:

df = df.sort_values('id')
mask = (1 - df.isnull().astype(int)).groupby(df['id']).cumsum().applymap(lambda x: None if x == 0 else 1)
df = df.ffill() * mask
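The same masking idea can be written per column with where(), which also sidesteps multiplying string columns; a sketch on a toy frame (ids and values invented):

```python
import numpy as np
import pandas as pd

# Toy frame sorted by id; 'b' starts with a missing location.
df = pd.DataFrame({
    'id':       ['a', 'a', 'b', 'b'],
    'location': ['CA', np.nan, np.nan, 'NY'],
}).sort_values('id')

# Mask: within each id, positions before the first non-null stay 0.
# Group by the id values themselves, not the transformed frame.
seen = df['location'].notna().astype(int).groupby(df['id']).cumsum()

# Global ffill, then blank out positions with no in-group
# predecessor ('CA' would otherwise leak into 'b').
filled = df['location'].ffill().where(seen > 0)
```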
answered Nov 20 '22 at 22:11 by xmduhan