Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas fill forward performance issue

I have a dataframe with a multi index (Date, InputTime) and this dataframe may contain some NA values in the columns (Value, Id). I want to fill forward value but by Date only and I don't find anyway to do this a in a very efficient way.

Here is the type of dataframe I have :

Dataframe example

And here is the result I want :

Dataframe properly fill forward by date only

So to properly fillback by date I can use groupby(level=0) function. The groupby is fast but the fill function apply on the dataframe group by date is really too slow.

Here is the code I use to compare simple fill forward (which doesn't give the expected result but is run very quickly) and expected fill forward by date (which give expected result but is really too slow).

import numpy as np
import pandas as pd
import datetime as dt

# Show pandas & numpy versions
print('pandas '+pd.__version__)
print('numpy '+np.__version__)

# Build a big list of (Date,InputTime,Value,Id)
listdata = []
d = dt.datetime(2001,10,6,5)
for i in range(0,100000):
    listdata.append((d.date(), d, 2*i if i%3==1 else np.NaN, i if i%3==1 else np.NaN))
    d = d + dt.timedelta(hours=8)

# Create the dataframe with Date and InputTime as index
df = pd.DataFrame.from_records(listdata, index=['Date','InputTime'], columns=['Date', 'InputTime', 'Value', 'Id'])

# Simple Fill forward on index
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].ffill()
end = dt.datetime.now()
print "Time to fill forward on index = " + str((end-start).total_seconds()) + " s"

# Fill forward on Date (first level of index)
start = dt.datetime.now()
for col in df.columns:
    df[col] = df[col].groupby(level=0).ffill()
end = dt.datetime.now()
print "Time to fill forward on Date only = " + str((end-start).total_seconds()) + " s"

Results

Could somebody explain me why this code is so slow or help me to find an efficient way to fill forward by date on big dataframes?

Thanks

like image 544
RitonLaJoie Avatar asked Oct 30 '22 17:10

RitonLaJoie


1 Answers

github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group. here's an easy way to do this. url:https://github.com/pandas-dev/pandas/issues/11296

according to jreback's answer, when you do a groupby ffill() is not optimized, but cumsum() is. try this:

df = df.sort_index()
df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)

utility function: (credit to @Phun)

def ffill_se(df: pd.DataFrame, group_cols: List[str]):
    df['GROUP'] = df.groupby(group_cols).ngroup()
    df.set_index(['GROUP'], inplace=True)
    df.sort_index(inplace=True)
    df = df.ffill() * (1 - df.isnull().astype(int)).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)
    df.reset_index(inplace=True, drop=True)
    return df
like image 103
xmduhan Avatar answered Nov 09 '22 03:11

xmduhan