
What is the functionality of the filling method when reindexing?

Tags:

pandas

When reindexing, say, 1-minute data to daily data (e.g. an index of daily prices at 16:00), there may be no 1-minute data for the 16:00 timestamp on a given day. In that case we would want to forward fill from the last non-null 1-minute value. In the following example, there is no 1-minute data before 16:00 on the 13th, and the last 1-minute data comes from the 10th.

When using reindex with method='ffill', wouldn't one expect the following code to fill in the value on the 13th at 16:00? Inspecting daily1 shows, however, that it is missing.

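import pandas as pd
import numpy as np

# One-minute data from 2013-05-09 09:00 to 2013-05-13 23:59
hf_index = pd.date_range(start='2013-05-09 9:00', end='2013-05-13 23:59', freq='1min')
hf_prices = np.random.rand(len(hf_index))
hf = pd.DataFrame(hf_prices, index=hf_index)
# Blank out everything from 18:00 on the 10th to 18:00 on the 13th
# (.loc replaces .ix, which has been removed from pandas)
hf.loc['2013-05-10 18:00':'2013-05-13 18:00', :] = np.nan
hf.plot()

# Business-day index stamped at 16:00
ind_daily = pd.date_range(start='2013-05-09 16:00', end='2013-05-13 16:00', freq='B')

print(ind_daily.values)
daily1 = hf.reindex(index=ind_daily, method='ffill')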

To fill as one (or rather I) would expect, I need to do this:

daily2 = daily1.ffill()  # same as fillna(method='ffill'), which newer pandas deprecates

If this is the case, what is the fill method in reindex actually doing? It is not clear to me from the pandas documentation alone. It seems to me that I should not need the line above.

asked Nov 10 '22 16:11 by user915


1 Answer

I am posting my GitHub comment here as well:

In my opinion, the current behavior makes more sense. NaN values can be valid "actual" values in some scenarios. An actual NaN value should be treated differently from a NaN that appears only because the index changed. If I have a DataFrame like this:

       A      B      C
1  1.242    NaN  0.110
3    NaN -0.185 -0.209
5 -0.581  1.483    NaN

and I want to keep every NaN as NaN, it makes much more sense to have:

 df.reindex( [2, 4, 6], method='ffill' )
        A      B      C
2  1.242    NaN  0.110
4    NaN -0.185 -0.209
6 -0.581  1.483    NaN

That is, just take whatever value is there (NaN or not) and fill forward until the next available index. Reindexing should not enforce a mandatory fillna on the data.

This is completely different from

df.reindex( [2, 4, 6], method=None )

which produces

    A   B   C
2 NaN NaN NaN
4 NaN NaN NaN
6 NaN NaN NaN
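
The two calls above can be reproduced with a small script. This is a minimal sketch: the frame is rebuilt by hand from the values shown, and the names `filled` and `unfilled` are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame: each column holds one genuine NaN observation.
df = pd.DataFrame(
    {
        "A": [1.242, np.nan, -0.581],
        "B": [np.nan, -0.185, 1.483],
        "C": [0.110, -0.209, np.nan],
    },
    index=[1, 3, 5],
)

# method='ffill' selects, for each new label, the row at the last existing
# label that is <= the new label, and copies it as is (NaN included).
filled = df.reindex([2, 4, 6], method="ffill")

# Without a method, labels that do not exist in the original index get NaN.
unfilled = df.reindex([2, 4, 6])

print(filled)
print(unfilled)
```

Note that the fill works on index labels, not on values: label 4 takes the row stored at label 3, even though column A is NaN there.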

Here is an example:

np.nan can simply mean "not applicable". Say I have hourly data, and on weekends some calculations do not apply, so I fill those columns with NaN during the weekends. If I then reindex to a finer index, say every minute, and reindex also filled over NaN values, it would pick the last value from Friday and carry it across the whole weekend. That would be wrong.
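
A rough sketch of that weekend scenario, under assumptions of my own: the dates, the constant value 1.0 and the 30-minute target frequency are made up for illustration, and the names `hourly` and `finer` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hourly series from Friday 2013-05-10 through Monday 2013-05-13.
hourly_index = pd.date_range("2013-05-10 00:00", "2013-05-13 23:00", freq="60min")
hourly = pd.Series(1.0, index=hourly_index)

# Mark the weekend (Saturday=5, Sunday=6) as "not applicable".
hourly.loc[hourly.index.dayofweek >= 5] = np.nan

# Reindex to a finer grid; ffill copies the value at the previous label.
finer_index = pd.date_range("2013-05-10 00:00", "2013-05-13 23:30", freq="30min")
finer = hourly.reindex(finer_index, method="ffill")

# The weekend stays NaN instead of inheriting Friday's last value.
print(finer.loc["2013-05-11":"2013-05-12"].isna().all())
```

The weekend rows stay NaN on the finer grid, while Friday and Monday are filled from their own hourly values.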

In short, when reindexing a DataFrame, forward fill takes whatever value sits at the previous index label, NaN or not, and carries it forward until the next available label. A NaN can be an actual, valid observation that you want to keep as is.
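
For the situation in the question, where the NaNs really are gaps to be filled, one possible approach is to forward fill the 1-minute data first and reindex afterwards. The sketch below is an illustration, not the only way to do it: it swaps the random prices for `np.arange` so the result is deterministic, and the names `filled_first` and `filled_after` are hypothetical.

```python
import numpy as np
import pandas as pd

# One-minute series with a block of NaNs, shaped like the question's data.
hf_index = pd.date_range("2013-05-09 09:00", "2013-05-13 23:59", freq="1min")
hf = pd.Series(np.arange(len(hf_index), dtype=float), index=hf_index)
hf.loc["2013-05-10 18:00":"2013-05-13 18:00"] = np.nan

ind_daily = pd.date_range("2013-05-09 16:00", "2013-05-13 16:00", freq="B")

# reindex alone: the 16:00 label exists on the 13th, so its NaN is copied.
daily1 = hf.reindex(ind_daily, method="ffill")

# Fill the gaps in the 1-minute data first, then pick the daily labels.
filled_first = hf.ffill().reindex(ind_daily)

# Filling after reindexing carries forward the previous *daily* value.
filled_after = daily1.ffill()

print(daily1.iloc[-1], filled_first.iloc[-1], filled_after.iloc[-1])
```

The two filled results differ: filling first uses the last non-null 1-minute value (17:59 on the 10th), whereas filling after the reindex reuses the 16:00 value from the 10th.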

answered Dec 16 '22 05:12 by behzad.nouri