
Pandas filling missing dates and values within group

I have a data frame that looks like the following:

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]}) 

What I would like to do is find the minimum and maximum date in the dt column and expand that column so that every user has a row for every date in that range, filling in 0 for the val column wherever a row is added. So the desired output is:

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

I've tried the solutions mentioned here and here, but they aren't what I'm after. Any pointers would be much appreciated.

broccoli asked Jul 07 '17 19:07



2 Answers

Initial DataFrame:

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-05    b    2
3  2016-01-06    b    1

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt']) 

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)
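Since the question asks for everything between the minimum and maximum date, the same daily range can probably also be built directly with pd.date_range. The following is only a sketch under that assumption; the name dates_alt is made up for illustration, and the comparison at the end just checks that both constructions give the same dates for the sample data.

```python
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06']),
    'val': [1, 33, 2, 1],
})

# Daily range from the earliest to the latest date in the column.
# dates_alt is a hypothetical name used only in this sketch.
dates_alt = pd.date_range(x['dt'].min(), x['dt'].max(), freq='D', name='dt')

# The resample-based index for comparison.
dates = x.set_index('dt').resample('D').asfreq().index

# Index.equals compares the values (not the name or freq attributes).
same = dates_alt.equals(dates)
```

Either way you end up with the full list of days plus the unique users.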

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()

Out:
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

which can then be sorted by user (and by date within each user, so the row order does not depend on the sort algorithm being stable):

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by=['user', 'dt'])

Out:
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1
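For reuse, the steps above could be wrapped in a small helper. This is a sketch rather than a definitive implementation: the function name fill_missing_dates and its parameters are made up for illustration, it swaps in pd.date_range for the resample step, and it assumes each (user, dt) pair appears at most once, since reindex needs a unique index. Building the product with users first gives the user-major row order directly, so no extra sort is needed.

```python
import pandas as pd


def fill_missing_dates(df, date_col='dt', group_col='user', value_col='val', fill_value=0):
    """Return one row per (group, day) over the overall date span.

    Hypothetical helper; assumes (group, date) pairs are unique in df.
    """
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])

    # Every day between the overall minimum and maximum date.
    dates = pd.date_range(out[date_col].min(), out[date_col].max(), freq='D')
    groups = out[group_col].unique()

    # Groups first, so rows come out grouped by user, then ordered by date.
    idx = pd.MultiIndex.from_product((groups, dates), names=[group_col, date_col])

    filled = (
        out.set_index([group_col, date_col])[[value_col]]
        .reindex(idx, fill_value=fill_value)
        .reset_index()
    )
    # Put the columns in the order shown in the question.
    return filled[[date_col, group_col, value_col]]


x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'val': [1, 33, 2, 1],
})
result = fill_missing_dates(x)
```

Note that users are ordered by first appearance in the data here, not alphabetically; for the sample frame the two coincide.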
ayhan answered Sep 18 '22 23:09


As @ayhan suggests, first convert the dates to datetime:

x.dt = pd.to_datetime(x.dt) 

One-liner using mostly @ayhan's ideas while incorporating stack/unstack and fill_value:

x.set_index(
    ['dt', 'user']
).unstack(
    fill_value=0
).asfreq(
    'D', fill_value=0
).stack().sort_index(level=1).reset_index()

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-03    a    0
3  2016-01-04    a    0
4  2016-01-05    a    0
5  2016-01-06    a    0
6  2016-01-01    b    0
7  2016-01-02    b    0
8  2016-01-03    b    0
9  2016-01-04    b    0
10 2016-01-05    b    2
11 2016-01-06    b    1
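If you want to convince yourself that the reindex approach and the unstack/asfreq/stack chain agree, a quick check along these lines should do. It is only a sketch on the sample data; the names via_reindex, via_stack and same are made up for illustration, and recent pandas versions may emit a FutureWarning about stack.

```python
import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'b', 'b'],
    'dt': ['2016-01-01', '2016-01-02', '2016-01-05', '2016-01-06'],
    'val': [1, 33, 2, 1],
})
x['dt'] = pd.to_datetime(x['dt'])

# Approach 1: reindex against the full (date, user) product, then sort.
dates = x.set_index('dt').resample('D').asfreq().index
idx = pd.MultiIndex.from_product((dates, x['user'].unique()), names=['dt', 'user'])
via_reindex = (
    x.set_index(['dt', 'user'])
    .reindex(idx, fill_value=0)
    .reset_index()
    .sort_values(['user', 'dt'])
    .reset_index(drop=True)
)

# Approach 2: pivot users into columns, fill the daily gaps, pivot back.
via_stack = (
    x.set_index(['dt', 'user'])
    .unstack(fill_value=0)
    .asfreq('D', fill_value=0)
    .stack()
    .sort_index(level=1)
    .reset_index()
)

# Compare row by row as plain tuples (dt, user, val).
cols = ['dt', 'user', 'val']
rows_reindex = list(via_reindex[cols].itertuples(index=False, name=None))
rows_stack = list(via_stack[cols].itertuples(index=False, name=None))
same = rows_reindex == rows_stack
```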
piRSquared answered Sep 19 '22 23:09