Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filling continuous pandas dataframe from sparse dataframe

I have a dictionary name date_dict keyed by datetime dates with values corresponding to integer counts of observations. I convert this to a sparse series/dataframe with censored observations that I would like to join or convert to a series/dataframe with continuous dates. The nasty list comprehension is my hack to get around the fact that pandas apparently won't automatically covert datetime date objects to an appropriate DateTime index.

df1 = pd.DataFrame(data=date_dict.values(),
                   index=[datetime.datetime.combine(i, datetime.time()) 
                          for i in date_dict.keys()],
                   columns=['Name'])
df1 = df1.sort(axis=0)

This example has 1258 observations and the DateTime index runs from 2003-06-24 to 2012-11-07.

df1.head()
             Name
Date
2003-06-24   2
2003-08-13   1
2003-08-19   2
2003-08-22   1
2003-08-24   5

I can create an empty dataframe with a continuous DateTime index, but this introduces an unneeded column and seems clunky. I feel as though I'm missing a more elegant solution involving a join.

df2 = pd.DataFrame(data=None,columns=['Empty'],
                   index=pd.DateRange(min(date_dict.keys()),
                                      max(date_dict.keys())))
df3 = df1.join(df2,how='right')
df3.head()
            Name    Empty
2003-06-24   2   NaN
2003-06-25  NaN  NaN
2003-06-26  NaN  NaN
2003-06-27  NaN  NaN
2003-06-30  NaN  NaN

Is there a simpler or more elegant way to fill a continuous dataframe from a sparse dataframe so that there is (1) a continuous index, (2) the NaNs are 0s, and (3) there is no left-over empty column in the dataframe?

            Name
2003-06-24   2
2003-06-25   0
2003-06-26   0
2003-06-27   0
2003-06-30   0
like image 977
Brian Keegan Avatar asked Nov 13 '12 23:11

Brian Keegan


1 Answers

You can just use reindex on a time series using your date range. Also it looks like you would be better off using a TimeSeries instead of a DataFrame (see documentation), although reindexing is also the correct method for adding missing index values to DataFrames as well.

For example, starting with:

date_index = pd.DatetimeIndex([pd.datetime(2003,6,24), pd.datetime(2003,8,13),
        pd.datetime(2003,8,19), pd.datetime(2003,8,22), pd.datetime(2003,8,24)])

ts = pd.Series([2,1,2,1,5], index=date_index)

Gives you a time series like your example dataframe's head:

2003-06-24    2
2003-08-13    1
2003-08-19    2
2003-08-22    1
2003-08-24    5

Simply doing

ts.reindex(pd.date_range(min(date_index), max(date_index)))

then gives you a complete index, with NaNs for your missing values (you can use fillna if you want to fill the missing values with some other values - see here):

2003-06-24     2
2003-06-25   NaN
2003-06-26   NaN
2003-06-27   NaN
2003-06-28   NaN
2003-06-29   NaN
2003-06-30   NaN
2003-07-01   NaN
2003-07-02   NaN
2003-07-03   NaN
2003-07-04   NaN
2003-07-05   NaN
2003-07-06   NaN
2003-07-07   NaN
2003-07-08   NaN
2003-07-09   NaN
2003-07-10   NaN
2003-07-11   NaN
2003-07-12   NaN
2003-07-13   NaN
2003-07-14   NaN
2003-07-15   NaN
2003-07-16   NaN
2003-07-17   NaN
2003-07-18   NaN
2003-07-19   NaN
2003-07-20   NaN
2003-07-21   NaN
2003-07-22   NaN
2003-07-23   NaN
2003-07-24   NaN
2003-07-25   NaN
2003-07-26   NaN
2003-07-27   NaN
2003-07-28   NaN
2003-07-29   NaN
2003-07-30   NaN
2003-07-31   NaN
2003-08-01   NaN
2003-08-02   NaN
2003-08-03   NaN
2003-08-04   NaN
2003-08-05   NaN
2003-08-06   NaN
2003-08-07   NaN
2003-08-08   NaN
2003-08-09   NaN
2003-08-10   NaN
2003-08-11   NaN
2003-08-12   NaN
2003-08-13     1
2003-08-14   NaN
2003-08-15   NaN
2003-08-16   NaN
2003-08-17   NaN
2003-08-18   NaN
2003-08-19     2
2003-08-20   NaN
2003-08-21   NaN
2003-08-22     1
2003-08-23   NaN
2003-08-24     5
Freq: D, Length: 62
like image 170
Matti John Avatar answered Sep 22 '22 04:09

Matti John