
Fill DatetimeIndex gaps with NaN

I have two DataFrames, both indexed by a DatetimeIndex. One (df1) is missing some timestamps, while the other (df2) is complete (regular timestamps with no gaps in the series) and filled with NaN.

I'm trying to align the values from df1 to the index of df2, with NaN wherever a timestamp doesn't exist in df1.

Example:

In  [51]: df1
Out [51]:                       value
          2015-01-01 14:00:00   20
          2015-01-01 15:00:00   29
          2015-01-01 16:00:00   41
          2015-01-01 17:00:00   43
          2015-01-01 18:00:00   26
          2015-01-01 19:00:00   20
          2015-01-01 20:00:00   31
          2015-01-01 21:00:00   35
          2015-01-01 22:00:00   39
          2015-01-01 23:00:00   17
          2015-03-01 00:00:00   6
          2015-03-01 01:00:00   37
          2015-03-01 02:00:00   56
          2015-03-01 03:00:00   12
          2015-03-01 04:00:00   41
          2015-03-01 05:00:00   31
          ...   ...

          2018-12-25 23:00:00   41

          <34843 rows × 1 columns>

In  [52]: df2 = pd.DataFrame(data=None, index=pd.date_range(freq='60Min', start=df1.index.min(), end=df1.index.max()))
          df2['value'] = np.nan
          df2
Out [52]:                       value
          2015-01-01 14:00:00   NaN
          2015-01-01 15:00:00   NaN
          2015-01-01 16:00:00   NaN
          2015-01-01 17:00:00   NaN
          2015-01-01 18:00:00   NaN
          2015-01-01 19:00:00   NaN
          2015-01-01 20:00:00   NaN
          2015-01-01 21:00:00   NaN
          2015-01-01 22:00:00   NaN
          2015-01-01 23:00:00   NaN
          2015-01-02 00:00:00   NaN
          2015-01-02 01:00:00   NaN
          2015-01-02 02:00:00   NaN
          2015-01-02 03:00:00   NaN
          2015-01-02 04:00:00   NaN
          2015-01-02 05:00:00   NaN
          ...                   ...
          2018-12-25 23:00:00   NaN

          <34906 rows × 1 columns>

Both df2.combine_first(df1) and df1.reindex(index=df2.index) return the same data: the gaps, where there shouldn't be any data, are filled with some value instead of NaN.

In  [53]: Result = df2.combine_first(df1)
          Result
Out [53]:                       value
          2015-01-01 14:00:00   20
          2015-01-01 15:00:00   29
          2015-01-01 16:00:00   41
          2015-01-01 17:00:00   43
          2015-01-01 18:00:00   26
          2015-01-01 19:00:00   20
          2015-01-01 20:00:00   31
          2015-01-01 21:00:00   35
          2015-01-01 22:00:00   39
          2015-01-01 23:00:00   17
          2015-01-02 00:00:00   35
          2015-01-02 01:00:00   53
          2015-01-02 02:00:00   28
          2015-01-02 03:00:00   48
          2015-01-02 04:00:00   42
          2015-01-02 05:00:00   51
          ...                   ...
          2018-12-25 23:00:00   41

          <34906 rows × 1 columns>

This is what I was hoping to get:

Out [53]:                       value
          2015-01-01 14:00:00   20
          2015-01-01 15:00:00   29
          2015-01-01 16:00:00   41
          2015-01-01 17:00:00   43
          2015-01-01 18:00:00   26
          2015-01-01 19:00:00   20
          2015-01-01 20:00:00   31
          2015-01-01 21:00:00   35
          2015-01-01 22:00:00   39
          2015-01-01 23:00:00   17
          2015-01-02 00:00:00   NaN
          2015-01-02 01:00:00   NaN
          2015-01-02 02:00:00   NaN
          2015-01-02 03:00:00   NaN
          2015-01-02 04:00:00   NaN
          2015-01-02 05:00:00   NaN
          ...                   ...
          2018-12-25 23:00:00   41

          <34906 rows × 1 columns>

Could someone shed some light on why this is happening, and how to control how these values are filled?

tg359x asked Nov 09 '22 00:11


1 Answer

IIUC, you need to resample df1, because it has an irregular frequency and you need a regular one:

print(df1.index.freq)
None

print(Result.index.freq)
<60 * Minutes>
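
The frequency check above can be reproduced on a small frame. This is only a sketch under stated assumptions: the timestamps and values are made up for illustration, and it uses the current pandas API, where resample() returns a Resampler that needs a follow-up call such as asfreq().

```python
import pandas as pd

# Hypothetical hourly data with a gap: 03:00 and 04:00 are missing.
idx = pd.to_datetime([
    "2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 02:00",
    "2015-01-01 05:00",
])
df1 = pd.DataFrame({"value": [52, 5, 12, 54]}, index=idx)

# An index built from irregular timestamps carries no frequency.
print(df1.index.freq)

# Resample to hourly bins and take the values as they are;
# hours without data become NaN rows.
result = df1.resample("60min").asfreq()
print(result.index.freq)
print(result)
```

The resampled frame has one row per hour between the first and last timestamp, so the two missing hours show up as NaN.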

EDIT1
You can use the asfreq function instead of resample (see the docs and the comparison of resample vs asfreq).
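
To make the difference concrete, here is a small sketch comparing the two on made-up data (the timestamps, values, and the names by_asfreq and by_resample are assumptions for illustration only). As far as I understand it, asfreq only picks the values that sit exactly on the new hourly grid, while resample groups all rows into hourly bins and then aggregates them.

```python
import pandas as pd

# Hypothetical data: a gap at 02:00 and 03:00, plus one off-grid reading at 01:30.
idx = pd.to_datetime([
    "2015-01-01 00:00", "2015-01-01 01:00",
    "2015-01-01 01:30", "2015-01-01 04:00",
])
df1 = pd.DataFrame({"value": [10, 20, 40, 30]}, index=idx)

# asfreq: conform to an hourly grid; the 01:30 row is not on the grid and is dropped.
by_asfreq = df1.asfreq("60min")

# resample + mean: the 01:00 bin averages the 01:00 and 01:30 readings.
by_resample = df1.resample("60min").mean()

print(by_asfreq)
print(by_resample)
```

Both results contain NaN for 02:00 and 03:00. They only differ in how the 01:00 hour is handled, which does not matter for data that is already on the hourly grid, as in the question.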

EDIT2
At first I thought that resample didn't work, because after resampling the head of Result looks the same as df1. But df1.info() and Result.info() give different results: 34857 entries vs 34920 entries. So I looked for rows with NaN values, and there are 63 of them.

So I think resample works well.

import pandas as pd

df1 = pd.read_csv('test/GapInTimestamps.csv', sep=",", index_col=[0], parse_dates=[0])
print(df1.head())

#                     value
#Date/Time                 
#2015-01-01 00:00:00     52
#2015-01-01 01:00:00      5
#2015-01-01 02:00:00     12
#2015-01-01 03:00:00     54
#2015-01-01 04:00:00     47
print(df1.info())

#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34857 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Data columns (total 1 columns):
#value    34857 non-null int64
#dtypes: int64(1)
#memory usage: 544.6 KB
#None

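# in current pandas, resample() returns a Resampler; asfreq() builds the regular hourly frame
Result = df1.resample('60min').asfreq()
print(Result.head())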

#                     value
#Date/Time                 
#2015-01-01 00:00:00     52
#2015-01-01 01:00:00      5
#2015-01-01 02:00:00     12
#2015-01-01 03:00:00     54
#2015-01-01 04:00:00     47
print(Result.info())

#<class 'pandas.core.frame.DataFrame'>
#DatetimeIndex: 34920 entries, 2015-01-01 00:00:00 to 2018-12-25 23:00:00
#Freq: 60T
#Data columns (total 1 columns):
#value    34857 non-null float64
#dtypes: float64(1)
#memory usage: 545.6 KB
#None

#find rows with NaN values
resultnan = Result[Result.isnull().any(axis=1)]
#temporarily display up to 999 rows and 15 columns
with pd.option_context('display.max_rows', 999, 'display.max_columns', 15):
    print(resultnan)

#                     value
#Date/Time                 
#2015-01-13 19:00:00    NaN
#2015-01-13 20:00:00    NaN
#2015-01-13 21:00:00    NaN
#2015-01-13 22:00:00    NaN
#2015-01-13 23:00:00    NaN
#2015-01-14 00:00:00    NaN
#2015-01-14 01:00:00    NaN
#2015-01-14 02:00:00    NaN
#2015-01-14 03:00:00    NaN
#2015-01-14 04:00:00    NaN
#2015-01-14 05:00:00    NaN
#2015-01-14 06:00:00    NaN
#2015-01-14 07:00:00    NaN
#2015-01-14 08:00:00    NaN
#2015-01-14 09:00:00    NaN
#2015-02-01 00:00:00    NaN
#2015-02-01 01:00:00    NaN
#2015-02-01 02:00:00    NaN
#2015-02-01 03:00:00    NaN
#2015-02-01 04:00:00    NaN
#2015-02-01 05:00:00    NaN
#2015-02-01 06:00:00    NaN
#2015-02-01 07:00:00    NaN
#2015-02-01 08:00:00    NaN
#2015-02-01 09:00:00    NaN
#2015-02-01 10:00:00    NaN
#2015-02-01 11:00:00    NaN
#2015-02-01 12:00:00    NaN
#2015-02-01 13:00:00    NaN
#2015-02-01 14:00:00    NaN
#2015-02-01 15:00:00    NaN
#2015-02-01 16:00:00    NaN
#2015-02-01 17:00:00    NaN
#2015-02-01 18:00:00    NaN
#2015-02-01 19:00:00    NaN
#2015-02-01 20:00:00    NaN
#2015-02-01 21:00:00    NaN
#2015-02-01 22:00:00    NaN
#2015-02-01 23:00:00    NaN
#2015-11-01 00:00:00    NaN
#2015-11-01 01:00:00    NaN
#2015-11-01 02:00:00    NaN
#2015-11-01 03:00:00    NaN
#2015-11-01 04:00:00    NaN
#2015-11-01 05:00:00    NaN
#2015-11-01 06:00:00    NaN
#2015-11-01 07:00:00    NaN
#2015-11-01 08:00:00    NaN
#2015-11-01 09:00:00    NaN
#2015-11-01 10:00:00    NaN
#2015-11-01 11:00:00    NaN
#2015-11-01 12:00:00    NaN
#2015-11-01 13:00:00    NaN
#2015-11-01 14:00:00    NaN
#2015-11-01 15:00:00    NaN
#2015-11-01 16:00:00    NaN
#2015-11-01 17:00:00    NaN
#2015-11-01 18:00:00    NaN
#2015-11-01 19:00:00    NaN
#2015-11-01 20:00:00    NaN
#2015-11-01 21:00:00    NaN
#2015-11-01 22:00:00    NaN
#2015-11-01 23:00:00    NaN
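
As a cross-check, the missing timestamps can also be listed directly from the index, without looking for NaN rows. This is a minimal sketch on made-up data (full_index and missing are illustrative names, not from the question); it also shows that a plain reindex leaves NaN for every timestamp that is absent from df1.

```python
import pandas as pd

# Hypothetical data: 2015-01-02 00:00 and 01:00 are missing.
idx = pd.to_datetime([
    "2015-01-01 22:00", "2015-01-01 23:00",
    "2015-01-02 02:00", "2015-01-02 03:00",
])
df1 = pd.DataFrame({"value": [39, 17, 56, 12]}, index=idx)

# Complete hourly index over the same span, as in the question.
full_index = pd.date_range(start=df1.index.min(), end=df1.index.max(), freq="60min")

# Timestamps that are in the complete index but not in df1.
missing = full_index.difference(df1.index)
print(missing)

# reindex without a fill method puts NaN at exactly those timestamps.
result = df1.reindex(full_index)
print(result)
```

If a timestamp shows a value after reindexing, it was presumably present in df1 to begin with, and the gaps are somewhere else in the data, which matches the NaN rows listed above.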
jezrael answered Nov 15 '22 07:11