I have dataframes with DateTime indices of various types (could be weekly, monthly, annual data). I want to generate columns that are lagged values of other columns. I get these imported from a spreadsheet, I'm not generating the datetime index inside python.
I'm struggling to find the 'pythonic' way of doing this. I figure if I use Pandas' datetime capability, the lagging might be more robust in the case of weird or exceptional data.
I made a toy example that seems to work, but it fails on my real-world example.
The toy example which correctly works (makes a new column that has the 'foo' value of the previous month)
rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
toy2 = pd.DataFrame(pd.Series(np.random.randint(0, 50, len(rng)), index=rng, name="foo"))
foo
2012-01-31 4
2012-02-29 2
2012-03-31 27
2012-04-30 7
2012-05-31 44
2012-06-30 22
2012-07-31 16
2012-08-31 18
2012-09-30 35
2012-10-31 35
2012-11-30 16
2012-12-31 32
toy2['lag_foo']= toy2['foo'].shift(1,'m')
foo lag_foo
2012-01-31 4 NaN
2012-02-29 2 4.0
2012-03-31 27 2.0
2012-04-30 7 27.0
2012-05-31 44 7.0
2012-06-30 22 44.0
2012-07-31 16 22.0
2012-08-31 18 16.0
2012-09-30 35 18.0
2012-10-31 35 35.0
2012-11-30 16 35.0
2012-12-31 32 16.0
But when I run this on my real-life example, it fails with:
ValueError: cannot reindex from a duplicate axis
print type(toy)
print toy.columns
print toy['IPE m2'][0:5]
<class 'pandas.core.frame.DataFrame'>
Index([u'IPE m2'], dtype='object')
Date
2016-04-30 43.29
2016-03-31 40.44
2016-02-29 34.17
2016-01-31 32.47
2015-12-31 39.35
Name: IPE m2, dtype: float64
The exception trace:
ValueError Traceback (most recent call last)
<ipython-input-170-9cb57a2ed681> in <module>()
----> 1 toy['prev_1m']= toy['IPE m2'].shift(1,'m')
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
2355 else:
2356 # set column
-> 2357 self._set_item(key, value)
2358
2359 def _setitem_slice(self, key, value):
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
2421
2422 self._ensure_valid_index(value)
-> 2423 value = self._sanitize_column(key, value)
2424 NDFrame._set_item(self, key, value)
2425
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _sanitize_column(self, key, value)
2555
2556 if isinstance(value, Series):
-> 2557 value = reindexer(value)
2558
2559 elif isinstance(value, DataFrame):
C:\Users\mds\Anaconda2\lib\site-packages\pandas\core\frame.pyc in reindexer(value)
2547 # duplicate axis
2548 if not value.index.is_unique:
-> 2549 raise e
2550
2551 # other
ValueError: cannot reindex from a duplicate axis
Seems like I'm missing some subtlety of Pandas datetime indices I think. Plus I'm not even sure this is the ideal way to do this. the only thing I could suspect is that the non-working toy.index has None as the freq, while the working toy2 example, has its frequency set as 'M'
toy.index
DatetimeIndex(['2016-04-30', '2016-03-31', '2016-02-29', '2016-01-31',
'2015-12-31', '2015-11-30', '2015-10-31', '2015-09-30',
'2015-08-31', '2015-07-31',
...
'NaT', 'NaT', 'NaT', 'NaT',
'NaT', 'NaT', 'NaT', 'NaT',
'NaT', 'NaT'],
dtype='datetime64[ns]', name=u'Date', length=142, freq=None)
toy2.index
DatetimeIndex(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
'2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31',
'2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31'],
dtype='datetime64[ns]', freq='M')
In [ ]:
===========================
I threw away the NaT's
toy = toy.dropna()
toy['prev_1m']= toy['IPE m2'].shift(1,'m')
and I do get the results I wanted. However, I also get a warning:
C:\Users\mds\Anaconda2\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
if __name__ == '__main__':
this way of assignment suppresses the warnings:
toy.loc[:,'prev_1m2']= toy['IPE m2'].shift(1,'m')
There is another problem - many NaT
in index in toy
DataFrame
, so index
has duplicates values. (Maybe some datetime are duplicated too.)
Sample:
import pandas as pd
import numpy as np
rng = pd.date_range('2012-01-01', '2013-1-01', freq="M")
toy2 = pd.DataFrame(pd.Series(np.random.randint(0, 50, len(rng)), index=rng, name="foo"))
df = pd.DataFrame({'foo': [10,30,19]}, index=[np.nan, np.nan, np.nan])
print (df)
foo
NaN 10
NaN 30
NaN 19
toy2 = pd.concat([toy2, df])
print (toy2)
foo
2012-01-31 18
2012-02-29 34
2012-03-31 43
2012-04-30 17
2012-05-31 45
2012-06-30 8
2012-07-31 36
2012-08-31 26
2012-09-30 5
2012-10-31 18
2012-11-30 39
2012-12-31 3
NaT 10
NaT 30
NaT 19
toy2['lag_foo']= toy2['foo'].shift(1,'m')
print (toy2)
ValueError: cannot reindex from a duplicate axis
One possible solution can be omit parameter freq=m
:
toy2['lag_foo']= toy2['foo'].shift(1)
print (toy2)
foo lag_foo
2012-01-31 21 NaN
2012-02-29 13 21.0
2012-03-31 41 13.0
2012-04-30 38 41.0
2012-05-31 15 38.0
2012-06-30 41 15.0
2012-07-31 30 41.0
2012-08-31 18 30.0
2012-09-30 12 18.0
2012-10-31 35 12.0
2012-11-30 23 35.0
2012-12-31 7 23.0
NaT 10 7.0
NaT 30 10.0
NaT 19 30.0
If need remove all records with NaN
(NaT
) in index
, use notnull
with boolean indexing
:
print (toy2)
foo
2012-01-31 41
2012-02-29 15
2012-03-31 8
2012-04-30 2
2012-05-31 16
2012-06-30 43
2012-07-31 2
2012-08-31 15
2012-09-30 3
2012-10-31 46
2012-11-30 34
2012-12-31 36
NaT 10
NaT 30
NaT 19
toy2 = toy2[pd.notnull(toy2.index)]
toy2['lag_foo']= toy2['foo'].shift(1, 'm')
print (toy2)
foo lag_foo
2012-01-31 41 NaN
2012-02-29 15 41.0
2012-03-31 8 15.0
2012-04-30 2 8.0
2012-05-31 16 2.0
2012-06-30 43 16.0
2012-07-31 2 43.0
2012-08-31 15 2.0
2012-09-30 3 15.0
2012-10-31 46 3.0
2012-11-30 34 46.0
2012-12-31 36 34.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With