How to create a lagged data structure using pandas dataframe

Tags: python, pandas

Example

import pandas as pd

s = pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)

1    5
2    4
3    3
4    2
5    1

Is there an efficient way to create a Series containing, in each row, the lagged values (in this example up to lag 2)?

3    [3, 4, 5]
4    [2, 3, 4]
5    [1, 2, 3]

This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])

How can this be done efficiently for DataFrames containing many very long time series?

Thanks

Edited after seeing the answers

OK, in the end I implemented this function:

def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif type(s) is pd.Series:
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res

It produces the desired output and handles the naming of the columns in the resulting DataFrame.

For a Series as input:

s = pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res = buildLaggedFeatures(s, lag=2, dropna=False)

   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3

and for a DataFrame as input:

s2 = pd.DataFrame({'a':[5,4,3,2,1], 'b':[50,40,30,20,10]}, index=[1,2,3,4,5])
res2 = buildLaggedFeatures(s2, lag=2, dropna=True)

   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30
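For reference, a more compact variant (my own sketch, not part of the original post) builds the same kind of lagged DataFrame by concatenating shifted copies and suffixing their column names; note that the columns come out grouped by lag (a, b, a_lag1, b_lag1, ...) rather than by original column:

import pandas as pd

def laggedFeaturesConcat(df, lag=2, dropna=True):
    # hypothetical helper: one shifted copy of df per lag, each with a _lagN suffix,
    # glued side by side with the original columns
    pieces = [df] + [df.shift(l).add_suffix('_lag%d' % l) for l in range(1, lag + 1)]
    res = pd.concat(pieces, axis=1)
    return res.dropna() if dropna else res

s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
print(laggedFeaturesConcat(s2, lag=2))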
asked by Mannaggia



2 Answers

As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.

One solution is to concat shifted Series together to make a DataFrame:

In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
    0   1   2
1   5 NaN NaN
2   4   5 NaN
3   3   4   5
4   2   3   4
5   1   2   3

In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3
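The same idea generalizes to any number of lags with a list comprehension (a sketch along the lines of the function in the question, not part of the original answer):

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lags = 2

# concatenate the original series and its first `lags` shifts column-wise
lagged = pd.concat([s.shift(i) for i in range(lags + 1)], axis=1)
lagged.columns = ['lag_%d' % i for i in range(lags + 1)]
print(lagged.dropna())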

Doing work on this will be more efficient than on lists...
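For very long series, one way to avoid materializing lag+1 shifted copies (my own sketch, assuming NumPy >= 1.20 for sliding_window_view, not part of the original answer) is to take a strided window view of the underlying array; reversing each window puts the current value first, which reproduces the layout asked for in the question:

import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lag = 2

# each window holds lag+1 consecutive values in original order; [:, ::-1] flips
# them so the current value comes first, then lag 1, lag 2, ...
# (the window view itself is copy-free; the DataFrame constructor copies once)
windows = sliding_window_view(s.to_numpy(), lag + 1)[:, ::-1]
lagged = pd.DataFrame(windows, index=s.index[lag:],
                      columns=['lag_%d' % i for i in range(lag + 1)])
print(lagged)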

answered by Andy Hayden


A very simple solution using a pandas DataFrame:

number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want numpy arrays with no null values: df.dropna().values

For Python 3.x, change xrange to range:

number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

print(df)

   vals  lag_1  lag_2  lag_3
0     5    NaN    NaN    NaN
1     4    5.0    NaN    NaN
2     3    4.0    5.0    NaN
3     2    3.0    4.0    5.0
4     1    2.0    3.0    4.0
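As the comment in the answer suggests, dropping the incomplete rows and taking .values gives a plain NumPy array of the fully lagged rows, ready for e.g. a regression (a small usage sketch, not in the original answer; it repeats the setup from the snippet above):

import pandas as pd

number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# keep only the rows where every lag is available, as a NumPy array
X = df.dropna().values
print(X)
# [[2. 3. 4. 5.]
#  [1. 2. 3. 4.]]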
answered by ansonw