Example
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
print(s)

1    5
2    4
3    3
4    2
5    1
Is there an efficient way to create a Series containing, in each row, the lagged values (in this example up to lag 2)? E.g.:
3    [3, 4, 5]
4    [2, 3, 4]
5    [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done efficiently for DataFrames containing many very long time series?
Thanks
Edited after seeing the answers
OK, in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif type(s) is pd.Series:
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired output and manages the naming of the columns in the resulting DataFrame.
For a Series as input:
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
res = buildLaggedFeatures(s, lag=2, dropna=False)

   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3
and for a DataFrame as input:
s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
res2 = buildLaggedFeatures(s2, lag=2, dropna=True)

   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30
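For reference, the same result can be obtained more compactly with DataFrame.shift and add_suffix. The helper below, build_lagged_frame, is my own sketch rather than part of the original post, and the column order differs slightly from the output above (all original columns first, then the lags grouped by lag order):

import pandas as pd

def build_lagged_frame(df, lag=2, dropna=True):
    # Concatenate the original columns with shifted copies, suffixing each lag.
    parts = [df] + [df.shift(l).add_suffix('_lag%d' % l) for l in range(1, lag + 1)]
    res = pd.concat(parts, axis=1)
    return res.dropna() if dropna else res

s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
print(build_lagged_frame(s2, lag=2))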
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
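If the lagged columns are only needed to compute a window statistic, the rolling machinery avoids materializing the shifted copies at all. Here is a minimal sketch using the modern .rolling() accessor (the older pd.rolling_* functions have since been removed); the mean is just an illustrative choice of statistic:

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# Mean over each window of the last 3 observations, without building lagged columns.
print(s.rolling(window=3).mean())
# 1    NaN
# 2    NaN
# 3    4.0
# 4    3.0
# 5    2.0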
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
   0   1   2
1  5 NaN NaN
2  4   5 NaN
3  3   4   5
4  2   3   4
5  1   2   3

In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3
Doing work on this will be more efficient than on lists...
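As a rough illustration of that point, here is a minimal sketch that feeds the lagged matrix straight into a vectorized least-squares fit with numpy.linalg.lstsq; the choice of target (the unshifted column) and regressors (the two lags) is my own and not part of the original answer:

import numpy as np
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# Lagged feature matrix as built above, with named columns for clarity.
lagged = pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
lagged.columns = ['y', 'lag_1', 'lag_2']

# Ordinary least squares of y on its own lags, done on the underlying ndarray.
X = np.column_stack([np.ones(len(lagged)), lagged[['lag_1', 'lag_2']].values])
y = lagged['y'].values
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # intercept, lag_1 coefficient, lag_2 coefficient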
A very simple solution using a pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want numpy arrays with no null values:
df.dropna().values
For Python 3.x, change xrange to range:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

print(df)

   vals  lag_1  lag_2  lag_3
0     5    NaN    NaN    NaN
1     4    5.0    NaN    NaN
2     3    4.0    5.0    NaN
3     2    3.0    4.0    5.0
4     1    2.0    3.0    4.0