Example
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
print(s)

1    5
2    4
3    3
4    2
5    1

Is there an efficient way to create a Series containing, in each row, the lagged values (in this example up to lag 2)?

3    [3, 4, 5]
4    [2, 3, 4]
5    [1, 2, 3]

This corresponds to s = pd.Series([[3, 4, 5], [2, 3, 4], [1, 2, 3]], index=[3, 4, 5]).
How can this be done efficiently for DataFrames holding many very long time series?
Thanks
Edited after seeing the answers
OK, in the end I implemented this function:

def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create the lagged Series for this column
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)

    elif type(s) is pd.Series:
        # the unshifted series plus one shifted copy per lag, side by side
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]

    else:
        print('Only works for DataFrame or Series')
        return None

    if dropna:
        return res.dropna()
    else:
        return res

It produces the desired output and handles the naming of the columns in the resulting DataFrame.
For a Series as input:
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
res = buildLaggedFeatures(s, lag=2, dropna=False)

   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3

and for a DataFrame as input:
s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
res2 = buildLaggedFeatures(s2, lag=2, dropna=True)

   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30
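For comparison only (not part of the function above): the DataFrame branch can also be written as a single concat of whole-frame shifts. This is just a sketch; the name build_lagged_frame is made up here, and sort_index restores the per-column grouping of the output.

import pandas as pd

def build_lagged_frame(df, lag=2, dropna=True):
    # shift the whole frame once per lag, tag the new columns with a suffix,
    # then line everything up side by side and group columns per original name
    frames = [df] + [df.shift(l).add_suffix('_lag%d' % l) for l in range(1, lag + 1)]
    res = pd.concat(frames, axis=1).sort_index(axis=1)
    return res.dropna() if dropna else res

With s2 from above, build_lagged_frame(s2, lag=2) gives the same table as res2.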
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
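In current pandas the old rolling_* functions have been folded into the .rolling() method. A minimal sketch of that route, which aggregates over a trailing window instead of materialising lag columns:

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# mean over the current value and the two previous ones (window of 3),
# computed without building lag_0/lag_1/lag_2 columns explicitly
rolling_mean = s.rolling(window=3).mean()
print(rolling_mean)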
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
   0   1   2
1  5 NaN NaN
2  4   5 NaN
3  3   4   5
4  2   3   4
5  1   2   3

In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3

Doing work on this will be more efficient than on lists...
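The same pattern generalises to an arbitrary number of lags, and passing keys= to concat names the resulting columns. A sketch along the same lines, with the number of lags chosen arbitrarily here:

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lags = 2

# one shifted copy per lag; keys= becomes the column labels of the result
lagged = pd.concat([s.shift(i) for i in range(lags + 1)],
                   axis=1,
                   keys=['lag_%d' % i for i in range(lags + 1)])
print(lagged.dropna())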
A very simple solution using a pandas DataFrame:

number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want numpy arrays with no null values:
df.dropna().values

For Python 3.x, change xrange to range:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

print(df)

   vals  lag_1  lag_2  lag_3
0     5    NaN    NaN    NaN
1     4    5.0    NaN    NaN
2     3    4.0    5.0    NaN
3     2    3.0    4.0    5.0
4     1    2.0    3.0    4.0
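Since lagged columns like these are typically built to regress the current value on its own past, here is a hedged sketch of one way to feed them into an ordinary least-squares fit; the longer vals data and the variable names are invented for illustration:

import numpy as np
import pandas as pd

number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1, 0, -1, -2]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

clean = df.dropna()                                    # keep only rows where all lags exist
y = clean['vals'].values                               # target: the current value
X = clean[['lag_%d' % i for i in range(1, number_lags + 1)]].values
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)         # ordinary least-squares coefficients
print(coeffs)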