Example
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
print(s)

1    5
2    4
3    3
4    2
5    1
Is there an efficient way to create a Series containing, in each row, the lagged values (in this example up to lag 2)? E.g.:
3    [3, 4, 5]
4    [2, 3, 4]
5    [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done efficiently for DataFrames containing many very long time series?
Thanks
Edited after seeing the answers
OK, in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif type(s) is pd.Series:
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired output and manages the naming of the columns in the resulting DataFrame.
For a Series as input:
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
res = buildLaggedFeatures(s, lag=2, dropna=False)

   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3
and for a DataFrame as input:
s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
res2 = buildLaggedFeatures(s2, lag=2, dropna=True)

   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30
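For reference, the same result can be obtained more compactly with DataFrame.shift and add_suffix. The helper below, build_lagged_frame, is my own sketch rather than part of the original post, and the column order differs slightly from the output above (all original columns first, then the lags grouped by lag order):

import pandas as pd

def build_lagged_frame(df, lag=2, dropna=True):
    # Concatenate the original columns with shifted copies, suffixing each lag.
    parts = [df] + [df.shift(l).add_suffix('_lag%d' % l) for l in range(1, lag + 1)]
    res = pd.concat(parts, axis=1)
    return res.dropna() if dropna else res

s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
print(build_lagged_frame(s2, lag=2))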
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
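If the lagged columns are only needed to compute a window statistic, the rolling machinery avoids materializing the shifted copies at all. Here is a minimal sketch using the modern .rolling() accessor (the older pd.rolling_* functions have since been removed); the mean is just an illustrative choice of statistic:

import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# Mean over each window of the last 3 observations, without building lagged columns.
print(s.rolling(window=3).mean())
# 1    NaN
# 2    NaN
# 3    4.0
# 4    3.0
# 5    2.0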
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
   0   1   2
1  5 NaN NaN
2  4   5 NaN
3  3   4   5
4  2   3   4
5  1   2   3

In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3
Doing work on this will be more efficient than on lists...
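As a rough illustration of that point, here is a minimal sketch that feeds the lagged matrix straight into a vectorized least-squares fit with numpy.linalg.lstsq; the choice of target (the unshifted column) and regressors (the two lags) is my own and not part of the original answer:

import numpy as np
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# Lagged feature matrix as built above, with named columns for clarity.
lagged = pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
lagged.columns = ['y', 'lag_1', 'lag_2']

# Ordinary least squares of y on its own lags, done on the underlying ndarray.
X = np.column_stack([np.ones(len(lagged)), lagged[['lag_1', 'lag_2']].values])
y = lagged['y'].values
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # intercept, lag_1 coefficient, lag_2 coefficient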
A very simple solution using a pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

# if you want numpy arrays with no null values:
df.dropna().values
For Python 3.x, change xrange to range:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)

print(df)

   vals  lag_1  lag_2  lag_3
0     5    NaN    NaN    NaN
1     4    5.0    NaN    NaN
2     3    4.0    5.0    NaN
3     2    3.0    4.0    5.0
4     1    2.0    3.0    4.0