I have a time-series DataFrame and I want to replicate each of my 200 features/columns as additional lagged features. So at the moment I have features at time t and want to create features at timestep t-1, t-2 and so on.
I know this is best done with df.shift(), but I'm having trouble putting it all together. I also want to rename the columns to 'feature (t-1)', 'feature (t-2)', and so on.
My pseudo-code attempt would be something like:
lagged_values = [1, 2, 3, 10]
for every lag in lagged_values:
    for every column:
        make a new feature column with df[column].shift(lag)
        name the new column 'original col name' + ' (t-lag)'
In the end if I have 200 columns and 4 lagged timesteps I would have a new df with 1,000 features (200 each at t, t-1, t-2, t-3 and t-10).
I found something similar on Machine Learning Mastery, but it doesn't keep the original column names (it renames them to var1, var2, etc.). Unfortunately, I don't understand it well enough to adapt it to my problem.
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
You can use the shift() method in pandas to create a column that holds the lagged values of another column. The argument to shift() is the number of periods (rows) to shift by.
shift() moves a column's values down (or up) by a given number of rows, which is also what you need when subtracting the previous row's value from the current one. Its periods parameter is an integer giving the number of shifts to make along the chosen axis.
Create lag variables using shift(). shift(1) creates a lag of one record, while shift(5) creates a lag of five records. By default the lag is positional, based on the prior rows, but shift() also accepts a freq argument (a time offset) that shifts the index by that amount of time instead.
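To make the difference between a positional lag and a time-offset shift concrete, here is a minimal sketch. The daily series and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical daily series, used only for illustration.
idx = pd.date_range("2024-01-01", periods=4, freq="D")
s = pd.Series([10, 20, 30, 40], index=idx, name="feature")

# Positional lag: values move down one row, the index is unchanged,
# and the first row becomes NaN.
lag_rows = s.shift(1)

# Time-based shift: the index moves forward by one day and the values
# stay attached to their (shifted) timestamps, so no NaN is introduced.
lag_time = s.shift(1, freq="D")
```

The positional form is usually what you want for lagged feature columns, since the result stays aligned with the original index.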
In Python, the pandas library lets you create lags and leads of a column in a few lines of code. A lag shifts a column down by a given number of rows (shift(n)); a lead shifts it up by a given number of rows (shift(-n)).
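As a small illustration of lag versus lead, assuming a toy single-column frame (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Lag: each row sees the value from one row earlier (first row is NaN).
df["x_lag1"] = df["x"].shift(1)

# Lead: each row sees the value from one row later (last row is NaN).
df["x_lead1"] = df["x"].shift(-1)
```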
You can create the additional columns with a dictionary comprehension and then add them to your DataFrame via assign. Iterating over the columns first keeps each feature's lags next to each other in the result.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))  # random data, your values will differ
>>> lags = range(1, 3)  # Just two lags for demonstration.
>>> df.assign(**{
...     f'{col} (t-{lag})': df[col].shift(lag)
...     for col in df
...     for lag in lags
... })
          A         B   A (t-1)   A (t-2)   B (t-1)   B (t-2)
0 -0.773571  1.945746       NaN       NaN       NaN       NaN
1  1.375648  0.058043 -0.773571       NaN  1.945746       NaN
2  0.727642  1.802386  1.375648 -0.773571  0.058043  1.945746
3 -2.427135 -0.780636  0.727642  1.375648  1.802386  0.058043
4  1.542809 -0.620816 -2.427135  0.727642 -0.780636  1.802386
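For the lag list from the question ([1, 2, 3, 10]) and a couple of hundred columns, one alternative is to shift the whole frame once per lag and concatenate the pieces. The sketch below is only one possible approach; the helper name add_lagged_features and the small random frame are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd

def add_lagged_features(df, lags):
    """Return df plus one shifted copy of every column per lag.

    Hypothetical helper: new columns are named '<original> (t-<lag>)'.
    """
    # Shift the whole frame once per lag and tag the column names.
    lagged = [df.shift(lag).add_suffix(f" (t-{lag})") for lag in lags]
    # A single concat avoids inserting columns one at a time.
    return pd.concat([df, *lagged], axis=1)

# Small made-up frame standing in for the 200-feature DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20, 3)), columns=["A", "B", "C"])

wide = add_lagged_features(df, lags=[1, 2, 3, 10])
```

With 3 columns and 4 lags this yields 15 columns (the originals plus 3 per lag); the first rows of each lagged column are NaN and can be dropped with dropna() if needed.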