Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id date value
A 01-01-2015 1.0
A 03-01-2015 1.2
...
B 01-01-2015 0.8
B 02-01-2015 0.8
...
What I want to do is, within each id, identify the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem is that not all dates exist for all ids, so a simple .shift(7) won't pull the correct value [in this instance I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and ids to find the value. As a rough idea:
[
    df.loc[
        (df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1))
        & (df['id'] == df['id'].iloc[i]),
        'value',
    ]
    for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.
I could write a function using a groupby on the id and then look within each group, and I'm certain that would reduce the time the operation takes. Is there a much quicker, simpler way [aka am I having a dim day]?
Basic strategy is, for each id, to:
1. reindex to expand the data to include all dates
2. shift to shift 7 spots
3. ffill to do last value interpolation. I'm not sure if you want this, or possibly bfill, which instead uses the next available value, i.e. one less than a week in the past. Either is simple to change.
Alternatively, if you want NaN when no value is available exactly 7 days in the past, you can just remove the *fill completely. The algorithm gives NaN when the lagged date is too far in the past, i.e. before the first date for that id.
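To make the difference between filling and not filling concrete, here is a minimal sketch on a single id. The dates and values are made up for illustration, and the names s, daily, lag_exact and lag_ffill are not from the original post.

```python
import pandas as pd

# Toy series for one id; several days are missing on purpose.
s = pd.Series(
    [1.0, 1.2, 1.5, 1.7],
    index=pd.to_datetime(["2015-01-01", "2015-01-03", "2015-01-08", "2015-01-11"]),
)

# Expand to a gap-free daily index; the missing days become NaN.
daily = s.reindex(pd.date_range(s.index[0], s.index[-1], freq="D"))

# No fill: only dates with an observation exactly 7 days earlier get a value.
lag_exact = daily.shift(7).loc[s.index]

# Forward fill first: the most recent earlier observation is used instead.
lag_ffill = daily.ffill().shift(7).loc[s.index]

print(pd.DataFrame({"value": s, "lag_exact": lag_exact, "lag_ffill": lag_ffill}))
```

Here 2015-01-08 picks up the value from 2015-01-01 in both variants, while 2015-01-11 has no observation on 2015-01-04, so it is NaN without the fill and takes the 2015-01-03 value with ffill.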
There are a few assumptions here, in particular that the dates are unique within each id and that they are sorted. If they are not sorted, use sort_values to sort by id and date. If there are duplicate dates, some rule will be needed to decide which value to use.
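As a rough sketch of that preparation step, assuming that averaging is an acceptable rule for duplicated dates (this rule is my assumption, not something the question specifies, and the data is invented):

```python
import pandas as pd

# Hypothetical unsorted data with one duplicated (id, date) pair.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2015-01-03", "2015-01-01", "2015-01-03", "2015-01-01"]),
    "value": [1.2, 1.0, 1.4, 0.8],
})

# Sort so that each id's rows are in date order.
df = df.sort_values(["id", "date"])

# Assumed rule: average the values that share an (id, date) pair.
# df.drop_duplicates(["id", "date"], keep="last") would keep one row instead.
clean = df.groupby(["id", "date"], as_index=False)["value"].mean()

print(clean)
```

Which rule is right depends on what a duplicate means in your data, so treat the mean here as a placeholder.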
import pandas as pd
import numpy as np
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})
dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})
df = pd.concat([A, B])
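with_lags = []
for id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index  # remember the original dates
    # expand to a daily index so that 7 rows correspond to 7 days
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()  # remove this line to get NaN when the exact date is missing
    group['lag_value'] = group.value.shift(7)
    group = group.loc[index]  # keep only the original dates
    with_lags.append(group)
with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])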
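If you only ever want exact matches (NaN whenever the date one week earlier is missing), a different technique avoids the loop altogether: a self-merge on a shifted date. This is a sketch of that alternative, not the reindex approach above. It assumes (id, date) pairs are unique, and the sample values and the names lagged and result are made up.

```python
import pandas as pd

# Small frame in the shape of the question's data (values are invented).
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-03", "2015-01-08", "2015-01-01", "2015-01-09"]
    ),
    "value": [1.0, 1.2, 1.5, 0.8, 0.9],
})

# Copy of the observations, re-dated one week later: each row now sits on
# the date for which it is the "one week earlier" value.
lagged = df[["id", "date", "value"]].rename(columns={"value": "lag_value"})
lagged["date"] = lagged["date"] + pd.Timedelta(weeks=1)

# Left join keeps every original row; rows without an exact match get NaN.
result = df.merge(lagged, on=["id", "date"], how="left")

print(result)
```

Only the A row on 2015-01-08 finds a partner (the A value from 2015-01-01); every other row gets NaN in lag_value. If (id, date) pairs are not unique, the merge will duplicate rows, so deduplicate first.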