Search in pandas dataframe

Question

Potentially a slightly misleading title but the problem is this:

I have a large dataframe with multiple columns. This looks a bit like

df = 
id   date        value
 A   01-01-2015    1.0
 A   03-01-2015    1.2
 ...
 B   01-01-2015    0.8
 B   02-01-2015    0.8
 ...

What I want to do is, within each id, find the date one week earlier and place the value from that date into, e.g., a 'lagvalue' column. The problem is that not all dates exist for all ids, so a simple .shift(7) won't pull the correct value [in this instance I guess I should put a NaN in].

I can do this with a lot of horrible iterating over the dates and ids to find the value. For example, a rough idea:

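import datetime

# For every row, select the rows with the same id dated exactly one week earlier
[
  df[
    (df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1))
    & (df['id'] == df['id'].iloc[i])
  ]['value']
  for i in range(len(df.index))
]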

but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.

I could write a function that does a groupby on the id and then searches within each group, and I'm certain that would reduce the time the operation takes. Is there a much quicker, simpler way [aka am I having a dim day]?

Kevin S · Accepted Answer

Basic strategy is, for each id, to:

Use date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to carry the last observed value forward into the gaps. I'm not sure if you want this, or possibly bfill, which will instead use the next observed value after the target date, i.e. one from less than a week in the past. Either way it is simple to change. Alternatively, if you want NaN when no value is available exactly 7 days in the past, you can just remove the *fill completely.
Drop unneeded data

This algorithm gives NaN when the lagged date falls before the first date available for that id.
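If the exact-date behaviour from the question is wanted (NaN whenever the date seven days earlier is absent), the fill step can be left out. A minimal sketch for a single id, using made-up values and the hypothetical names `s`, `full` and `lag`:

```python
import pandas as pd

# Hypothetical observations for one id; 2015-01-02 is deliberately missing.
s = pd.Series(
    [1.0, 1.2, 0.9, 1.1],
    index=pd.to_datetime(["2015-01-01", "2015-01-03", "2015-01-08", "2015-01-09"]),
)

# Expand to a daily index; dates without an observation become NaN.
full = s.reindex(pd.date_range(s.index[0], s.index[-1]))

# On a daily index, shifting by 7 rows looks back exactly 7 days.
lag = full.shift(7)

# Keep only the dates that were actually observed.
lag = lag.loc[s.index]
```

Here only 2015-01-08 gets a lagged value (the one from 2015-01-01); 2015-01-09 stays NaN because 2015-01-02 was never observed.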

There are a few assumptions here. In particular, the dates are assumed to be unique within each id and sorted. If they are not sorted, use sort_values to sort by id and date. If there are duplicate dates, some rule will be needed to decide which value to use.
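As a rough sketch of that preparation step: the frame below is made up, and averaging duplicate dates is only one possible rule (an assumption here, not something the question specifies).

```python
import pandas as pd

# Hypothetical unsorted frame with a duplicated (id, date) pair for id A.
raw = pd.DataFrame({
    "id": ["B", "A", "A", "A"],
    "date": pd.to_datetime(["2015-01-01", "2015-01-03", "2015-01-01", "2015-01-03"]),
    "value": [0.8, 1.5, 1.0, 1.0],
})

# Sort so that dates are increasing within each id.
clean = raw.sort_values(["id", "date"])

# Assumed rule for duplicates: average the values sharing an (id, date) pair.
clean = clean.groupby(["id", "date"], as_index=False)["value"].mean()
```

Other rules (keeping the first or last row with drop_duplicates, for instance) would slot into the same place.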

import pandas as pd
import numpy as np

dates = pd.date_range('2001-01-01',periods=100)
dates = dates[::3]
A = pd.DataFrame({'date':dates,
                  'id':['A']*len(dates),
                  'value':np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01',periods=100)
dates = dates[::5]
B = pd.DataFrame({'date':dates,
                  'id':['B']*len(dates),
                  'value':np.random.randn(len(dates))})
df = pd.concat([A,B])

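with_lags = []
for _, group in df.groupby('id'):
    # Index by date and remember which dates were actually observed
    group = group.set_index(group.date)
    index = group.index
    # Expand to a daily index so that 7 rows correspond to 7 days
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    # Carry the last observed value forward (remove this for NaN on missing dates)
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    # Drop the filler rows again
    group = group.loc[index]
    with_lags.append(group)

# axis must be passed by keyword in recent pandas versions
with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])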
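A different technique from the reindex/shift loop above, shown only as a possible alternative: a self-merge on dates moved forward by seven days. It gives NaN whenever the exact date one week earlier is missing, and it assumes each (id, date) pair is unique. The data and the names `lagged` and `result` are made up for illustration.

```python
import pandas as pd

# Hypothetical data; id A has no row for 2015-01-02.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-08", "2015-01-09", "2015-01-01", "2015-01-08"]
    ),
    "value": [1.0, 1.2, 1.1, 0.8, 0.7],
})

# Copy of the data with every date moved 7 days forward, so each value
# lines up with the row that should receive it as its lag.
lagged = df[["id", "date", "value"]].rename(columns={"value": "lag_value"})
lagged["date"] = lagged["date"] + pd.Timedelta(days=7)

# A left join keeps every original row; rows without a match get NaN.
result = df.merge(lagged, on=["id", "date"], how="left")
```

With duplicate (id, date) pairs the merge would multiply rows, so the duplicates would need resolving first.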

Tags:

python

pandas

fūjin
