Search in pandas dataframe

Tags: python, pandas

Potentially a slightly misleading title but the problem is this:

I have a large dataframe with multiple columns. This looks a bit like

df = 
id   date        value
 A   01-01-2015    1.0
 A   03-01-2015    1.2
 ...
 B   01-01-2015    0.8
 B   02-01-2015    0.8
 ...

What I want to do is, within each id, find the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem is that not all dates exist for all ids, so a simple .shift(7) won't pull the correct value [where the earlier date is missing, I guess I should put a NaN in].

I can do this with a lot of horrible iterating over the dates and ids to find the value. As a rough idea:

import datetime

# For each row i, select 'value' where the id matches and the date is one week earlier
[
  df.loc[
    (df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1))
    & (df['id'] == df['id'].iloc[i]),
    'value'
  ]
  for i in range(len(df.index))
]

but I'm certain there is a 'better' way to do it that cuts down on time and processing that I just can't think of right now.

I could write a function using a groupby on the id and then look within each group, and I'm certain that would reduce the time the operation takes. Is there a much quicker, simpler way [aka am I having a dim day]?

asked Apr 27 '26 20:04 by fūjin


1 Answer

Basic strategy is, for each id, to:

  • Use date index
  • Use reindex to expand the data to include all dates
  • Use shift to shift 7 spots
  • Use ffill for last-value interpolation, so the lag is the most recent value on or before the date 7 days earlier. I'm not sure if you want this, or possibly bfill, which instead uses the first value observed after that date, i.e. less than a week in the past. Either is simple to change. Alternatively, if you want NaN when no value exists exactly 7 days in the past, just remove the *fill completely.
  • Drop unneeded data

With the fill in place, this algorithm gives NaN only when the lagged date falls before the first date recorded for that id.
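
To make the fill behaviour concrete, here is a minimal sketch on a made-up four-row series (the dates and values are illustrative only, not taken from the question). It reindexes to daily frequency and compares the 7-day lag with and without ffill.

```python
import pandas as pd

# Hypothetical toy data: observations on Jan 1, 4, 8 and 12 only.
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2015-01-01", "2015-01-04", "2015-01-08", "2015-01-12"]),
)

# Expand to one row per calendar day; unobserved days become NaN.
daily = s.reindex(pd.date_range(s.index[0], s.index[-1]))

# No fill: the lag is NaN unless the exact date 7 days earlier was observed.
lag_exact = daily.shift(7).loc[s.index]

# ffill: the lag is the most recent observation on or before the date 7 days earlier.
lag_ffill = daily.ffill().shift(7).loc[s.index]

print(pd.DataFrame({"value": s, "lag_exact": lag_exact, "lag_ffill": lag_ffill}))
```

For Jan 12 the date 7 days earlier is Jan 5, which was not observed: the unfilled lag is NaN, while the ffill version carries forward the Jan 4 value.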

There are a few assumptions here. In particular, the dates are assumed to be unique within each id and sorted. If they are not sorted, use sort_values to sort by id and date. If there are duplicate dates, some rule will be needed to decide which value to use.
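
A minimal sketch of checking those two assumptions before computing lags. The frame below is hypothetical, and keeping the last row per (id, date) pair is just one possible rule, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical frame: unsorted, with one duplicated (id, date) pair.
df = pd.DataFrame({
    "id": ["B", "A", "A", "A"],
    "date": pd.to_datetime(["2015-01-01", "2015-01-03", "2015-01-01", "2015-01-03"]),
    "value": [0.8, 1.2, 1.0, 1.5],
})

# Sort by id, then date, so each group is in chronological order.
df = df.sort_values(["id", "date"]).reset_index(drop=True)

# True if any (id, date) pair appears more than once.
has_dupes = df.duplicated(["id", "date"]).any()

# One possible rule (an assumption): keep the last row of each pair.
deduped = df.drop_duplicates(["id", "date"], keep="last")
```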

import pandas as pd
import numpy as np

# Sample data: id A observed every 3rd day, id B every 5th day
dates = pd.date_range('2001-01-01', periods=100)[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01', periods=100)[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})
df = pd.concat([A, B])

with_lags = []
for group_id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index  # remember the originally observed dates
    # Expand to one row per calendar day; missing days become NaN
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()  # remove this line to get NaN for missing dates
    group['lag_value'] = group.value.shift(7)  # 7 rows == 7 days on a daily index
    group = group.loc[index]  # drop the filler rows
    with_lags.append(group)

# Stack the groups and renumber the rows 0..n-1
with_lags = pd.concat(with_lags, ignore_index=True)
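
If NaN for missing dates is what is wanted, a different technique from the reindex approach above is a self-merge on a date shifted by 7 days. This is a sketch under stated assumptions: the frame is hypothetical, the column names follow the question, and (id, date) pairs are assumed unique (duplicates would multiply rows in the merge).

```python
import pandas as pd

# Hypothetical data shaped like the question's frame.
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2015-01-01", "2015-01-03", "2015-01-08", "2015-01-01", "2015-01-09"]
    ),
    "value": [1.0, 1.2, 1.4, 0.8, 0.9],
})

# Copy the frame, move each date forward 7 days, and rename the value column.
# A row originally dated d now sits at d + 7, where it serves as that day's lag.
lagged = df.assign(date=df["date"] + pd.Timedelta(days=7)).rename(
    columns={"value": "lag_value"}
)

# Left join keeps every original row; (id, date) pairs with no match get NaN.
result = df.merge(lagged, on=["id", "date"], how="left")
print(result)
```

Only the A row on 2015-01-08 has an observation exactly 7 days earlier, so it is the only row with a non-NaN lag_value. This avoids the Python-level loop over groups, at the cost of building a second copy of the frame.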

answered Apr 30 '26 10:04 by Kevin S


