
calculating differences within groups

I have a DataFrame in which each row gives the value of one feature at one time. Times are identified by the time column (there are about 1,000,000 distinct times). Features are identified by the feature column (there are a few dozen features). There is at most one row for any combination of feature and time. At each time, only some of the features are available; the only exception is feature 0, which is available at all times. I'd like to add to that DataFrame a column that shows the value of the feature 0 at that time. Is there a reasonably fast way to do it?

For example, let's say I have

import pandas as pd

df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

I want to add a column that contains [2,2,3,3,3,7,7].

I tried using groupby and boolean indexing, but had no luck.

max asked Dec 13 '25 02:12


1 Answer

I'd like to add to that DataFrame a column that shows the value of the feature 0 at that time. Is there a reasonably fast way to do it?

I think a groupby (which is quite an expensive operation) is overkill for this. Try merging the DataFrame with just the values of feature 0:

>>> pd.merge(
...     df,
...     df[df.feature == 0].drop('feature', axis=1).rename(columns={'value': 'value_0'}))
   feature  time  value  value_0
0        1     1      1        2
1        0     1      2        2
2        0     2      3        3
3        2     2      4        3
4        4     2      5        3
5        3     3      6        7
6        0     3      7        7
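The same lookup can also be written without a merge. The following is only a sketch, not part of the original answer: it builds a Series of feature 0 values indexed by time and maps it onto the time column. The name `value_0_by_time` is made up for illustration, and the approach assumes, as the question states, that feature 0 appears exactly once per time.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

# Series of feature 0 values, indexed by time.
# Assumption: exactly one feature 0 row per time, so the index is unique.
value_0_by_time = df.loc[df['feature'] == 0].set_index('time')['value']

# Look up each row's time in that Series.
df['value_0'] = df['time'].map(value_0_by_time)

print(df['value_0'].tolist())  # [2, 2, 3, 3, 3, 7, 7]
```

Unlike the merge, this adds the column in place and keeps the original row order and index.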

Edit

Per @jezrael's request, here is a timing test:

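import pandas as pd

m = 10000

# Python 3: use integer division, and convert ranges to lists before concatenating.
df = pd.DataFrame({
    'time': list(range(m // 2)) + list(range(m // 2)),
    'feature': list(range(m // 2)) + [0] * (m // 2),
    'value': list(range(m)),
})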

On this input, @jezrael's solution takes 396 ms, whereas mine takes 4.03 ms.
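To reproduce the measurement for the merge approach, something like the following sketch should work. It times only the merge, because @jezrael's solution is not shown here. The helper name `add_value_0` and the repeat count are assumptions made for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

m = 10000

# Same input as the timing test, written for Python 3.
df = pd.DataFrame({
    'time': list(range(m // 2)) + list(range(m // 2)),
    'feature': list(range(m // 2)) + [0] * (m // 2),
    'value': list(range(m)),
})


def add_value_0(frame):
    """Merge approach from the answer: attach feature 0's value at each time."""
    value_0 = (
        frame[frame.feature == 0]
        .drop('feature', axis=1)
        .rename(columns={'value': 'value_0'})
    )
    # 'time' is the only column the two frames share, so it is the join key.
    return pd.merge(frame, value_0, on='time')


runs = 20  # arbitrary repeat count, chosen for illustration
seconds = timeit.timeit(lambda: add_value_0(df), number=runs)
print(f"merge: {seconds / runs * 1000:.2f} ms per call")
```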

Ami Tavory answered Dec 15 '25 19:12


