I have a DataFrame in which each row gives the value of one feature at one time. Times are identified by the time column (there are about 1,000,000 distinct times). Features are identified by the feature column (there are a few dozen features). There is at most one row for any combination of feature and time. At each time, only some of the features are available; the one exception is feature 0, which is available at all times. I'd like to add a column to that DataFrame that shows the value of feature 0 at that row's time. Is there a reasonably fast way to do it?
For example, let's say I have
df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})
I want to add a column that contains [2,2,3,3,3,7,7].
I tried using groupby and boolean indexing, but with no luck.
I think a groupby (which is quite an expensive operation) is overkill for this. Try a merge against only the rows for feature 0:
>>> pd.merge(
...     df,
...     df[df.feature == 0].drop('feature', axis=1).rename(columns={'value': 'value_0'}))
   feature  time  value  value_0
0        1     1      1        2
1        0     1      2        2
2        0     2      3        3
3        2     2      4        3
4        4     2      5        3
5        3     3      6        7
6        0     3      7        7
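If the original row order and index need to stay untouched, one alternative is to build a Series of feature-0 values indexed by time and look up each row's time in it. This is only a sketch and not part of the answer above; it assumes, as the question states, that there is exactly one feature-0 row per time. The name value_0_by_time is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [1, 1, 2, 2, 2, 3, 3],
    'feature': [1, 0, 0, 2, 4, 3, 0],
    'value': [1, 2, 3, 4, 5, 6, 7],
})

# Series indexed by time, holding the feature-0 value at that time.
# Assumption: exactly one feature-0 row exists for every time.
value_0_by_time = df.loc[df['feature'] == 0].set_index('time')['value']

# Look up each row's time in that Series; row order and index are unchanged.
df['value_0'] = df['time'].map(value_0_by_time)

print(df['value_0'].tolist())  # [2, 2, 3, 3, 3, 7, 7]
```

Like the merge, this avoids a groupby; whether it is faster on a given dataset is something to measure rather than assume.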
Edit
Per @jezrael's request, here is a timing test:
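import pandas as pd

m = 10000
half = m // 2  # integer division, so this also works on Python 3

df = pd.DataFrame({
    # range objects cannot be concatenated with + on Python 3; build lists first
    'time': list(range(half)) * 2,
    'feature': list(range(half)) + [0] * half,
    'value': range(m),
})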
On this input, @jezrael's solution takes 396 ms, whereas mine takes 4.03 ms.
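To reproduce a measurement of the merge approach on a current Python 3 and pandas installation, a small timeit harness along these lines could be used. It is a sketch only: the helper name add_value_0 and the repeat counts are arbitrary choices, and absolute timings will differ from the figures quoted above depending on hardware and library versions.

```python
import timeit

import pandas as pd

m = 10000
half = m // 2
df = pd.DataFrame({
    'time': list(range(half)) * 2,
    'feature': list(range(half)) + [0] * half,
    'value': range(m),
})


def add_value_0(frame):
    """Merge each row with the feature-0 row that shares its time."""
    zeros = (
        frame[frame.feature == 0]
        .drop('feature', axis=1)
        .rename(columns={'value': 'value_0'})
    )
    return pd.merge(frame, zeros, on='time')


# Best of 5 repeats, 10 calls per repeat (arbitrary illustrative settings).
best = min(timeit.repeat(lambda: add_value_0(df), number=10, repeat=5)) / 10
print(f"merge: {best * 1e3:.2f} ms per call")
```

Note that this test data has two feature-0 rows at time 0, so the merged result contains two more rows than the input; data that satisfies the question's one-row-per-combination guarantee keeps the row count unchanged.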