I have a pandas dataframe like:

    a   b  id
1  10   6   1
2   6  -3   1
3  -3  12   1   # first time id 1 has a b value over 10
4   4  23   2   # first time id 2 has a b value over 10
5  12  11   2
6   3  -5   2
How would I make a new dataframe which groups by the id
column and, for each id, takes the first row where column b
is over 10, so that the result would look like:

    a   b  id
1  -3  12   1
2   4  23   2
I have a dataframe with about 2,000,000 rows and roughly 10,000 id
values, so a for loop is very slow.
First use fast boolean indexing for filtering, then groupby + first:
df = df[df['b'] > 10].groupby('id', as_index=False).first()
print (df)
id a b
0 1 -3 12
1 2 4 23
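As a quick check, the one-liner can be run end-to-end on the sample data from the question (a minimal sketch; note the result puts the id column first because of groupby):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'a': [10, 6, -3, 4, 12, 3],
                   'b': [6, -3, 12, 23, 11, -5],
                   'id': [1, 1, 1, 2, 2, 2]})

# Keep only rows where b > 10, then take the first such row per id
out = df[df['b'] > 10].groupby('id', as_index=False).first()
print(out)
#    id  a   b
# 0   1 -3  12
# 1   2  4  23
```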
The solution is a bit more complicated if some group has no value greater than 10 - the mask needs to be expanded with duplicated:
print (df)
    a   b  id
0   7   6   3   # <- no value b>10 for id=3
1  10   6   1
2   6  -3   1
3  -3  12   1
4   4  23   2
5  12  11   2
6   3  -5   2
mask = ~df['id'].duplicated(keep=False) | (df['b'] > 10)
df = df[mask].groupby('id', as_index=False).first()
print (df)
id a b
0 1 -3 12
1 2 4 23
2 3 7 6
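To see what the mask does, here is the expanded example step by step (a sketch; `duplicated(keep=False)` flags every row whose id occurs more than once, so negating it keeps ids that appear only once regardless of their b value):

```python
import pandas as pd

# Expanded example: id 3 appears only once and never has b > 10
df = pd.DataFrame({'a': [7, 10, 6, -3, 4, 12, 3],
                   'b': [6, 6, -3, 12, 23, 11, -5],
                   'id': [3, 1, 1, 1, 2, 2, 2]})

# Keep rows that either belong to a one-row id or have b > 10
mask = ~df['id'].duplicated(keep=False) | (df['b'] > 10)
out = df[mask].groupby('id', as_index=False).first()
print(out)
```

Note that this only rescues ids with a single row: an id with several rows but no b value over 10 would still be dropped by this mask.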
Timings:
import numpy as np
import pandas as pd

#[2000000 rows x 3 columns]
np.random.seed(123)
N = 2000000
df = pd.DataFrame({'id': np.random.randint(10000, size=N),
                   'a': np.random.randint(10, size=N),
                   'b': np.random.randint(15, size=N)})
#print (df)
In [284]: %timeit (df[df['b'] > 10].groupby('id', as_index=False).first())
10 loops, best of 3: 67.6 ms per loop
In [285]: %timeit (df.query("b > 10").groupby('id').head(1))
10 loops, best of 3: 107 ms per loop
In [286]: %timeit (df[df['b'] > 10].groupby('id').head(1))
10 loops, best of 3: 90 ms per loop
In [287]: %timeit df.query("b > 10").groupby('id', as_index=False).first()
10 loops, best of 3: 83.3 ms per loop
#without sorting it is a bit faster
In [288]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first())
10 loops, best of 3: 62.9 ms per loop
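An alternative not timed above is `drop_duplicates`, which also keeps the first filtered row per id while preserving the original row order and index (a sketch, not benchmarked here):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 6, -3, 4, 12, 3],
                   'b': [6, -3, 12, 23, 11, -5],
                   'id': [1, 1, 1, 2, 2, 2]})

# Filter first, then keep only the first remaining row of each id
out = df[df['b'] > 10].drop_duplicates('id')
print(out)
```

Unlike groupby + first, this keeps the original index labels, which can be handy if you need to trace rows back to the source dataframe.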