Pandas drop nan using first valid index by group

I'm working with the following DataFrame:

         Date    Id    Amount
   0    201301    1      nan
   1    201302    1      nan
   2    201303    1      100
   3    201304    1      120
   4    201305    1      nan
   5    201306    1      120
   6    201302    2      nan
   7    201303    2      150
   8    201304    2      180

I'm trying to get the first valid index of Amount for each Id. For some reason, this doesn't work:

df.groupby('Id').Amount.first_valid_index()
# AttributeError: first_valid_index is not a GroupBy method

I also tried this:

df.groupby('Id').Amount.apply(lambda x: x.first_valid_index())

But my dataset has 20M+ rows, so it takes too long and won't work for me.

Is there any faster way to find the first index by group?

My desired output would be:

first_idx = [2,7]

Or even better:

         Date    Id    Amount
   2    201303    1      100
   3    201304    1      120
   4    201305    1      nan
   5    201306    1      120
   7    201303    2      150
   8    201304    2      180

Edit: df.groupby('Id').Amount.apply(lambda x: x.first_valid_index()) does indeed work, but I have the feeling there has to be a faster option; the problem doesn't seem that complex.
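For reference, a minimal snippet that rebuilds the sample DataFrame above (values copied straight from the table; nothing else assumed):

import numpy as np
import pandas as pd

# sample data from the question; np.nan marks the missing amounts
df = pd.DataFrame({
    'Date': [201301, 201302, 201303, 201304, 201305, 201306, 201302, 201303, 201304],
    'Id':   [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'Amount': [np.nan, np.nan, 100, 120, np.nan, 120, np.nan, 150, 180],
})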

asked by Juan C
2 Answers

Option 1: to get just the first indexes:

df[df.Amount.notna()].groupby('Id').Date.idxmin()
# 1.42 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

outputs:

Id
1    2
2    7
Name: Date, dtype: int64
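If you want the exact first_idx list from the question, tack a .tolist() onto the result (a small usage sketch; note this relies on rows within each Id already being in date order, since idxmin picks the row with the smallest Date among the non-null Amounts):

first_idx = df[df.Amount.notna()].groupby('Id').Date.idxmin().tolist()
# [2, 7]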

Option 2: to get the rows from each group's first valid value onward, use cumsum on notna():

df[df['Amount'].notna().groupby(df['Id']).cumsum().gt(0)]
# 2.09 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
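To see why the cumsum trick works, here is an illustrative sketch that prints the intermediate columns (the names cum and keep are just for display):

notna = df['Amount'].notna()               # True where Amount is present
cum = notna.groupby(df['Id']).cumsum()     # running count of valid values per Id
keep = cum.gt(0)                           # False until each group's first valid row
print(pd.concat([df, cum.rename('cum'), keep.rename('keep')], axis=1))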

Option 3: you can ffill() within each group and keep the rows where the result is non-null:

df[df.groupby('Id').Amount.ffill().notna()]
# 831 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Output:

     Date  Id  Amount
2  201303   1   100.0
3  201304   1   120.0
4  201305   1     NaN
5  201306   1   120.0
7  201303   2   150.0
8  201304   2   180.0
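The reason Option 3 works: ffill() within each group propagates the last valid Amount forward, so the only rows left as NaN are the ones before a group's first valid value. A quick sketch to visualize that (illustrative only):

filled = df.groupby('Id').Amount.ffill()   # NaN only before each group's first valid value
print(pd.concat([df, filled.rename('filled'), filled.notna().rename('keep')], axis=1))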

Conclusion: Option 3 is the fastest!
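The timings above look like %timeit runs on the 9-row sample; to trust the ranking at 20M+ rows you would want to repeat them on data of that size. A hedged sketch of such a setup (synthetic data, all names illustrative):

import numpy as np
import pandas as pd

n = 20_000_000
big = pd.DataFrame({
    'Id': np.repeat(np.arange(n // 10), 10),                  # 10 rows per Id
    'Amount': np.where(np.random.rand(n) < 0.3, np.nan, 100.0),
})

# %timeit big[big.groupby('Id').Amount.ffill().notna()]                 # Option 3
# %timeit big[big['Amount'].notna().groupby(big['Id']).cumsum().gt(0)]  # Option 2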


Update: to filter both ends using Option 3:

amt_group = df.groupby('Id').Amount
df[amt_group.bfill().notna() & amt_group.ffill().notna()]
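For example, adding a trailing NaN row for Id 2 shows the combined mask trimming both ends while keeping interior gaps (a quick illustrative check, reusing df and np from the setup snippet in the question):

df2 = pd.concat([df, pd.DataFrame({'Date': [201305], 'Id': [2], 'Amount': [np.nan]})],
                ignore_index=True)
amt_group = df2.groupby('Id').Amount
print(df2[amt_group.bfill().notna() & amt_group.ffill().notna()])
# the interior NaN at index 4 (Id 1) survives; the new trailing NaN for Id 2 is dropped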
answered by Quang Hoang

Create a mask with .notnull + .cumsum to keep everything from the first non-null Amount onward within each group, then slice with it.

m = df.Amount.notnull().groupby(df.Id).cumsum().ge(1)

df.loc[m]
     Date  Id  Amount
2  201303   1   100.0
3  201304   1   120.0
4  201305   1     NaN
5  201306   1   120.0
7  201303   2   150.0
8  201304   2   180.0
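If you ever need the complementary rows instead, everything before each group's first non-null Amount, just invert the mask:

df.loc[~m]
#      Date  Id  Amount
# 0  201301   1     NaN
# 1  201302   1     NaN
# 6  201302   2     NaN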
answered by ALollz