Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the first index of a pandas DataFrame for which several undefined columns are not null?

I have a dataframe with several columns. I want to get the first row index for which:

  • the value of column A is not null
  • there exists at least n other columns for which the values are not null

example: if my dataframe is:

          Date          A             B      C     D
0   2015-01-02          NaN           1      1    NaN
1   2015-01-02          NaN           2      2    NaN
2   2015-01-02          NaN           3      3    NaN
3   2015-01-02          1            NaN     4    NaN
5   2015-01-02          NaN           2      NaN  NaN
6   2015-01-03          1            NaN     6    NaN
7   2015-01-03          1             1      6    NaN
8   2015-01-03          1             1      6     8

If n=1 I would get 3

If n=2 I would get 7

If n=3 I would get 8

like image 748
astudentofmaths Avatar asked Mar 09 '23 07:03

astudentofmaths


2 Answers

Here's one approach to get those indices for different n's in one go -

def numpy_approach(df, reference='A'):
    df0 = df.iloc[:,df.columns != 'Date']
    valid_mask = df0.columns != reference
    mask = ~np.isnan(df0.values)
    count = mask[:,valid_mask].sum(1) * mask[:,(~valid_mask).argmax()]
    idx0 = np.searchsorted(np.maximum.accumulate(count),[1,2,3])
    return df.index[idx0]

Sample runs -

In [555]: df
Out[555]: 
         Date    A    B    C    D
0  2015-01-02  NaN  1.0  1.0  NaN
1  2015-01-02  NaN  2.0  2.0  NaN
2  2015-01-02  NaN  3.0  3.0  NaN
3  2015-01-02  1.0  NaN  4.0  NaN
5  2015-01-02  NaN  2.0  NaN  NaN
6  2015-01-03  1.0  NaN  6.0  NaN
7  2015-01-03  1.0  1.0  6.0  NaN
8  2015-01-03  1.0  1.0  6.0  8.0

In [556]: numpy_approach(df, reference='A')
Out[556]: Int64Index([3, 7, 8], dtype='int64')

In [557]: numpy_approach(df, reference='B')
Out[557]: Int64Index([0, 7, 8], dtype='int64')

In [558]: numpy_approach(df, reference='C')
Out[558]: Int64Index([0, 7, 8], dtype='int64')

In [568]: numpy_approach(df, reference='D')
Out[568]: Int64Index([8, 8, 8], dtype='int64')
like image 194
Divakar Avatar answered Apr 29 '23 15:04

Divakar


You can first select by A where not NaN and columns for count by loc, then get sum per row of notnull values with substract 1 for column A.

Last use boolean mask with idxmax:

a = df.loc[df['A'].notnull(), 'A':].notnull().sum(axis=1).sub(1)
print (a)
3    1
6    1
7    2
8    3
dtype: int64

N = 1
print ((a == N).idxmax())
3

N = 2
print ((a == N).idxmax())
7

N = 3
print ((a == N).idxmax())
8

print (df.loc[df['A'].notnull(), 'A':])
     A    B    C    D
3  1.0  NaN  4.0  NaN
6  1.0  NaN  6.0  NaN
7  1.0  1.0  6.0  NaN
8  1.0  1.0  6.0  8.0
like image 34
jezrael Avatar answered Apr 29 '23 14:04

jezrael