I have a dataframe with several columns. I want to get the first row index for which:
example: if my dataframe is:
Date A B C D
0 2015-01-02 NaN 1 1 NaN
1 2015-01-02 NaN 2 2 NaN
2 2015-01-02 NaN 3 3 NaN
3 2015-01-02 1 NaN 4 NaN
5 2015-01-02 NaN 2 NaN NaN
6 2015-01-03 1 NaN 6 NaN
7 2015-01-03 1 1 6 NaN
8 2015-01-03 1 1 6 8
If n=1 I would get 3
If n=2 I would get 7
If n=3 I would get 8
Here's one approach to get those indices for different n's
in one go -
def numpy_approach(df, reference='A'):
df0 = df.iloc[:,df.columns != 'Date']
valid_mask = df0.columns != reference
mask = ~np.isnan(df0.values)
count = mask[:,valid_mask].sum(1) * mask[:,(~valid_mask).argmax()]
idx0 = np.searchsorted(np.maximum.accumulate(count),[1,2,3])
return df.index[idx0]
Sample runs -
In [555]: df
Out[555]:
Date A B C D
0 2015-01-02 NaN 1.0 1.0 NaN
1 2015-01-02 NaN 2.0 2.0 NaN
2 2015-01-02 NaN 3.0 3.0 NaN
3 2015-01-02 1.0 NaN 4.0 NaN
5 2015-01-02 NaN 2.0 NaN NaN
6 2015-01-03 1.0 NaN 6.0 NaN
7 2015-01-03 1.0 1.0 6.0 NaN
8 2015-01-03 1.0 1.0 6.0 8.0
In [556]: numpy_approach(df, reference='A')
Out[556]: Int64Index([3, 7, 8], dtype='int64')
In [557]: numpy_approach(df, reference='B')
Out[557]: Int64Index([0, 7, 8], dtype='int64')
In [558]: numpy_approach(df, reference='C')
Out[558]: Int64Index([0, 7, 8], dtype='int64')
In [568]: numpy_approach(df, reference='D')
Out[568]: Int64Index([8, 8, 8], dtype='int64')
You can first select by A
where not NaN
and columns for count by loc
, then get sum
per row of notnull
values with substract 1
for column A
.
Last use boolean mask with idxmax
:
a = df.loc[df['A'].notnull(), 'A':].notnull().sum(axis=1).sub(1)
print (a)
3 1
6 1
7 2
8 3
dtype: int64
N = 1
print ((a == N).idxmax())
3
N = 2
print ((a == N).idxmax())
7
N = 3
print ((a == N).idxmax())
8
print (df.loc[df['A'].notnull(), 'A':])
A B C D
3 1.0 NaN 4.0 NaN
6 1.0 NaN 6.0 NaN
7 1.0 1.0 6.0 NaN
8 1.0 1.0 6.0 8.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With