I have a series that contains NaN and True values. I want to generate another series such that wherever the original is NaN the value is 0, and between two NaN rows the True values get a cumulative count (cumcount) starting from 0.
i.e.,
Input:
colA
NaN
True
True
True
True
NaN
True
NaN
NaN
True
True
True
True
True
Output
colA Sequence
NaN 0
True 0
True 1
True 2
True 3
NaN 0
True 0
NaN 0
NaN 0
True 0
True 1
True 2
True 3
True 4
How can I do this in pandas?
If performance is important, it is better not to use groupby. To count consecutive Trues:
a = df['colA'].notnull()
b = a.cumsum()
df['Sequence'] = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
print (df)
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
Explanation:
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA':[np.nan,True,True,True,True,np.nan,
                           True,np.nan,np.nan,True,True,True,True,True]})
a = df['colA'].notnull()
#cumulative sum; True is counted as 1
b = a.cumsum()
#replace positions where a is True with NaN
c = b.mask(a)
#add 1 so the count starts from 0
d = b.mask(a).add(1)
#forward-fill NaNs, replace possible leading NaNs with 0 and cast to int
e = b.mask(a).add(1).ffill().fillna(0).astype(int)
#subtract from b to get the counts
f = b-b.mask(a).add(1).ffill().fillna(0).astype(int)
#set rows where a is False (the NaN rows) to 0
g = (b-b.mask(a).add(1).ffill().fillna(0).astype(int)).where(a, 0)
#all together
df = pd.concat([a,b,c,d,e,f,g], axis=1, keys=list('abcdefg'))
print (df)
a b c d e f g
0 False 0 0.0 1.0 1 -1 0
1 True 1 NaN NaN 1 0 0
2 True 2 NaN NaN 1 1 1
3 True 3 NaN NaN 1 2 2
4 True 4 NaN NaN 1 3 3
5 False 4 4.0 5.0 5 -1 0
6 True 5 NaN NaN 5 0 0
7 False 5 5.0 6.0 6 -1 0
8 False 5 5.0 6.0 6 -1 0
9 True 6 NaN NaN 6 0 0
10 True 7 NaN NaN 6 1 1
11 True 8 NaN NaN 6 2 2
12 True 9 NaN NaN 6 3 3
13 True 10 NaN NaN 6 4 4
You could use groupby + cumcount + mask here:
m = df.colA.isnull()
df['Sequence'] = df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
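To see how the grouping works, the intermediates can be inspected on a small example (a sketch using shortened data from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, np.nan, np.nan, True, True]})

m = df.colA.isnull()
# every NaN increments the group id, so each NaN starts a new block
print(m.cumsum().tolist())  # [1, 1, 1, 2, 3, 3, 3]

# cumcount restarts at 0 in each block; the NaN itself occupies position 0,
# so subtract 1 and force the NaN rows back to 0 with mask
df['Sequence'] = df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
print(df['Sequence'].tolist())  # [0, 0, 1, 0, 0, 0, 1]
```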
Or, use clip_lower in the last step, so you don't have to pre-compute m:
df['Sequence'] = df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)
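Note that Series.clip_lower was deprecated and then removed in pandas 1.0, so on modern pandas the same step is written with clip(lower=0). A sketch of the identical logic on a shortened version of the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, np.nan, True, True, True]})

# clip(lower=0) replaces the removed clip_lower(0) on pandas >= 1.0
df['Sequence'] = (df.groupby(df.colA.isnull().cumsum())
                    .cumcount()
                    .sub(1)
                    .clip(lower=0))
print(df['Sequence'].tolist())  # [0, 0, 1, 0, 0, 1, 2]
```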
df
colA Sequence
0 NaN 0
1 True 0
2 True 1
3 True 2
4 True 3
5 NaN 0
6 True 0
7 NaN 0
8 NaN 0
9 True 0
10 True 1
11 True 2
12 True 3
13 True 4
Timings
df = pd.concat([df] * 10000, ignore_index=True)
# Timing the alternatives in this answer
%%timeit
m = df.colA.isnull()
df.groupby(m.cumsum()).cumcount().sub(1).mask(m, 0)
23.3 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.groupby(df.colA.isnull().cumsum()).cumcount().sub(1).clip_lower(0)
24.1 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @user2314737's solution
%%timeit
df.groupby((df['colA'] != df['colA'].shift(1)).cumsum()).cumcount()
29.8 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @jezrael's solution
%%timeit
a = df['colA'].notnull()
b = a.cumsum()
(b-b.mask(a).add(1).ffill().fillna(0).astype(int)).clip_lower(0)
11.5 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Note, your mileage may vary, depending on the data.
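For completeness, the fastest (non-groupby) approach can be checked end-to-end on the question's full data; this sketch uses clip(lower=0) so it also runs on current pandas, where clip_lower no longer exists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [np.nan, True, True, True, True, np.nan,
                            True, np.nan, np.nan, True, True, True, True, True]})

a = df['colA'].notnull()
b = a.cumsum()
# distance from the row after the last NaN, floored at 0 for the NaN rows
df['Sequence'] = (b - b.mask(a).add(1).ffill().fillna(0).astype(int)).clip(lower=0)
print(df['Sequence'].tolist())
# [0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 4]
```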