This question shows how to count NAs in a dataframe for a particular column C. How do I count NAs for all columns (that aren't the groupby column)?
Here is some test code that doesn't work:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,2,2],
'b':[1,np.nan,2,np.nan],
'c':[1,np.nan,2,3]})
# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
result = df.isna().groupby('a').sum()
print(result)
# result:
# b c
# a
# False 2.0 1.0
result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
# a b c
# a
# 1 0 2 1
# 2 0 2 1
Desired output:
b c
a
1 1 1
2 1 0
It's always best to avoid groupby.apply
in favor of the basic functions which are cythonized, as this scales better with many groups. This will lead to a great increase in performance. In this case first check isnull()
on the entire DataFrame
then groupby
+ sum
.
df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
# b c
#a
#1 1 1
#2 1 0
To illustrate the performance gain:
import pandas as pd
import numpy as np
N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
'b': np.random.choice([1, np.nan], N),
'c': np.random.choice([1, np.nan], N)})
%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your question has the answer (You mistyped _df
as df
):
result = df.groupby('a')['b', 'c'].apply(lambda _df: _df.isna().sum())
result
b c
a
1 1 1
2 1 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With