I have a problem that I think is fairly trivial, but I cannot find a solution on the web or in the documentation.
What do I want to do?
I have a dataframe like this
CLASS FEATURE1 FEATURE2 FEATURE3
X A NaN NaN
X NaN A NaN
B A A A
I want to group by the label (CLASS) and count the number of NaN values in every feature, so that the result looks like this. The purpose is to get a general idea of how missing values are distributed over the different classes.
CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0
I know how to get the number of non-null values: df.groupby('CLASS').count()
Is there something similar for NaN values?
I tried to subtract the count() from the size(), but it returned a misaligned output filled with NaN values.
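For reference, the example frame can be rebuilt like this (a minimal sketch; the values are reconstructed from the text above, so treat the construction as an assumption):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame from the question (assumed values)
df = pd.DataFrame({
    "CLASS":    ["X", "X", "B"],
    "FEATURE1": ["A", np.nan, "A"],
    "FEATURE2": [np.nan, "A", "A"],
    "FEATURE3": [np.nan, np.nan, "A"],
})

# count() only tallies the non-null values per group
print(df.groupby("CLASS").count())
```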
Compute a mask with isna, then group and find the sum:
df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum().reset_index()
CLASS FEATURE1 FEATURE2 FEATURE3
0 X 1.0 1.0 2.0
1 B 0.0 0.0 0.0
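As a self-contained sketch of this approach (the frame is rebuilt from the question's example, so the construction is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CLASS":    ["X", "X", "B"],
    "FEATURE1": ["A", np.nan, "A"],
    "FEATURE2": [np.nan, "A", "A"],
    "FEATURE3": [np.nan, np.nan, "A"],
})

# Boolean mask of the missing values, grouped by the original CLASS column;
# summing booleans counts the True entries, i.e. the NaNs.
out = df.drop(columns="CLASS").isna().groupby(df.CLASS, sort=False).sum().reset_index()
print(out)
```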
Another option is to subtract the size from the count using rsub along the 0th axis for index-aligned subtraction:
df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)
Or,
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)
FEATURE1 FEATURE2 FEATURE3
CLASS
B 0 0 0
X 1 1 2
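A runnable sketch of the size/count variant (again with the frame reconstructed from the question's example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CLASS":    ["X", "X", "B"],
    "FEATURE1": ["A", np.nan, "A"],
    "FEATURE2": [np.nan, "A", "A"],
    "FEATURE3": [np.nan, np.nan, "A"],
})

g = df.groupby("CLASS")
# size() counts all rows per group, count() only the non-null cells,
# so size - count is exactly the number of NaNs per column.
nan_counts = g.count().rsub(g.size(), axis=0)
print(nan_counts)
```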
There are quite a few good answers, so here are some timeits for your perusal:
df_ = df
df = pd.concat([df_] * 10000)
%timeit df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum()
%timeit df.set_index('CLASS').isna().sum(level=0)
%%timeit
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)
11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Actual performance depends on your data and setup, so your mileage may vary.
Update due to a FutureWarning:
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
So the old
df.set_index('CLASS').isna().sum(level=0)
becomes
df.set_index('CLASS').isna().groupby(level=0).sum()
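A self-contained sketch of the non-deprecated form (the frame is reconstructed from the question's example, so its values are an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CLASS":    ["X", "X", "B"],
    "FEATURE1": ["A", np.nan, "A"],
    "FEATURE2": [np.nan, "A", "A"],
    "FEATURE3": [np.nan, np.nan, "A"],
})

# Group the boolean NaN mask by the first (and only) index level
out = df.set_index("CLASS").isna().groupby(level=0).sum()
print(out)
```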
You can use set_index and sum:
# Will be deprecated soon; use the groupby(level=0) statement above instead.
df.set_index('CLASS').isna().sum(level=0)
Output:
FEATURE1 FEATURE2 FEATURE3
CLASS
X 1.0 1.0 2.0
B 0.0 0.0 0.0
Using the diff between count and size:
g = df.groupby('CLASS')
-g.count().sub(g.size(), axis=0)
FEATURE1 FEATURE2 FEATURE3
CLASS
B 0 0 0
X 1 1 2
And we can turn this into the more generic question of how to count the NaNs in a dataframe with a for loop:
pd.DataFrame({x: y.isna().sum() for x, y in g}).T.drop(columns='CLASS')
Output:
FEATURE1 FEATURE2 FEATURE3
B 0 0 0
X 1 1 2
Another solution (mostly for fun):
df.assign(
    **{col: df[col].isna() for col in df.columns if col != "CLASS"},
).groupby("CLASS").sum()
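Sketched out end to end (the frame is a reconstruction of the question's example, and the filter assumes CLASS is the only non-feature column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CLASS":    ["X", "X", "B"],
    "FEATURE1": ["A", np.nan, "A"],
    "FEATURE2": [np.nan, "A", "A"],
    "FEATURE3": [np.nan, np.nan, "A"],
})

# Replace every feature column by its NaN mask, then sum the booleans per class
result = df.assign(
    **{col: df[col].isna() for col in df.columns if col != "CLASS"}
).groupby("CLASS").sum()
print(result)
```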