I have the DataFrame
df = pd.DataFrame({
'colA':['?',2,3,4,'?'],
'colB':[1,2,'?',3,4],
'colC':['?',2,3,4,5]
})
I would like to get the count the the number of '?'
in each column and return the following output -
colA - 2
colB - 1
colC - 1
Is there a way to return this output at once. Right now the only way I know how to do it is write a for loop for each column.
looks like the simple way is
df[df == '?'].count()
the result is
colA 2
colB 1
colC 1
dtype: int64
where df[df == '?']
give us DataFrame with ?
and Nan
colA colB colC
0 ? NaN ?
1 NaN NaN NaN
2 NaN ? NaN
3 NaN NaN NaN
4 ? NaN NaN
and the count non-NA cells for each column.
Please, look on the other solutions: good readable and the most faster
You can use numpy.count_nonzero
here.
pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
# pd.Series((df.values == '?').sum(0), index=df.columns)
colA 2
colB 1
colC 1
dtype: int64
Timeit results:
Benchmarking with df
of shape (1_000_000, 3)
big_df = pd.DataFrame(df.to_numpy().repeat(200_000,axis=0))
big_df.shape
(1000000, 3)
In [186]: %timeit pd.Series(np.count_nonzero(big_df.to_numpy()=='?', axis=0), index=big_df.columns)
53.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit big_df.eq('?').sum()
171 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [188]: %timeit big_df[big_df == '?'].count()
314 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [189]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.values), index=big_df.columns)
174 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With