You can extract rows/columns containing missing values from pandas. DataFrame by using the isnull() or isna() method that checks if an element is a missing value.
Use Sum Function to Count Specific Values in a Column in a Dataframe. We can use the sum() function on a specified column to count values equal to a set condition, in this case we use == to get just rows equal to our specific data point.
Count all NaN in a DataFrame (both columns & Rows)Calling sum() of the DataFrame returned by isnull() will give the count of total NaN in dataframe i.e.
For the second count I think just subtract the number of rows from the number of rows returned from dropna
:
In [14]:
from numpy.random import randn
df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
Out[14]:
one two three
a -0.209453 -0.881878 3.146375
b NaN NaN NaN
c 0.049383 -0.698410 -0.482013
d NaN NaN NaN
e -0.140198 -1.285411 0.547451
f -0.219877 0.022055 -2.116037
g NaN NaN NaN
h -0.224695 -0.025628 -0.703680
In [18]:
df.shape[0] - df.dropna().shape[0]
Out[18]:
3
The first could be achieved using the built in methods:
In [30]:
df.isnull().values.ravel().sum()
Out[30]:
9
Timings
In [34]:
%timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
%timeit df.shape[0] - df.dropna().shape[0]
%timeit sum(map(any, df.apply(pd.isnull)))
1000 loops, best of 3: 1.55 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.82 ms per loop
In [33]:
%timeit sum(df.isnull().values.ravel())
%timeit df.isnull().values.ravel().sum()
%timeit df.isnull().sum().sum()
1000 loops, best of 3: 215 µs per loop
1000 loops, best of 3: 210 µs per loop
1000 loops, best of 3: 605 µs per loop
So my alternatives are a little faster for a df of this size
Update
So for a df with 80,000 rows I get the following:
In [39]:
%timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
%timeit df.shape[0] - df.dropna().shape[0]
%timeit sum(map(any, df.apply(pd.isnull)))
%timeit np.count_nonzero(df.isnull())
1 loops, best of 3: 9.33 s per loop
100 loops, best of 3: 6.61 ms per loop
100 loops, best of 3: 3.84 ms per loop
1000 loops, best of 3: 395 µs per loop
In [40]:
%timeit sum(df.isnull().values.ravel())
%timeit df.isnull().values.ravel().sum()
%timeit df.isnull().sum().sum()
%timeit np.count_nonzero(df.isnull().values.ravel())
1000 loops, best of 3: 675 µs per loop
1000 loops, best of 3: 679 µs per loop
100 loops, best of 3: 6.56 ms per loop
1000 loops, best of 3: 368 µs per loop
Actually np.count_nonzero
wins this hands down.
So many wrong answers here. OP asked for number of rows with null values, not columns.
Here is a better example:
from numpy.random import randn
df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one','two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h','asdf'])
print(df)
`Now there is obviously 4 rows with null values.
one two three
a -0.571617 0.952227 0.030825
b NaN NaN NaN
c 0.627611 -0.462141 1.047515
d NaN NaN NaN
e 0.043763 1.351700 1.480442
f 0.630803 0.931862 1.500602
g NaN NaN NaN
h 0.729103 -1.198237 -0.207602
asdf NaN NaN NaN
You would get answer as 3 (number of columns with NaNs) if you used some of the answers here. Fuentes' answer works.
Here is how I got it:
df.isnull().any(axis=1).sum()
#4
timeit df.isnull().any(axis=1).sum()
#10000 loops, best of 3: 193 µs per loop
'Fuentes':
sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
#4
timeit sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
#1000 loops, best of 3: 677 µs per loop
What about numpy.count_nonzero
:
np.count_nonzero(df.isnull().values)
np.count_nonzero(df.isnull()) # also works
count_nonzero
is pretty quick. However, I constructed a dataframe from a (1000,1000) array and randomly inserted 100 nan values at different positions and measured the times of the various answers in iPython:
%timeit np.count_nonzero(df.isnull().values)
1000 loops, best of 3: 1.89 ms per loop
%timeit df.isnull().values.ravel().sum()
100 loops, best of 3: 3.15 ms per loop
%timeit df.isnull().sum().sum()
100 loops, best of 3: 15.7 ms per loop
Not a huge time improvement over the OPs original but possibly less confusing in the code, your decision. There isn't really any difference in execution time
between the two count_nonzero
methods (with and without .values
).
A simple approach to counting the missing values in the rows or in the columns
df.apply(lambda x: sum(x.isnull().values), axis = 0) # For columns
df.apply(lambda x: sum(x.isnull().values), axis = 1) # For rows
Number of rows with at least one missing value:
sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
Total missing:
df.isnull().sum().sum()
Rows with missing:
sum(map(any, df.isnull()))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With