Best way to count the number of rows with missing values in a pandas DataFrame

For the second count (the number of rows with missing values), I think you can just subtract the number of rows returned by dropna from the total number of rows:

In [14]:

import pandas as pd
from numpy.random import randn
df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
               columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
Out[14]:
        one       two     three
a -0.209453 -0.881878  3.146375
b       NaN       NaN       NaN
c  0.049383 -0.698410 -0.482013
d       NaN       NaN       NaN
e -0.140198 -1.285411  0.547451
f -0.219877  0.022055 -2.116037
g       NaN       NaN       NaN
h -0.224695 -0.025628 -0.703680
In [18]:

df.shape[0] - df.dropna().shape[0]
Out[18]:
3
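Because the session above uses random data, here is a small reproducible sketch of the same subtraction. The frame `df_demo` and its values are made up for illustration; it includes one row that is only partially missing, to show that such rows are counted too.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
df_demo = pd.DataFrame(
    {
        "one": [1.0, np.nan, 3.0, 4.0],
        "two": [1.0, np.nan, np.nan, 4.0],
        "three": [1.0, np.nan, 3.0, 4.0],
    },
    index=["a", "b", "c", "d"],
)

# dropna() with its default how="any" drops every row that has at least
# one NaN, so the difference in length is the number of rows with missing data.
rows_with_nan = len(df_demo) - len(df_demo.dropna())
print(rows_with_nan)  # 2 (rows "b" and "c")
```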

The first count (the total number of missing values) can be obtained with the built-in methods:

In [30]:

df.isnull().values.ravel().sum()
Out[30]:
9

Timings

In [34]:

%timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
%timeit df.shape[0] - df.dropna().shape[0]
%timeit sum(map(any, df.apply(pd.isnull)))
1000 loops, best of 3: 1.55 ms per loop
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 1.82 ms per loop
In [33]:

%timeit sum(df.isnull().values.ravel())
%timeit df.isnull().values.ravel().sum()
%timeit df.isnull().sum().sum()
1000 loops, best of 3: 215 µs per loop
1000 loops, best of 3: 210 µs per loop
1000 loops, best of 3: 605 µs per loop

So my alternatives are a little faster for a df of this size

Update

So for a df with 80,000 rows I get the following:

In [39]:

%timeit sum([True for idx,row in df.iterrows() if any(row.isnull())])
%timeit df.shape[0] - df.dropna().shape[0]
%timeit sum(map(any, df.apply(pd.isnull)))
%timeit np.count_nonzero(df.isnull())
1 loops, best of 3: 9.33 s per loop
100 loops, best of 3: 6.61 ms per loop
100 loops, best of 3: 3.84 ms per loop
1000 loops, best of 3: 395 µs per loop
In [40]:

%timeit sum(df.isnull().values.ravel())
%timeit df.isnull().values.ravel().sum()
%timeit df.isnull().sum().sum()
%timeit np.count_nonzero(df.isnull().values.ravel())
1000 loops, best of 3: 675 µs per loop
1000 loops, best of 3: 679 µs per loop
100 loops, best of 3: 6.56 ms per loop
1000 loops, best of 3: 368 µs per loop

Actually np.count_nonzero wins this hands down.
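Note that np.count_nonzero applied to the whole mask counts missing cells, not rows. A minimal sketch of how it could be used for both counts, assuming a small made-up frame `df_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has one NaN, row 1 has two, row 2 has none.
df_demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]}
)

mask = df_demo.isnull().values  # 2-D boolean NumPy array

# Counting over the full mask gives the total number of missing cells.
total_missing_cells = np.count_nonzero(mask)

# Reducing each row with any() first gives the number of rows with a NaN.
rows_with_missing = np.count_nonzero(mask.any(axis=1))

print(total_missing_cells, rows_with_missing)  # 3 2
```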

So many wrong answers here. OP asked for number of rows with null values, not columns.

Here is a better example:

import pandas as pd
from numpy.random import randn
df = pd.DataFrame(randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'asdf'])
print(df)

Now there are obviously 4 rows with null values.

           one       two     three
a    -0.571617  0.952227  0.030825
b          NaN       NaN       NaN
c     0.627611 -0.462141  1.047515
d          NaN       NaN       NaN
e     0.043763  1.351700  1.480442
f     0.630803  0.931862  1.500602
g          NaN       NaN       NaN
h     0.729103 -1.198237 -0.207602
asdf       NaN       NaN       NaN

You would get 3 as the answer (the number of columns with NaNs) if you used some of the answers here. Fuentes' answer works.
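One likely reason for the wrong result is that iterating over a DataFrame yields its column labels, not its rows. The sketch below uses a made-up frame `df_demo` with 3 columns and 4 rows containing NaN; it suggests that the `map(any, ...)` pattern really counts non-empty column labels, whatever the data holds.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 rows, of which the last 4 are entirely NaN.
values = [1.0, np.nan, np.nan, np.nan, np.nan]
df_demo = pd.DataFrame({"one": values, "two": values, "three": values})

null_mask = df_demo.isnull()

# Iterating a DataFrame walks over column labels.
print(list(null_mask))  # ['one', 'two', 'three']

# any("one") is True because the string is non-empty, so this counts columns.
column_label_count = sum(map(any, null_mask))

# Iterating the underlying 2-D array walks over rows instead.
rows_with_nan = sum(map(any, null_mask.values))

print(column_label_count, rows_with_nan)  # 3 4
```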

Here is how I got it:

df.isnull().any(axis=1).sum()
#4
%timeit df.isnull().any(axis=1).sum()
#10000 loops, best of 3: 193 µs per loop

Fuentes' answer:

sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
#4
%timeit sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)
#1000 loops, best of 3: 677 µs per loop

What about numpy.count_nonzero:

np.count_nonzero(df.isnull().values)
np.count_nonzero(df.isnull())           # also works

count_nonzero is pretty quick. To check, I constructed a DataFrame from a (1000, 1000) array, randomly inserted 100 NaN values at different positions, and measured the times of the various answers in IPython:

%timeit np.count_nonzero(df.isnull().values)
1000 loops, best of 3: 1.89 ms per loop

%timeit df.isnull().values.ravel().sum()
100 loops, best of 3: 3.15 ms per loop

%timeit df.isnull().sum().sum()
100 loops, best of 3: 15.7 ms per loop

Not a huge time improvement over the OP's original, but possibly less confusing in the code; your decision. There isn't really any difference in execution time between the two count_nonzero methods (with and without .values).
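For anyone who wants to repeat this comparison outside IPython, here is a rough sketch using the standard timeit module. The seed, the helper names (`df_big`, `candidates`) and the repeat count of 20 are arbitrary choices, not the setup used for the numbers above, and timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Build a 1000x1000 frame and place 100 NaN values at distinct random positions.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 1000))
flat_positions = rng.choice(data.size, size=100, replace=False)
row_idx, col_idx = np.unravel_index(flat_positions, data.shape)
data[row_idx, col_idx] = np.nan
df_big = pd.DataFrame(data)

# The three ways of counting all missing cells that are compared above.
candidates = {
    "count_nonzero": lambda: np.count_nonzero(df_big.isnull().values),
    "ravel().sum()": lambda: df_big.isnull().values.ravel().sum(),
    "sum().sum()": lambda: df_big.isnull().sum().sum(),
}

# First check that all candidates agree, then time each one.
results = {name: int(fn()) for name, fn in candidates.items()}
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{name}: {results[name]} missing, {seconds * 1e3:.2f} ms per call")
```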

A simple approach to counting the missing values in the rows or in the columns:

df.apply(lambda x: sum(x.isnull().values), axis = 0) # For columns
df.apply(lambda x: sum(x.isnull().values), axis = 1) # For rows

Number of rows with at least one missing value:

sum(df.apply(lambda x: sum(x.isnull().values), axis = 1)>0)

Total missing:

df.isnull().sum().sum()

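Rows with missing (iterating over the underlying array, because iterating over the DataFrame itself yields column labels rather than rows):

sum(map(any, df.isnull().values))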
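The same per-column, per-row, and total counts can also be written without apply or lambda by passing an axis to sum. This is an alternative sketch rather than the approach above, and `df_demo` is a made-up frame.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row "a" has one NaN, row "b" has two, row "c" has none.
df_demo = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]},
    index=["a", "b", "c"],
)

null_mask = df_demo.isnull()

missing_per_column = null_mask.sum(axis=0)  # one value per column
missing_per_row = null_mask.sum(axis=1)     # one value per row

# Rows with at least one missing value, and the total number of missing cells.
rows_with_missing = int((missing_per_row > 0).sum())
total_missing = int(missing_per_row.sum())

print(missing_per_column.to_dict())  # {'one': 1, 'two': 2}
print(missing_per_row.to_dict())     # {'a': 1, 'b': 2, 'c': 0}
print(rows_with_missing, total_missing)  # 2 3
```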

Tags:

python

pandas

missing-data
