
How do I get a summary count of missing/NaN data by column in 'pandas'?

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.

I gather I can do something like

len(mydata.index) - mydata.count() 

to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).

orome asked Mar 07 '14 18:03


People also ask

Which can be used to count total NaN values in a DataFrame?

Counting NaN in the entire DataFrame: to count NaN values in the whole dataset, call sum() twice on the result of isnull(), once to get the count in each column and again to add those counts up across all columns.
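As a rough illustration of the "sum twice" idea, here is a minimal sketch. The frame and its column names are made up for the example; only isnull() and sum() come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: 3 NaN in column "a", 1 NaN in column "b".
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 5.0],
    "b": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

per_column = df.isnull().sum()  # Series: NaN count for each column
total = per_column.sum()        # scalar: NaN count for the whole frame

print(per_column)
print(total)
```

The first sum() collapses the boolean mask column by column; the second adds the per-column counts together.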

Which Summary function gives the number of non null values from a DataFrame?

DataFrame - info() function. The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

Which attribute is used with series to count total number of NaN values?

Count total NaN in each column of a DataFrame: calling sum() on the DataFrame returned by isnull() gives a Series containing the count of NaN in each column.

Does count include NaN pandas?

No. count() counts non-NA cells for each column or row. The values None, NaN, NaT, and optionally numpy.
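A small sketch of that behaviour, assuming a made-up frame with one missing entry of each kind (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is missing in every column (NaN, None, NaT).
df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "obj": ["x", None, "z"],
    "when": pd.to_datetime(["2014-03-07", None, "2014-03-09"]),
})

col_counts = df.count()        # non-NA cells per column
row_counts = df.count(axis=1)  # non-NA cells per row

print(col_counts)
print(row_counts)
```

Each column should report two non-NA cells here, and the middle row should report none, since count() skips all three kinds of missing value.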


2 Answers

Both describe and info report the count of non-missing values.

In [1]: df = DataFrame(np.random.randn(10,2))

In [2]: df.iloc[3:6,0] = np.nan

In [3]: df
Out[3]:
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427

[10 rows x 2 columns]

In [4]: df.describe()
Out[4]:
              0          1
count  7.000000  10.000000
mean  -0.004166   0.286042
std    0.818586   1.363422
min   -1.237742  -1.561594
25%   -0.583795  -0.648684
50%    0.113200   0.216699
75%    0.636036   0.608839
max    0.990895   3.018954

[8 rows x 2 columns]

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0    7 non-null float64
1    10 non-null float64
dtypes: float64(2)

To get a count of missing values, your solution is correct:

In [20]: len(df.index)-df.count()
Out[20]:
0    3
1    0
dtype: int64

You could also do this:

In [23]: df.isnull().sum()
Out[23]:
0    3
1    0
dtype: int64
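If you want something closer to a summary table, one possible sketch is to put the two counts side by side. Note that missing_summary is a hypothetical helper written for this example, not a pandas function, and the seeded random data only mimics the frame above.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-missing count and missing count.

    Hypothetical helper for illustration; not part of pandas.
    """
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_missing": df.count(),
        "missing": df.isnull().sum(),
    })

# Same shape as the frame in the answer: 10 rows, 3 NaN in column 0.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)))
df.iloc[3:6, 0] = np.nan

summary = missing_summary(df)
print(summary)

# The two idioms from the answer give the same per-column counts.
same = ((len(df.index) - df.count()) == df.isnull().sum()).all()
print(same)
```

This only rearranges what count() and isnull().sum() already return, so it should agree with both idioms shown above.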
Jeff answered Oct 07 '22 12:10


As a tiny addition, to get percentage missing by DataFrame column, combining @Jeff and @userS's answers above gets you:

df.isnull().sum()/len(df)*100 
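For what it's worth, here is a small sketch of that expression on a made-up frame (the column names are hypothetical). It also shows that taking the mean of the boolean mask gives the same fraction, since True counts as 1 and False as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 10 rows, 3 NaN in "a", none in "b".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0, 9.0, 10.0],
    "b": np.arange(10, dtype=float),
})

pct_missing = df.isnull().sum() / len(df) * 100  # expression from the answer
pct_via_mean = df.isnull().mean() * 100          # equivalent: mean of the mask

print(pct_missing)
print(pct_via_mean)
```

Both forms divide by the number of rows, so they would fail in the same way on a frame with zero rows.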
Ricky McMaster answered Oct 07 '22 13:10