In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.
I gather I can do something like
len(mydata.index) - mydata.count()
to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).
Counting NaN in the entire DataFrame: to count NaN values in the entire dataset, call sum() twice on the result of isnull(), once to get the count in each column and again to total those counts across all columns.
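A minimal sketch of the "sum twice" idea, assuming a small made-up frame (the column names "a" and "b" and their values are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: 3 NaN in column "a", 1 NaN in column "b".
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 5.0],
    "b": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# First sum(): collapses the boolean mask to a per-column NaN count (a Series).
per_column = df.isnull().sum()

# Second sum(): adds the per-column counts into one total for the whole frame.
total_missing = per_column.sum()

print(per_column)
print(total_missing)
```

The intermediate Series is kept here only so both levels are visible; chaining the two calls as df.isnull().sum().sum() should give the same total.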
DataFrame - info() function. The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
Count total NaN in each column of a DataFrame: calling sum() on the DataFrame returned by isnull() gives a Series containing the count of NaN in each column.
Count non-NA cells for each column or row. The values None, NaN, NaT, and optionally numpy.
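As a rough illustration of counting non-NA cells per column or per row, and turning those into missing counts, here is a sketch on a made-up frame (the data and the variable names are assumptions, not from the original thread):

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with NaN scattered over two columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

# count() defaults to axis=0: one non-NA count per column.
non_na_per_column = df.count()

# axis=1 gives one non-NA count per row instead.
non_na_per_row = df.count(axis=1)

# Missing counts are the axis length minus the non-NA counts.
missing_per_column = len(df) - non_na_per_column
missing_per_row = df.shape[1] - non_na_per_row

print(missing_per_column)
print(missing_per_row)
```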
Both describe and info report the count of non-missing values.
In [1]: df = DataFrame(np.random.randn(10,2))

In [2]: df.iloc[3:6,0] = np.nan

In [3]: df
Out[3]:
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427

[10 rows x 2 columns]

In [4]: df.describe()
Out[4]:
              0          1
count  7.000000  10.000000
mean  -0.004166   0.286042
std    0.818586   1.363422
min   -1.237742  -1.561594
25%   -0.583795  -0.648684
50%    0.113200   0.216699
75%    0.636036   0.608839
max    0.990895   3.018954

[8 rows x 2 columns]

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0    7 non-null float64
1    10 non-null float64
dtypes: float64(2)
To get a count of missing values, your solution is correct:
In [20]: len(df.index)-df.count()
Out[20]:
0    3
1    0
dtype: int64
You could also do this:
In [23]: df.isnull().sum()
Out[23]:
0    3
1    0
dtype: int64
As a tiny addition, to get the percentage of missing values per DataFrame column, combining @Jeff and @userS's answers above gives you:
df.isnull().sum()/len(df)*100
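If it helps, the count and the percentage can be put side by side in one small table. This is only a sketch: the frame is rebuilt the same way as in the session above, and the column labels n_missing and pct_missing are names made up for this example.

```python
import numpy as np
import pandas as pd

# Same shape as the example above: 10 rows, 2 columns, 3 NaN in column 0.
df = pd.DataFrame(np.random.randn(10, 2))
df.iloc[3:6, 0] = np.nan

missing = df.isnull().sum()  # NaN count per column

# Hypothetical summary layout: one row per column of df.
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": missing / len(df) * 100,
})

print(summary)
```

Dividing by len(df) uses the row count, so the percentage is relative to all rows, including those where the value is missing.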