
pandas dataframe drop columns by number of nan

Tags:

python

pandas

I have a dataframe in which some columns contain NaN. I'd like to drop the columns that have a certain number of NaN values or more. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement this?

import pandas as pd
import numpy as np

dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan

print(dff)
pyan asked Jun 18 '15 18:06



2 Answers

dropna has a thresh parameter, which is the minimum number of non-NaN values a column needs in order to be kept. You just need to pass the length of your df minus the number of NaN values you are willing to tolerate as the threshold:

In [13]: dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
          A         B
0  0.517199 -0.806304
1 -0.643074  0.229602
2  0.656728  0.535155
3       NaN -0.162345
4 -0.309663 -0.783539
5  1.244725 -0.274514
6 -0.254232       NaN
7 -1.242430  0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416

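So the above keeps only the columns that have at least len(dff) - 2 non-NaN values, i.e. it drops any column with more than 2 NaN. To drop columns with 2 or more NaN, as the question asks, the threshold would be len(dff) - 1; the result is the same for this example because 'C' has 3 NaN while 'A' and 'B' have 1 each.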
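
To make the threshold arithmetic concrete, here is a minimal sketch on a small deterministic frame. The frame `df` and the variable `max_nan` are illustrative names, not taken from the question. Since `thresh` counts the non-NaN values a column must have to survive, tolerating at most `max_nan` missing values corresponds to `thresh=len(df) - max_nan`.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): A has 1 NaN, B has 2, C has 3.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, 5.0],
    "B": [1.0, np.nan, np.nan, 4.0, 5.0],
    "C": [np.nan, np.nan, np.nan, 4.0, 5.0],
})

# Hypothetical name: the most NaN values a column may have and still be kept.
max_nan = 1

# thresh is the minimum number of non-NaN values required to keep a column,
# so "drop columns with 2 or more NaN" means thresh = len(df) - 1.
kept = df.dropna(thresh=len(df) - max_nan, axis=1)
print(list(kept.columns))  # ['A']

# With thresh = len(df) - 2, a column with exactly 2 NaN survives as well.
kept_two = df.dropna(thresh=len(df) - 2, axis=1)
print(list(kept_two.columns))  # ['A', 'B']
```

Column 'B' is the case that separates the two thresholds: it has exactly 2 NaN, so it is dropped with `len(df) - 1` and kept with `len(df) - 2`.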

EdChum answered Oct 20 '22 06:10


You can use a conditional list comprehension:

>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
          A         B
0 -0.819004  0.919190
1  0.922164  0.088111
2  0.188150  0.847099
3       NaN -0.053563
4  1.327250 -0.376076
5  3.724980  0.292757
6 -0.319342       NaN
7 -1.051529  0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
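
The same filter can probably also be written without the Python-level loop, as a boolean mask over the per-column NaN counts. This is only a sketch on illustrative data: the frame `df` and the name `nan_counts` are assumptions made for the example, and `isna` is used as the equivalent of `isnull`.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): A and B have 1 NaN each, C has 3.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0, np.nan],
    "C": [np.nan, np.nan, 3.0, np.nan],
})

# Count the NaN values in each column; the result is a Series indexed by column name.
nan_counts = df.isna().sum()

# Keep only the columns with fewer than 2 NaN values.
result = df.loc[:, nan_counts < 2]
print(list(result.columns))  # ['A', 'B']
```

Here the condition is stated directly in terms of the number of NaN values, so no conversion to a count of non-NaN values is needed.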
Alexander answered Oct 20 '22 05:10