Certain columns of <code>df</code> are 80% <code>NaN</code>. What's the simplest code to drop such columns?

How to drop column according to NAN percentage for dataframe?

2 Answers

You can use isnull with mean to get the fraction of NaN values per column, then select columns by boolean indexing with loc (because the filter applies to columns). The condition must be inverted: keeping columns with < .8 removes all columns whose NaN fraction is >= 0.8:

df = df.loc[:, df.isnull().mean() < .8]

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan

print(df.isnull().mean())
A    0.81
B    0.00
C    0.06
D    0.80
E    0.00
dtype: float64

df = df.loc[:, df.isnull().mean() < .8]
print(df.head())
          B   C         E
0  0.278369 NaN  0.004719
1  0.670749 NaN  0.575093
2  0.209202 NaN  0.219697
3  0.811683 NaN  0.274074
4  0.940030 NaN  0.175410
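The same filter can be wrapped in a small helper so the cutoff is a parameter. This is only a sketch: the function name drop_sparse_columns, its max_nan_fraction argument and the sample columns are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_nan_fraction=0.8):
    """Keep only columns whose NaN fraction is below max_nan_fraction.

    Hypothetical helper: a column whose NaN fraction is equal to or above
    the cutoff is dropped, matching the `< .8` condition used above.
    """
    nan_fraction = df.isnull().mean()  # fraction of NaN values per column
    return df.loc[:, nan_fraction < max_nan_fraction]


# Made-up sample data with 10 rows
df = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],    # 90% NaN
    "some_missing": [np.nan] * 3 + [1.0] * 7,  # 30% NaN
    "complete": list(range(10)),               # 0% NaN
})

result = drop_sparse_columns(df)
print(list(result.columns))  # ['some_missing', 'complete']
```

The helper returns a new selection and leaves the original df untouched, so the cutoff can be tried at several values before committing to one.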

If you want to remove columns based on a minimum count of non-NaN values, dropna works nicely with the thresh parameter, plus axis=1 to drop columns:

import numpy as np
import pandas as pd

np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print(df)
     0   1    2    3    4    5    6    7   8    9
0  NaN NaN  NaN  1.0  1.0  NaN  NaN  NaN NaN  NaN
1  1.0 NaN  1.0  NaN  NaN  NaN  NaN  NaN NaN  NaN
2  NaN NaN  NaN  NaN  NaN  1.0  1.0  NaN NaN  NaN
3  NaN NaN  NaN  NaN  1.0  NaN  NaN  NaN NaN  NaN
4  NaN NaN  NaN  NaN  NaN  1.0  NaN  NaN NaN  1.0
5  NaN NaN  NaN  1.0  1.0  NaN  NaN  1.0 NaN  1.0
6  NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN  NaN
7  NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN  NaN
8  NaN NaN  NaN  NaN  NaN  NaN  NaN  1.0 NaN  NaN
9  1.0 NaN  NaN  NaN  1.0  NaN  NaN  1.0 NaN  NaN

# keep only columns with at least 2 non-NaN values
df1 = df.dropna(thresh=2, axis=1)
print(df1)
     0    3    4    5    7    9
0  NaN  1.0  1.0  NaN  NaN  NaN
1  1.0  NaN  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  1.0  NaN  NaN
3  NaN  NaN  1.0  NaN  NaN  NaN
4  NaN  NaN  NaN  1.0  NaN  1.0
5  NaN  1.0  1.0  NaN  1.0  1.0
6  NaN  NaN  NaN  NaN  NaN  NaN
7  NaN  NaN  NaN  NaN  NaN  NaN
8  NaN  NaN  NaN  NaN  1.0  NaN
9  1.0  NaN  1.0  NaN  1.0  NaN
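If the cutoff is given as a percentage rather than a count, the thresh value can be derived from the number of rows. A minimal sketch, assuming an integer percentage; the helper name min_non_nan_count and the sample data are made up for illustration. Since thresh counts non-NaN values, a column sitting exactly on the boundary (here 2 of 10 values present, so 80% NaN) is kept, unlike with the < .8 filter:

```python
import numpy as np
import pandas as pd


def min_non_nan_count(n_rows, min_non_nan_percent):
    """Smallest non-NaN count that reaches min_non_nan_percent of n_rows.

    Hypothetical helper. Uses integer ceiling division, so it assumes an
    integer percentage and involves no float rounding.
    """
    return -(-(min_non_nan_percent * n_rows) // 100)


# Made-up sample data with 10 rows
df = pd.DataFrame({
    "a": [1.0, 1.0] + [np.nan] * 8,  # 2 of 10 values present
    "b": [1.0] + [np.nan] * 9,       # 1 of 10 values present
    "c": [1.0] * 10,                 # all values present
})

# Require at least 20% non-NaN values per column
thresh = min_non_nan_count(len(df), 20)
df1 = df.dropna(thresh=thresh, axis=1)
print(thresh, list(df1.columns))  # 2 ['a', 'c']
```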

EDIT: For non-Boolean data

Keep the columns whose total number of NaN entries is less than 80% of the number of rows:

 df = df.loc[:, df.isnull().sum() < 0.8*df.shape[0]]
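Both forms rely only on the boolean mask returned by isnull, so the column dtype should not matter. A small check on made-up mixed-type data (the column names are placeholders) shows the mean-based and sum-based filters selecting the same columns when no column sits exactly on the cutoff:

```python
import numpy as np
import pandas as pd

# Made-up sample data with 10 rows and mixed dtypes
df = pd.DataFrame({
    "name": ["x"] + [None] * 9,             # strings, 90% missing
    "score": [np.nan, np.nan] + [1.5] * 8,  # floats, 20% missing
    "count": list(range(10)),               # ints, nothing missing
})

# Fraction of NaN values per column below 0.8
by_mean = df.loc[:, df.isnull().mean() < .8]

# Number of NaN values per column below 80% of the row count
by_sum = df.loc[:, df.isnull().sum() < 0.8 * df.shape[0]]

print(list(by_mean.columns))  # ['score', 'count']
print(list(by_sum.columns))   # ['score', 'count']
```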

161

answered Sep 21 '22 05:09

jezrael

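df.dropna(thresh=int((100 - percent_NA_cols_required) * (len(df.columns) / 100)), inplace=True)

Basically, dropna takes thresh, the minimum number (int) of non-NA values a row must have to be kept; rows with fewer are removed. Here percent_NA_cols_required is the percentage of NA columns above which a row is removed. The built-in int replaces np.int, which was removed in NumPy 1.24. Note that this call drops rows; to drop columns, as the question asks, pass axis=1 and compute the threshold from len(df) instead of len(df.columns).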
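A runnable sketch of the row-wise call, under a few assumptions: the variable name percent_na_allowed and the sample data are made up, the built-in integer arithmetic stands in for np.int (removed in NumPy 1.24), and floor division is used instead of a float product to avoid rounding surprises. It returns a new DataFrame rather than using inplace=True:

```python
import numpy as np
import pandas as pd

percent_na_allowed = 80  # hypothetical: rows with more than 80% NaN are dropped

# Made-up sample data: 4 rows, 10 columns, all NaN to start with
data = np.full((4, 10), np.nan)
data[0, :] = 1.0   # row 0: no NaN
data[1, :2] = 1.0  # row 1: 2 non-NaN values (80% NaN)
data[2, :1] = 1.0  # row 2: 1 non-NaN value (90% NaN)
                   # row 3: all NaN
df = pd.DataFrame(data)

# Minimum number of non-NaN values a row needs in order to be kept
thresh = (100 - percent_na_allowed) * len(df.columns) // 100

kept = df.dropna(thresh=thresh)  # axis=0 is the default, so rows are dropped
print(thresh, list(kept.index))  # 2 [0, 1]
```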

answered Sep 22 '22 05:09

rakesh


Tags:

python

pandas

dataframe

nan

LookIntoEast
