
How to remove columns with too many missing values in Python

I'm working on a machine learning problem where many of the features have missing values. There are hundreds of features, and I would like to remove those with too many missing values (for example, features with more than 80% missing values). How can I do that in Python?

My data is a Pandas dataframe.

HHH asked Aug 04 '17 20:08



2 Answers

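You can use Pandas' dropna(). Its thresh argument is the minimum number of non-missing values a column needs in order to be kept, so dropping columns with more than 80% missing values means requiring at least 20% non-missing values.

import math
min_non_null = math.ceil(len(yourdf) * 0.20)  # a column must be at least 20% non-missing to be kept
yourdf = yourdf.dropna(thresh=min_non_null, axis=1)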
singmotor answered Oct 17 '22 08:10
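
Because `thresh` counts the non-missing values a column needs in order to survive, it is easy to get the direction of the cutoff backwards. The following is only a sketch on a made-up 10-row DataFrame (the name `toy`, its column names, and its values are hypothetical, not from the question) showing which columns remain when at least 20% non-missing values are required:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical 10-row frame: names and values are made up for illustration.
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,    # 90% missing
    "borderline": [1.0, 2.0] + [np.nan] * 8,   # exactly 80% missing
    "complete": list(range(10)),               # 0% missing
})

# thresh = minimum number of non-missing values a column needs to be kept.
min_non_null = math.ceil(len(toy) * 0.20)  # 2 of 10 rows
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))  # ['borderline', 'complete']
```

The column with exactly 80% missing values is kept, which matches a rule of removing only features with more than 80% missing.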


More generally, you can compute the fraction of missing values in each column, select the columns where that fraction is greater than 80%, and drop those columns from the DataFrame.

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)
vielkind answered Oct 17 '22 07:10
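
If the cutoff needs to be reused or tuned, the fraction-based approach can be wrapped in a small function that returns a new DataFrame instead of modifying the original in place. This is a minimal sketch, not part of either answer: `drop_sparse_columns`, `max_missing`, and the sample data are hypothetical names introduced here, and `isna().mean()` is used as a shorthand for `isnull().sum() / len(df)`.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing=0.80):
    """Return a copy of `frame` without columns whose missing fraction exceeds `max_missing`."""
    missing_fraction = frame.isna().mean()  # per-column share of NaN/None values
    too_sparse = missing_fraction[missing_fraction > max_missing].index
    return frame.drop(columns=too_sparse)


# Made-up data for illustration only.
example = pd.DataFrame({
    "a": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],  # 5 of 6 missing
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],           # 2 of 6 missing
})
cleaned = drop_sparse_columns(example)

print(list(cleaned.columns))  # ['b']
```

Returning a copy keeps the original frame available, which can be convenient when comparing several cutoffs on the same data.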