I'm working on a machine learning problem in which many feature values are missing. There are hundreds of features, and I would like to remove those that have too many missing values (for example, features with more than 80% missing values). How can I do that in Python?
My data is a Pandas dataframe.
DataFrame.dropna() removes rows or columns that contain NaN/None values (rows by default; pass axis=1 to drop columns). Python has no Null value, so missing data is represented as None or NaN.
Dropping columns with missing values in place: use the inplace flag of DataFrame.dropna(). With inplace=True, it modifies the DataFrame itself and returns None. With inplace=False (the default), it returns an updated copy and leaves the original DataFrame unchanged.
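As a rough illustration of the inplace behaviour, here is a minimal sketch on a made-up DataFrame. The column names a and b are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" contains one missing value.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, np.nan, 3.0]})

# inplace=False (the default): a new DataFrame is returned, df is untouched.
cleaned = df.dropna(axis=1)

# inplace=True: df itself is modified and the call returns None.
result = df.dropna(axis=1, inplace=True)
```

Because the in-place call returns None, assigning its result back to df would discard the data.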
Using loc[] to remove columns between specified columns: pass the columns selected by loc[] to drop() to remove every column from one column name through another. For example, df.drop(df.loc[:, 'Courses':'Fee'].columns, axis=1) drops the columns from 'Courses' through 'Fee', which are the first and second columns in that example. With inplace=True, the drop modifies the original object.
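A small sketch of that range-based drop, assuming a made-up DataFrame whose first two columns are named Courses and Fee (the third column, Duration, and all values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the names "Courses" and "Fee" come from the text.
df = pd.DataFrame({
    "Courses": ["Spark", "Pandas"],
    "Fee": [20000, 25000],
    "Duration": ["30days", "40days"],
})

# Label slices with loc are inclusive, so this selects "Courses" and "Fee".
cols_to_drop = df.loc[:, "Courses":"Fee"].columns

# drop returns a new DataFrame unless inplace=True is passed.
remaining = df.drop(columns=cols_to_drop)
```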
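You can use pandas' dropna(). Note that thresh is the minimum number of non-missing values a column must have to be kept, so dropping columns with more than 80% missing values means requiring at least 20% non-missing values:
import math
limitPer = math.ceil(len(yourdf) * 0.20)  # minimum non-missing values needed to keep a column
yourdf = yourdf.dropna(thresh=limitPer, axis=1)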
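A quick check of how thresh behaves, on made-up data (the column names and the max_missing variable are assumptions for the example). Since thresh counts non-missing values, a column with 9 of 10 values missing is dropped when at least 2 present values are required.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data with 10 rows.
df = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,    # 90% missing
    "half_missing": [1.0] * 5 + [np.nan] * 5,  # 50% missing
    "complete": list(range(10)),               # 0% missing
})

max_missing = 0.80  # drop columns with more than 80% missing values

# Keep a column only if at least 20% of its values are present.
min_present = math.ceil(len(df) * (1 - max_missing))

kept = df.dropna(axis=1, thresh=min_present)
```

Here only mostly_missing is removed; the other two columns meet the minimum count of present values.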
More generally, you can compute the fraction of missing values in each column, select the columns where more than 80% of the values are missing, and then drop those columns from the DataFrame.
pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)
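If the cutoff needs to vary, the same idea can be wrapped in a small helper. This is only a sketch: the function name drop_sparse_columns and the parameter max_missing are made up for illustration, and it returns a new DataFrame instead of modifying the input.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.80) -> pd.DataFrame:
    """Return a copy of df without columns whose missing fraction exceeds max_missing."""
    # isna().mean() gives the fraction of missing values per column.
    missing_fraction = df.isna().mean()
    sparse_columns = missing_fraction[missing_fraction > max_missing].index
    return df.drop(columns=sparse_columns)


# Hypothetical toy data with 10 rows.
data = pd.DataFrame({
    "sparse": [np.nan] * 9 + [1.0],       # 90% missing
    "dense": np.arange(10, dtype=float),  # 0% missing
})
reduced = drop_sparse_columns(data)
```

Returning a copy keeps the original data available, which can be handy when comparing models trained with different cutoffs.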