I have a dataframe with around 60 columns and 2 million rows. Some of the columns are mostly empty. I calculated the % of null values in each column using this function.
    import pandas as pd

    def missing_values_table(df):
        # Null count per column, and the same count as a percentage of all rows
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
            columns={0: 'Missing Values', 1: '% of Total Values'})
        return mis_val_table_ren_columns
Now I want to drop the columns that have more than 80% (for example) of their values missing. I tried the following code, but it does not seem to work.
df = df.drop(df.columns[df.apply(lambda col: col.isnull().sum()/len(df) > 0.80)], axis=1)
Thank you in advance. Hope I'm not missing something very basic
I receive this error
TypeError: ("'generator' object is not callable", u'occurred at index Unique_Key')
To drop columns that contain NA values, pass axis='columns' (or axis=1) to DataFrame.dropna(). With the default how='any', this removes every column in which one or more values are missing.
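As a minimal sketch of the difference between how='any' and how='all' when dropping columns, assuming a small made-up frame (the name toy and its columns are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" is complete, "b" has one gap, "c" is entirely missing
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, np.nan],
})

# how="any" is the default: a single missing value is enough to drop the column
any_missing_dropped = toy.dropna(axis="columns")

# how="all" only drops columns in which every value is missing
all_missing_dropped = toy.dropna(axis="columns", how="all")

print(list(any_missing_dropped.columns))
print(list(all_missing_dropped.columns))
```

Neither setting expresses "more than 80% missing", which is why the question needs a threshold rather than how.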
Pandas drop multiple columns by index: you can use df.columns[[index1, index2, indexn]] to get the column names at those positions and pass them to the drop() method. Note that positions are 0-based: use 0 to delete the first column, 1 to delete the second column, and so on.
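A short sketch of dropping by position, assuming a made-up four-column frame (frame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with four columns at positions 0, 1, 2, 3
frame = pd.DataFrame({"w": [1], "x": [2], "y": [3], "z": [4]})

# frame.columns[[0, 2]] looks up the labels at positions 0 and 2 ("w" and "y")
labels_to_drop = frame.columns[[0, 2]]

# drop() returns a new frame without those labels; frame itself is unchanged
trimmed = frame.drop(columns=labels_to_drop)

print(list(trimmed.columns))
```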
Use the drop() method to delete rows based on a column value in a pandas DataFrame. As part of data cleansing, you may need to drop rows where a column's value matches a static value or the value of another column.
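Both cases can be sketched by selecting the matching rows and passing their index labels to drop(). The orders frame and its columns below are invented for illustration:

```python
import pandas as pd

# Hypothetical data: one row has a "void" status, two rows have billed == paid
orders = pd.DataFrame({
    "status": ["ok", "void", "ok"],
    "billed": [10, 5, 7],
    "paid": [10, 5, 3],
})

# Drop rows where a column matches a static value
no_void = orders.drop(orders[orders["status"] == "void"].index)

# Drop rows where one column equals another column
still_owing = orders.drop(orders[orders["billed"] == orders["paid"]].index)

print(no_void)
print(still_owing)
```

Filtering with a boolean mask, such as orders[orders["status"] != "void"], gives the same rows and is often the simpler choice.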
By using the pandas DataFrame.dropna() method with axis=1, you can drop columns with NaN (Not a Number) or None values from a DataFrame. Note that by default it returns a copy of the DataFrame with those columns removed rather than modifying the original.
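You can use dropna() with the thresh parameter, which sets the minimum number of non-null values a column must have to be kept. Requiring at least 20% non-null values drops the columns with more than 80% missing: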
thresh = len(df) * .2
df.dropna(thresh = thresh, axis = 1, inplace = True)
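The same selection can also be written as a boolean mask over the per-column missing fraction, which may be easier to read. This is a sketch on a made-up frame (sample and its column names are assumptions), not a change to the answer above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 10 rows: 0%, 50% and 90% missing
sample = pd.DataFrame({
    "full": list(range(10)),
    "half": [1.0] * 5 + [np.nan] * 5,
    "sparse": [1.0] + [np.nan] * 9,
})

# isnull().mean() gives the fraction of missing values in each column
missing_fraction = sample.isnull().mean()

# Keep only the columns whose missing fraction is at most 80%
kept = sample.loc[:, missing_fraction <= 0.80]

print(missing_fraction)
print(list(kept.columns))
```

Unlike inplace=True, this leaves sample untouched and returns the reduced frame.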
    def missing_values(df, percentage):
        # percentage is on a 0-100 scale, e.g. 80 drops columns more than 80% missing
        columns = df.columns
        percent_missing = df.isnull().sum() * 100 / len(df)
        missing_value_df = pd.DataFrame({'column_name': columns,
                                         'percent_missing': percent_missing})
        missing_drop = list(missing_value_df[missing_value_df.percent_missing > percentage].column_name)
        df = df.drop(missing_drop, axis=1)
        return df