Drop columns in a pandas dataframe based on the % of null values

Tags: python, pandas

I have a dataframe with around 60 columns and 2 million rows. Some of the columns are mostly empty. I calculated the % of null values in each column using this function.

import pandas as pd

def missing_values_table(df):
    # Count of null values per column
    mis_val = df.isnull().sum()
    # Percentage of null values per column
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Give the two unnamed columns descriptive names
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    return mis_val_table_ren_columns

Now I want to drop the columns that have more than 80% (for example) of their values missing. I tried the following code, but it does not work.

df = df.drop(df.columns[df.apply(lambda col: col.isnull().sum()/len(df) > 0.80)], axis=1)

Thank you in advance. I hope I'm not missing something very basic.

I receive this error:

TypeError: ("'generator' object is not callable", u'occurred at index Unique_Key')

user2656075 asked Oct 25 '17 18:10



2 Answers

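You can use dropna() with the thresh parameter. thresh is the minimum number of non-null values a column must contain to be kept, so to drop columns that are more than 80% null, require at least 20% non-null values:

# Keep only the columns with at least 20% non-null values
thresh = len(df) * .2
df.dropna(thresh=thresh, axis=1, inplace=True)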
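As a small illustration of how thresh behaves, here is a sketch on a made-up five-row frame. The column names, the values, and the min_present_pct variable are assumptions for the example, not the asker's data. It rounds the threshold up to an integer, since the pandas documentation describes thresh as an int.

```python
import math

import numpy as np
import pandas as pd

# Toy data (hypothetical): columns with 0%, 80% and 100% nulls
df = pd.DataFrame({
    "full": [1, 2, 3, 4, 5],
    "mostly_empty": [1.0, np.nan, np.nan, np.nan, np.nan],
    "empty": [np.nan] * 5,
})

# Dropping columns that are MORE than 80% null is the same as
# keeping columns that are at least 20% non-null.
min_present_pct = 20
thresh = math.ceil(len(df) * min_present_pct / 100)

# Non-inplace variant: returns a new frame and leaves df untouched
result = df.dropna(axis=1, thresh=thresh)
print(list(result.columns))
```

Note that a column with exactly 80% nulls is kept here, which matches "more than 80%" in the question.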
Vaishali answered Nov 02 '22 19:11


def missing_values(df, percentage):
    """Return df without the columns whose % of null values exceeds percentage (0-100)."""
    # Percentage of null values per column
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'column_name': df.columns,
                                     'percent_missing': percent_missing})

    # Names of the columns above the cutoff
    missing_drop = list(missing_value_df[missing_value_df.percent_missing > percentage].column_name)
    return df.drop(missing_drop, axis=1)
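The same filtering can also be sketched without the intermediate table, using a boolean mask over the columns. This is an alternative formulation, not the code from this answer. The function name drop_sparse_columns and the toy frame are assumptions for illustration. It relies on df.isnull().mean() giving the fraction of nulls per column, because the mean of a boolean column is the share of True values.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct):
    """Return df without columns whose null share exceeds max_missing_pct (0-100)."""
    # Mean of the boolean null mask = fraction of nulls per column
    pct_missing = df.isnull().mean() * 100
    # Keep only the columns at or below the cutoff
    return df.loc[:, pct_missing <= max_missing_pct]

# Toy data (hypothetical)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [np.nan] * 4,                   # 100% missing
})
kept = drop_sparse_columns(toy, 80)
print(list(kept.columns))
```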
Frederico Guerra answered Nov 02 '22 20:11