Dropping columns with >N NaNs excluding specific columns

I'm wondering if there is a concise way to drop all columns with more than N NaNs, while exempting one specific column from that check.

For example:

df = pd.DataFrame([[np.nan, 2, np.nan, 0], 
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

Results in:

    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   NaN NaN NaN 5

Running the following, I get:

df.dropna(thresh=2, axis=1)

    B   D
0   2.0 0
1   4.0 1
2   NaN 5

I would like to keep column 'C', i.e., apply this thresholding to every column except 'C'.

Is that possible?

pceccon asked Jan 29 '23 13:01


1 Answer

You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.

import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])

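You could also do it in separate steps. Neither dropna nor assign modifies df in place, so each result has to be assigned back:

C = df['C']                        # save the column before it gets dropped
df = df.dropna(thresh=2, axis=1)   # drops A and C
df = df.assign(C=C)                # put C back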
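
One side effect of putting the column back with assign is that C ends up as the last column rather than in its original position. If the order matters, a minimal sketch of one way to restore it is below. The names result and ordered are only illustrative, and filtering df.columns is just one option among several.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

# assign() appends C after the columns that survived the threshold
result = df.dropna(thresh=2, axis=1).assign(C=df['C'])

# Restore the original relative order by walking df.columns
# and keeping only the names still present in result
ordered = result[[col for col in df.columns if col in result.columns]]

print(list(result.columns))   # ['B', 'D', 'C']
print(list(ordered.columns))  # ['B', 'C', 'D']
```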

As suggested by @Wen, you can also do an indexing operation that won't remove column C to begin with.

threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]

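The boolean mask passed to loc selects columns that have fewer than threshold NaN values, or whose name is C. Note that this condition counts NaNs, whereas the thresh argument of dropna counts non-NaN values; on this three-row example the two happen to select the same columns. If you want to exempt more than one column, you can chain more conditions with the "or" operator |. For example: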

df = df.loc[
    :,
    (df.isnull().sum(0) < threshold) |
    (df.columns == 'C') |
    (df.columns == 'D')]
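
If the list of exempted columns grows, chaining == comparisons gets unwieldy. As a rough sketch, the same mask can be built with Index.isin and wrapped in a helper. The function name drop_sparse_columns and its parameters max_nans and keep are hypothetical, not part of pandas. It also follows the question's wording ("more than N NaNs" are dropped), so max_nans=1 corresponds to threshold = 2 above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nans, keep=()):
    """Drop columns with more than max_nans NaNs, except those named in keep.

    Hypothetical helper for illustration; not a pandas API.
    """
    # True for columns whose NaN count is within the allowed limit
    few_nans = (df.isnull().sum(axis=0) <= max_nans).to_numpy()
    # True for columns that are exempt from the check
    protected = df.columns.isin(list(keep))
    return df.loc[:, few_nans | protected]

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

kept = drop_sparse_columns(df, max_nans=1, keep=['C'])
print(list(kept.columns))  # ['B', 'C', 'D']
```

Because the mask is applied with loc, the surviving columns keep their original order, unlike the assign approach.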

Jeremy McGibbon answered Feb 03 '23 07:02