Dropping columns with more than N NaNs, excluding specific columns

Question

I'm wondering if there is a concise way to drop all columns with more than N NaNs, while exempting one column from the check.

For example:

df = pd.DataFrame([[np.nan, 2, np.nan, 0], 
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

Results in:

    A   B   C   D
0   NaN 2.0 NaN 0
1   3.0 4.0 NaN 1
2   NaN NaN NaN 5

Running the following, I get:

df.dropna(thresh=2, axis=1)

    B   D
0   2.0 0
1   4.0 1
2   NaN 5

I would like to keep column 'C', i.e., apply this thresholding to every column except 'C'.

Is that possible?

Jeremy McGibbon · Accepted Answer

You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.

import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
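One side effect worth knowing: because C is dropped and then re-added with assign, it ends up as the last column. A minimal sketch of one way to restore the original column order, assuming that order matters to you (the names `result` and `ordered` are just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

# thresh=2 keeps columns with at least 2 non-NaN values (B and D here);
# assign then re-adds C, which lands at the end: B, D, C.
result = df.dropna(thresh=2, axis=1).assign(C=df['C'])

# Reselect the surviving columns in their original left-to-right order: B, C, D.
ordered = result[[col for col in df.columns if col in result.columns]]
```

If the column order is irrelevant for your use case, the one-liner above is enough on its own.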

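You could also do this in separate steps, as long as you assign the results back (neither dropna nor assign modifies df in place):

C = df['C']                        # keep a reference to the column
df = df.dropna(thresh=2, axis=1)   # drop the sparse columns (this removes C)
df = df.assign(C=C)                # put C back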
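Since the question is phrased in terms of "more than N NaNs", it may help to see how that maps onto thresh, which counts non-NaN values rather than NaNs. The sketch below assumes a hypothetical N of 1 (`max_nans`) and checks that the two formulations select the same columns on the example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

max_nans = 1  # hypothetical N: drop columns with more than this many NaNs

# A column with at most max_nans NaNs has at least len(df) - max_nans non-NaN values.
min_non_nan = len(df) - max_nans
via_thresh = df.dropna(thresh=min_non_nan, axis=1)

# The same selection written as an explicit NaN count.
via_count = df.loc[:, df.isna().sum(axis=0) <= max_nans]
```

Both versions keep B and D here; neither exempts C yet.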

As suggested by @Wen, you can also do an indexing operation that won't remove column C to begin with.

threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]

The column indexer here selects columns that have fewer than threshold NaN values, or whose name is C. Note that threshold here counts NaNs, whereas the thresh argument of dropna counts non-NaN values. If you want to exempt more than one column, you can chain more conditions with the "or" operator |. For example:

df = df.loc[
    :,
    (df.isnull().sum(0) < threshold) |
    (df.columns == 'C') |
    (df.columns == 'D')]
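When the list of exempt columns grows, chaining comparisons gets unwieldy. One possible alternative is to test membership with Index.isin. The helper below is only a sketch: the function name `drop_sparse_columns` and its parameters are made up for illustration, and it uses `<=` so that exactly "more than max_nans" NaNs triggers a drop, which differs slightly from the strict `<` used above.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nans, keep=()):
    """Drop columns with more than max_nans NaNs, except those named in keep."""
    nan_counts = df.isna().sum(axis=0)         # NaNs per column
    exempt = df.columns.isin(list(keep))       # True for columns to always keep
    return df.loc[:, (nan_counts <= max_nans) | exempt]

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

kept = drop_sparse_columns(df, max_nans=1, keep=['C'])   # columns B, C, D
plain = drop_sparse_columns(df, max_nans=1)              # columns B, D
```

Because the mask is applied to the original frame, the surviving columns stay in their original order.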

Tags: python, pandas, nan, filtering

pceccon
