I have a pandas DataFrame with many NaN values in it.
How can I drop every column where number_of_na_values > 2000?
I tried to do it like this:
# count NaNs per column and collect the names of columns with more than 2000
toRemove = set()
naNumbersPerColumn = df.isnull().sum()
for i in naNumbersPerColumn.index:
    if naNumbersPerColumn[i] > 2000:
        toRemove.add(i)
# drop the collected columns one by one
for i in toRemove:
    df.drop(i, axis=1, inplace=True)
Is there a more elegant way to do it?
Here's another alternative that keeps only the columns with at most the specified number of NaNs:
max_number_of_nans = 3000
df = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
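If you prefer a built-in, the same rule can be expressed with dropna and its thresh argument (the minimum number of non-NaN values a column needs in order to be kept); a minimal equivalent sketch, reusing max_number_of_nans from above:
# keep columns that have at least len(df) - max_number_of_nans non-NaN values
df = df.dropna(axis=1, thresh=len(df) - max_number_of_nans)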
In my tests this seems to be slightly faster than the drop-columns method suggested by Jianxun Li in the cases I tested (as shown below). I should note, though, that the performance becomes much more similar if you simply avoid the apply method, e.g. df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1). Just a reminder: when it comes to performance in pandas, vectorization almost always wins out over apply.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(10000,5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5010
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 1.1 ms ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 1.3 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 2.11 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Performance often varies with data size, so don't forget to benchmark whichever case is closest to your own data.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 5), columns=list('ABCDE'))
df[df < 0] = np.nan
max_number_of_nans = 5
%timeit c = df.loc[:, (df.isnull().sum(axis=0) <= max_number_of_nans)]
>> 755 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
>> 777 µs ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit c = df.drop(df.columns[df.apply(lambda col: col.isnull().sum() > max_number_of_nans)], axis=1)
>> 1.71 ms ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
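As a quick sanity check before trusting the timings, you can confirm the variants actually select the same columns; a small sketch using pandas' testing helper, reusing the benchmark df and threshold from the setup above:
# the loc-based and drop-based variants should produce identical frames
a = df.loc[:, df.isnull().sum(axis=0) <= max_number_of_nans]
b = df.drop(df.columns[df.isnull().sum(axis=0) > max_number_of_nans], axis=1)
pd.testing.assert_frame_equal(a, b)  # raises AssertionError if they differ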