Pandas dropna - store dropped rows

I am using the pandas.DataFrame.dropna method to drop rows that contain NaN. This function returns a dataframe that excludes the dropped rows, as shown in the documentation.

How can I store a copy of the dropped rows as a separate dataframe? Is:

mydataframe[pd.isnull(['list', 'of', 'columns'])] 

always guaranteed to return the same rows that dropna drops, assuming that dropna is called with subset=['list', 'of', 'columns']?

wesanyer asked Dec 15 '15 18:12

2 Answers

You can do this by selecting from the original DataFrame every row whose index label is not in the NaN-free DataFrame, using the unary ~ (invert) operator:

na_free = df.dropna()
only_na = df[~df.index.isin(na_free.index)]

Another option would be to use the ufunc implementation of ~.

only_na = df[np.invert(df.index.isin(na_free.index))] 
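
Here is a minimal sketch of the same idea applied with a subset, as in the question. The sample data and column names are made up for illustration. Note that this approach assumes the index labels are unique: if two rows share a label and only one of them is dropped, isin will still match the label and the dropped row will be missed.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "col1": ["a", np.nan, "c"],
        "col2": ["b", "c", "d"],
        "col3": [np.nan, "c", "a"],
    }
)

subset = ["col2", "col3"]

# Rows that survive dropna when only the subset columns are checked.
na_free = df.dropna(subset=subset)

# Rows whose index label is missing from na_free are the dropped rows.
# This relies on the index labels being unique.
only_na = df[~df.index.isin(na_free.index)]
```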
anmol answered Oct 13 '22 15:10

I was going to leave a comment, but figured I'd write an answer as it started getting fairly complicated. Start with the following data frame:

import pandas as pd
import numpy as np

df = pd.DataFrame([['a', 'b', np.nan], [np.nan, 'c', 'c'], ['c', 'd', 'a']],
                  columns=['col1', 'col2', 'col3'])
df

  col1 col2 col3
0    a    b  NaN
1  NaN    c    c
2    c    d    a

Say we want to keep the rows with NaNs in columns col2 and col3. One way to do this, based on the answers from this post, is the following:

df.loc[pd.isnull(df[['col2', 'col3']]).any(axis=1)]

  col1 col2 col3
0    a    b  NaN

So this gives us the rows that would be dropped if we dropped rows with NaNs in the columns of interest. To keep the remaining rows instead, we can run the same code but use ~ to invert the selection:

df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)]

  col1 col2 col3
1  NaN    c    c
2    c    d    a

This is equivalent to:

df.dropna(subset=['col2', 'col3']) 

Which we can test:

df.dropna(subset=['col2', 'col3']).equals(df.loc[~pd.isnull(df[['col2', 'col3']]).any(axis=1)])

True

You can of course test this on your own, larger dataframes, but you should get the same answer.
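
To get both pieces in one step, the mask above can be wrapped in a small helper. This is only a sketch: split_on_na is a hypothetical name, not a pandas function, and it covers only the default dropna behaviour (how='any' on rows). Because it selects by boolean mask rather than by index label, it should also behave sensibly when index labels repeat, which the example below uses on purpose.

```python
import numpy as np
import pandas as pd


def split_on_na(df, subset=None):
    """Hypothetical helper: return (kept, dropped) for row-wise dropna.

    kept is meant to match df.dropna(subset=subset); dropped holds the rest.
    """
    cols = df.columns if subset is None else subset
    # True for every row with at least one NaN in the columns of interest.
    mask = df[cols].isna().any(axis=1)
    return df[~mask], df[mask]


# Same data as above, but with a repeated index label (0 appears twice).
df = pd.DataFrame(
    [["a", "b", np.nan], [np.nan, "c", "c"], ["c", "d", "a"]],
    columns=["col1", "col2", "col3"],
    index=[0, 0, 1],
)

kept, dropped = split_on_na(df, subset=["col2", "col3"])
```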

johnchase answered Oct 13 '22 14:10