
Filter out rows with more than certain number of NaN

In a Pandas dataframe, I would like to filter out all the rows that have more than 2 NaNs.

Essentially, I have 4 columns and I would like to keep only those rows where at least 2 columns have finite values.

Can somebody advise on how to achieve this?

AMM asked Apr 21 '14 18:04



2 Answers

You have phrased 2 slightly different questions here. In the general case, they have different answers.

I would like to keep only those rows where at least 2 columns have finite values.

df = df.dropna(thresh=2)

This keeps rows with 2 or more non-null values.


I would like to filter out all the rows that have more than 2 NaNs

df = df.dropna(thresh=df.shape[1]-2)

This filters out rows with more than 2 null values, i.e. it keeps rows with at least df.shape[1] - 2 non-null values.

In your example dataframe of 4 columns, these operations are equivalent, since df.shape[1] - 2 == 2. However, you will notice discrepancies with dataframes which do not have exactly 4 columns.
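To make the discrepancy concrete, here is a minimal sketch on a 5-column frame. The frame `df5` and its column names are made up for illustration; only `dropna` and its `thresh` argument come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-column frame; rows 0..3 contain 0, 2, 3 and 4 NaNs respectively.
df5 = pd.DataFrame(
    {
        "a": [1.0, 1.0, 1.0, 1.0],
        "b": [2.0, 2.0, 2.0, np.nan],
        "c": [3.0, 3.0, np.nan, np.nan],
        "d": [4.0, np.nan, np.nan, np.nan],
        "e": [5.0, np.nan, np.nan, np.nan],
    }
)

# "At least 2 non-null values": row 2 (3 NaNs, 2 values) is kept.
at_least_two_values = df5.dropna(thresh=2)

# "No more than 2 NaNs": thresh is 5 - 2 = 3, so row 2 is dropped.
at_most_two_nans = df5.dropna(thresh=df5.shape[1] - 2)

print(list(at_least_two_values.index))
print(list(at_most_two_nans.index))
```

Row 2 is the one that separates the two readings of the question: it has enough values to pass the first filter but too many NaNs to pass the second.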


Note that dropna also has a subset argument, should you wish to count only specified columns when applying the threshold. For example:

df = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
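As a rough illustration of how subset changes the count, the sketch below uses a made-up frame with an extra column named `other` (an assumption, not part of the question). Only non-null values in the subset columns count toward thresh.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "other" is always populated, the col* columns are sparse.
df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, np.nan],
        "col2": [2.0, 2.0, np.nan],
        "col3": [np.nan, np.nan, 3.0],
        "other": [9.0, 9.0, 9.0],
    }
)

# Counting across all columns, every row has at least 2 non-null values.
all_columns = df.dropna(thresh=2)

# Counting only col1..col3, just row 0 has 2 non-null values.
subset_only = df.dropna(subset=["col1", "col2", "col3"], thresh=2)

print(list(all_columns.index))
print(list(subset_only.index))
```

The rows that survive are still returned with all of their columns; subset only controls which columns are inspected.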
jpp answered Oct 17 '22 17:10


The following should work

df.dropna(thresh=2)

See the online docs

What we are doing here is keeping only the rows that have 2 or more non-NaN values; any row with fewer than 2 non-NaN values is dropped.

Example:

In [25]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})

df

Out[25]:

    a   b   c   d
0   1 NaN   1   1
1   2   2   2   2
2 NaN NaN NaN   3
3   4   4 NaN NaN
4   5   5 NaN   5

[5 rows x 4 columns]

In [26]:

df.dropna(thresh=2)

Out[26]:

   a   b   c   d
0  1 NaN   1   1
1  2   2   2   2
3  4   4 NaN NaN
4  5   5 NaN   5

[4 rows x 4 columns]

EDIT

This works for the example above, but note that you have to know the number of columns and set the thresh value accordingly. I originally thought thresh meant the number of NaN values, but it actually means the number of non-NaN values.
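If counting NaNs directly feels more natural than counting non-NaN values, one alternative (a different technique from dropna, shown here only as a sketch) is to build a boolean mask from `isna().sum(axis=1)`. It reuses the example frame from this answer; the names `nan_counts` and `kept` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, np.nan, 4, 5],
        "b": [np.nan, 2, np.nan, 4, 5],
        "c": [1, 2, np.nan, np.nan, np.nan],
        "d": [1, 2, 3, np.nan, 5],
    }
)

# Number of NaN values in each row.
nan_counts = df.isna().sum(axis=1)

# Keep rows with at most 2 NaNs, i.e. filter out rows with more than 2.
kept = df[nan_counts <= 2]

print(nan_counts.tolist())
print(list(kept.index))
```

This does not depend on the number of columns, and for this frame it selects the same rows as df.dropna(thresh=df.shape[1] - 2).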

EdChum answered Oct 17 '22 19:10