In a Pandas dataframe, I would like to filter out all the rows that have more than 2 NaNs.
Essentially, I have 4 columns and I would like to keep only those rows where at least 2 columns have finite values.
Can somebody advise on how to achieve this?
Filter out NaN rows (data selection) by using the DataFrame.dropna() method. It accepts a thresh argument: df.dropna(thresh=2) keeps only the rows that have at least two non-NaN values and drops all other rows.
In Spark, the filter() or where() functions of a DataFrame can filter rows with NULL values by checking IS NULL or isNull(). This removes all rows with null values in the state column and returns a new DataFrame. All of the above examples return the same output.
You have phrased 2 slightly different questions here. In the general case, they have different answers.
I would like to keep only those rows where at least 2 columns have finite values.
df = df.dropna(thresh=2)
This keeps rows with 2 or more non-null values.
I would like to filter out all the rows that have more than 2 NaNs
df = df.dropna(thresh=df.shape[1]-2)
This filters out rows with more than 2 null values, i.e. it keeps rows with at most 2 nulls.
In your example dataframe of 4 columns, these operations are equivalent, since df.shape[1] - 2 == 2. However, you will notice discrepancies with dataframes which do not have exactly 4 columns.
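To illustrate the discrepancy, here is a minimal sketch using a made-up 5-column dataframe (the column names and values are assumptions for illustration only, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical 5-column frame: the rows have 0, 2, 3 and 4 NaNs respectively.
df = pd.DataFrame(
    {
        "a": [1.0, 1.0, 1.0, 1.0],
        "b": [2.0, 2.0, 2.0, np.nan],
        "c": [3.0, 3.0, np.nan, np.nan],
        "d": [4.0, np.nan, np.nan, np.nan],
        "e": [5.0, np.nan, np.nan, np.nan],
    }
)

# Keep rows with at least 2 non-null values.
at_least_two_values = df.dropna(thresh=2)

# Keep rows with at most 2 nulls (here thresh == 5 - 2 == 3).
at_most_two_nans = df.dropna(thresh=df.shape[1] - 2)

print(list(at_least_two_values.index))
print(list(at_most_two_nans.index))
```

With 5 columns, the row holding 3 NaNs still has 2 non-null values, so the first call keeps it while the second drops it.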
Note that dropna also has a subset argument, should you wish to include only specified columns when applying a threshold. For example:
df = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
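As a rough sketch of how subset changes the count, consider the following made-up dataframe (col4 and all values are assumptions added for illustration). Only non-null values in the listed columns count towards thresh:

```python
import numpy as np
import pandas as pd

# Hypothetical data: col4 is always populated but is left out of the subset.
df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, np.nan],
        "col2": [2.0, 2.0, np.nan],
        "col3": [np.nan, 3.0, 3.0],
        "col4": [4.0, 4.0, 4.0],
    }
)

# Count non-null values only in col1..col3 when applying the threshold.
with_subset = df.dropna(subset=["col1", "col2", "col3"], thresh=2)

# Count non-null values across every column.
without_subset = df.dropna(thresh=2)

print(list(with_subset.index))
print(list(without_subset.index))
```

The last row has only one non-null value among col1, col2 and col3, so it is dropped when subset is given, but it survives the unrestricted call because col4 brings its count up to 2.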
The following should work
df.dropna(thresh=2)
See the online docs
What we are doing here is dropping any rows that have fewer than 2 non-NaN values, i.e. keeping only the rows with 2 or more non-NaN values.
Example:
In [25]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})
df
Out[25]:
     a    b    c    d
0    1  NaN    1    1
1    2    2    2    2
2  NaN  NaN  NaN    3
3    4    4  NaN  NaN
4    5    5  NaN    5
[5 rows x 4 columns]
In [26]:
df.dropna(thresh=2)
Out[26]:
     a    b    c    d
0    1  NaN    1    1
1    2    2    2    2
3    4    4  NaN  NaN
4    5    5  NaN    5
[4 rows x 4 columns]
EDIT
For the above example it works, but note that you have to know the number of columns and set the thresh value appropriately. I originally thought it meant the number of NaN values, but it actually means the number of non-NaN values.
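If you would rather think in terms of the number of NaN values, one alternative (not part of the dropna approach above, just a sketch) is to count the NaNs per row yourself and filter with a boolean mask. The variable name max_nans is an assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1, 2, np.nan, 4, 5],
        "b": [np.nan, 2, np.nan, 4, 5],
        "c": [1, 2, np.nan, np.nan, np.nan],
        "d": [1, 2, 3, np.nan, 5],
    }
)

max_nans = 2  # hypothetical name: the most NaNs a row may have and still be kept

# isna() marks each NaN cell as True; summing along axis=1 counts NaNs per row.
nan_counts = df.isna().sum(axis=1)
kept = df[nan_counts <= max_nans]

# For comparison, the equivalent dropna call expressed via non-NaN counts.
same_via_dropna = df.dropna(thresh=df.shape[1] - max_nans)

print(list(kept.index))
```

This form does not depend on knowing the number of columns in advance, at the cost of being slightly more verbose than dropna.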