Python pandas - remove group based on collective NaN count

Tags: python, pandas

I have a dataset based on different weather stations for several variables (Temperature, Pressure, etc.),

stationID | Time | Temperature | Pressure |...
----------+------+-------------+----------+
123       |  1   |     30      |  1010.5  |
123       |  2   |     31      |  1009.0  |
202       |  1   |     24      |  NaN     |
202       |  2   |     24.3    |  NaN     |
202       |  3   |     NaN     |  1000.3  |
...

And I would like to remove 'stationID' groups, which have more than a certain number of NaNs (taking into account all variables in the count).

If I try,

df.loc[df.groupby('stationID')['Temperature'].filter(lambda x: x.isnull().sum() < 30).index]

it works, as shown here: Python pandas - remove groups based on NaN count threshold

But the above example only takes the temperature column into account. How can I count the NaNs across all available variables collectively? That is, I would like to remove a group when the total number of NaNs in [variable1, variable2, variable3, ...] exceeds a threshold, and keep it otherwise.

Michel Mesquita asked Mar 11 '23 15:03



1 Answer

This should work:

df.groupby('stationID').filter(lambda g: g.isnull().sum().sum() < 4)

Replace 4 with whatever threshold you need. A station is kept only if its total NaN count, summed over all columns, is below that number.

df.groupby('stationID').filter(lambda g: g.isnull().sum().sum() < 4)

   stationID    Time    Temperature Pressure
0        123       1           30.0   1010.5
1        123       2           31.0   1009.0
2        202       1           24.0      NaN
3        202       2           24.3      NaN
4        202       3            NaN   1000.3


df.groupby('stationID').filter(lambda g: g.isnull().sum().sum() < 3)

   stationID    Time    Temperature Pressure
0        123       1           30.0   1010.5
1        123       2           31.0   1009.0
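
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample table from the question and applies the same idea. The names value_cols and max_nans are my own placeholders, not part of the original answer, and limiting the count to a chosen list of columns is an assumption about what "[variable1, variable2, ...]" means in the question. The second option is an alternative that avoids the Python lambda, which may help on larger frames, though I have not benchmarked it.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "stationID": [123, 123, 202, 202, 202],
    "Time": [1, 2, 1, 2, 3],
    "Temperature": [30.0, 31.0, 24.0, 24.3, np.nan],
    "Pressure": [1010.5, 1009.0, np.nan, np.nan, 1000.3],
})

# Hypothetical settings for illustration: columns to count and the cutoff.
value_cols = ["Temperature", "Pressure"]
max_nans = 3  # keep a station only if it has fewer than this many NaNs

# Option 1: groupby().filter(), counting NaNs only in the chosen columns.
kept_filter = df.groupby("stationID").filter(
    lambda g: g[value_cols].isna().sum().sum() < max_nans
)

# Option 2: count NaNs per row, then sum those counts within each station
# and broadcast the total back to every row of that station.
nans_per_row = df[value_cols].isna().sum(axis=1)
nans_per_station = nans_per_row.groupby(df["stationID"]).transform("sum")
kept_mask = df[nans_per_station < max_nans]

print(kept_filter)
```

With max_nans = 3, station 202 (three NaNs in total) is dropped and only station 123 remains; with max_nans = 4 both stations are kept, matching the two outputs above.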
Psidom answered Mar 24 '23 02:03
