I have a dataset from different weather stations with several variables (Temperature, Pressure, etc.):
stationID | Time | Temperature | Pressure |...
----------+------+-------------+----------+
123 | 1 | 30 | 1010.5 |
123 | 2 | 31 | 1009.0 |
202 | 1 | 24 | NaN |
202 | 2 | 24.3 | NaN |
202 | 3 | NaN | 1000.3 |
...
I would like to remove the 'stationID' groups that have more than a certain number of NaNs (counting NaNs across all variables).
If I try,
df.loc[df.groupby('stationID')['Temperature'].filter(lambda x: len(x[pd.isnull(x)]) < 30).index]
it works, as shown here: Python pandas - remove groups based on NaN count threshold
But the above example takes only 'Temperature' into account. So, how can I take into account the collective sum of NaNs across the available variables? I.e., I would like to keep only the groups where the collective sum of NaNs in [variable1, variable2, variable3,...] is less than a threshold, and remove the rest.
This should work:
df.groupby('stationID').filter(lambda g: g.isnull().sum().sum() < 4)
Groups whose total NaN count is below the threshold are kept. You can replace 4 with whatever threshold you need.
df.groupby('stationID').filter(lambda g: g.isnull().sum().sum() < 4)
   stationID  Time  Temperature  Pressure
0        123     1         30.0    1010.5
1        123     2         31.0    1009.0
2        202     1         24.0       NaN
3        202     2         24.3       NaN
4        202     3          NaN    1000.3
df.groupby('stationID').filter(lambda g: g.isnull().sum().sum() < 3)
   stationID  Time  Temperature  Pressure
0        123     1         30.0    1010.5
1        123     2         31.0    1009.0
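If you want to reproduce this, or count NaNs over only some of the columns, here is a minimal sketch. The sample frame just rebuilds the rows shown in the question, and `drop_sparse_groups` is a hypothetical helper name (it is not part of pandas). It uses `transform("sum")` instead of `filter` with a lambda, which should give the same rows here and may be faster on large frames, though I have not benchmarked it.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the rows shown in the question.
df = pd.DataFrame({
    "stationID": [123, 123, 202, 202, 202],
    "Time": [1, 2, 1, 2, 3],
    "Temperature": [30, 31, 24, 24.3, np.nan],
    "Pressure": [1010.5, 1009.0, np.nan, np.nan, 1000.3],
})


def drop_sparse_groups(frame, key, threshold, columns=None):
    """Keep only the groups whose total NaN count over `columns` is below `threshold`.

    Hypothetical helper for illustration. If `columns` is None, every
    column except the grouping key is counted.
    """
    if columns is None:
        columns = [c for c in frame.columns if c != key]
    # Number of NaNs in each row, across the chosen columns.
    nan_per_row = frame[columns].isna().sum(axis=1)
    # Total NaNs per group, broadcast back onto every row of that group.
    nan_per_group = nan_per_row.groupby(frame[key]).transform("sum")
    return frame[nan_per_group < threshold]


kept_4 = drop_sparse_groups(df, "stationID", 4)   # both stations kept
kept_3 = drop_sparse_groups(df, "stationID", 3)   # station 202 (3 NaNs) dropped
# Count NaNs in 'Temperature' only: station 202 has one, so threshold 1 drops it.
kept_temp = drop_sparse_groups(df, "stationID", 1, columns=["Temperature"])
```

Passing `columns` lets you restrict the count to the variables you care about, so an ID or time column never contributes to the total.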