I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove 'stationID' groups, which have more than a certain number of NaNs. For instance, if I type:
**>>> df.groupby('stationID')**
then, I would like to drop groups that have (at least) a certain number of NaNs (say 30) within a group. As I understand it, I cannot use dropna(thresh=10) with groupby:
**>>> df2.groupby('station').dropna(thresh=30)**
*AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...*
So, what would be the best way to do that with Pandas?
IIUC you can do df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)] ) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
3 1 1.0
4 1 NaN
5 1 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)] ) < 2).index]
Out[64]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
So this will filter out the groups that have more than 1 nan values
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With