I am reading in a bunch of CSV files (measurement data for water levels over time) to do various analysis and visualizations on them.
Due to various reasons beyond my control, these time series often have missing data, so I do two things:
I count them in total with
Rlength = len(RainD) # Counts everything, including NaN
Rcount = RainD.count() # Counts only valid numbers
NaN_Number = Rlength - Rcount
and discard the dataset if I have more missing data than a certain threshold:
Percent_Data = Rlength/100
Five_Percent = Percent_Data*5
if NaN_Number > Five_Percent:
...
If the number of NaN is sufficiently small, I would like to fill the gaps with
RainD.level = RainD.level.fillna(method='pad', limit=2)
And now for the issue: It's monthly data, so if I have more than two consecutive NaNs, I also want to discard the data, since that would mean that I "guess" a whole season, or even more.
The documentation for fillna
doesn't really mention what happens when there is more consecutive NaNs than my specified limit=2
, but when I look at RainD.describe()
before and after ...fillna...
and compare it with the base CSV, it's clear that it fills the first two NaNs, and then leaves the rest as it is, instead of erroring out.
So, long story short:
How do I identify a number of consecutive NaNs with Pandas, without some complicated and time consuming non-Pandas loop?
You can use multiple boolean conditions to test if the current value and previous value are NaN
:
In [3]:
df = pd.DataFrame({'a':[1,3,np.NaN, np.NaN, 4, np.NaN, 6,7,8]})
df
Out[3]:
a
0 1
1 3
2 NaN
3 NaN
4 4
5 NaN
6 6
7 7
8 8
In [6]:
df[(df.a.isnull()) & (df.a.shift().isnull())]
Out[6]:
a
3 NaN
If you wanted to find where consecutive NaNs
occur where you are looking for more than 2 you could do the following:
In [38]:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df
Out[38]:
a
0 1
1 2
2 NaN
3 NaN
4 NaN
5 6
6 7
7 8
8 9
9 10
10 NaN
11 NaN
12 13
13 14
In [41]:
df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
Out[41]:
a
1 0
2 3
3 0
4 0
5 0
6 0
7 2
8 0
9 0
Name: a, dtype: int32
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With