I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:
import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=pd.np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)
data
Out[90]:
0 1 2 3 4
2014-01-01 33 31 82 3 26
2014-01-02 46 59 0 34 48
2014-01-03 71 2 56 67 54
2014-01-04 90 18 71 12 2
2014-01-05 71 53 5 56 65
2014-01-06 42 78 34 54 40
2014-01-07 80 5 76 12 90
2014-01-08 60 90 84 55 78
2014-01-09 33 11 66 90 8
2014-01-10 40 8 35 36 98
# test for q1 values (this works)
data.quantile(0.25)
Out[111]:
0 40.50
1 8.75
2 34.25
3 17.50
4 29.50
# break it by inserting row of nans
data.iloc[-1] = pd.np.NaN
data.quantile(0.25)
Out[115]:
0 42
1 11
2 34
3 12
4 26
The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.
med = data.median()
q1 = data[data<med].median()
q1
Out[119]:
0 37.5
1 8.0
2 19.5
3 12.0
4 17.0
It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).
I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.
I would like to use quntile to calculate the rolling q1/q3 values in my dataframe, however, this will not work with NaN's present. Can anyone provide a solution to this issue?
This is a variant of torch. quantile() that “ignores” NaN values, computing the quantiles q as if NaN values in input did not exist. If all values in a reduced row are NaN then the quantiles for that reduction will be NaN .
Yes. pd. quantile() will ignore NaN values when calculating the quantile.
Internally, quantile
uses numpy.percentile
over the non-null values. When you change the last row of data
to NaNs
you're essentially left with an array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.])
in the first column
Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.])
gives 42.
From the docstring:
Given a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. A weighted average of the two nearest neighbors is used if the normalized ranking does not match q exactly. The same as the median if
q=50
, the same as the minimum ifq=0
and the same as the maximum ifq=100
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With