Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas quantile failing with NaN's present

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:

import pandas as pd
index=pd.date_range('2014 01 01',periods=10,freq='D')
data=pd.np.random.randint(0,100,(10,5))
data = pd.DataFrame(index=index,data=data)

data
Out[90]: 
             0   1   2   3   4
2014-01-01  33  31  82   3  26
2014-01-02  46  59   0  34  48
2014-01-03  71   2  56  67  54
2014-01-04  90  18  71  12   2
2014-01-05  71  53   5  56  65
2014-01-06  42  78  34  54  40
2014-01-07  80   5  76  12  90
2014-01-08  60  90  84  55  78
2014-01-09  33  11  66  90   8
2014-01-10  40   8  35  36  98

# test for q1 values (this works)
data.quantile(0.25)
Out[111]: 
0    40.50
1     8.75
2    34.25
3    17.50
4    29.50

# break it by inserting row of nans
data.iloc[-1] = pd.np.NaN

data.quantile(0.25)
Out[115]: 
0    42
1    11
2    34
3    12
4    26

The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.

med = data.median()
q1  = data[data<med].median()
q1
Out[119]: 
0    37.5
1     8.0
2    19.5
3    12.0
4    17.0

It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).

I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.

I would like to use quntile to calculate the rolling q1/q3 values in my dataframe, however, this will not work with NaN's present. Can anyone provide a solution to this issue?

like image 539
tnknepp Avatar asked Jun 04 '14 19:06

tnknepp


People also ask

Does quantile ignore NaN?

This is a variant of torch. quantile() that “ignores” NaN values, computing the quantiles q as if NaN values in input did not exist. If all values in a reduced row are NaN then the quantiles for that reduction will be NaN .

Does pandas quantile ignore NaN?

Yes. pd. quantile() will ignore NaN values when calculating the quantile.


1 Answers

Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs you're essentially left with an array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column

Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) gives 42.

From the docstring:

Given a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. A weighted average of the two nearest neighbors is used if the normalized ranking does not match q exactly. The same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.

like image 75
TomAugspurger Avatar answered Sep 18 '22 00:09

TomAugspurger