Pandas quantile failing with NaN's present

Tags:

pandas

python-2.7

quantile

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:

import numpy as np
import pandas as pd

# 10 daily rows x 5 columns of random integers in [0, 100)
index = pd.date_range('2014-01-01', periods=10, freq='D')
data = np.random.randint(0, 100, (10, 5))
data = pd.DataFrame(index=index, data=data)

data
Out[90]: 
             0   1   2   3   4
2014-01-01  33  31  82   3  26
2014-01-02  46  59   0  34  48
2014-01-03  71   2  56  67  54
2014-01-04  90  18  71  12   2
2014-01-05  71  53   5  56  65
2014-01-06  42  78  34  54  40
2014-01-07  80   5  76  12  90
2014-01-08  60  90  84  55  78
2014-01-09  33  11  66  90   8
2014-01-10  40   8  35  36  98

# test for q1 values (this works)
data.quantile(0.25)
Out[111]: 
0    40.50
1     8.75
2    34.25
3    17.50
4    29.50

# break it by inserting row of nans
data.iloc[-1] = float('nan')

data.quantile(0.25)
Out[115]: 
0    42
1    11
2    34
3    12
4    26

The first quartile can be calculated by taking the median of the values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded, e.g.:

med = data.median()
q1  = data[data<med].median()
q1
Out[119]: 
0    37.5
1     8.0
2    19.5
3    12.0
4    17.0

It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).

I thought this might not be a "NaN" issue, but rather quantile failing to handle data sets with an even number of rows (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing dataframes with both even and odd numbers of rows, I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.

I would like to use quantile to calculate the rolling q1/q3 values in my dataframe; however, this will not work with NaNs present. Can anyone provide a solution to this issue?

asked Jun 04 '14 19:06

tnknepp

1 Answer

Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs, you're essentially left with the array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column.

Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]), 25) gives 42.
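
As a quick check of that figure, the sketch below rebuilds the first column from the question and compares numpy.percentile on the non-null values with Series.quantile. The variable names (col, non_null, q1_numpy, q1_pandas) are only for illustration. Note that numpy.percentile expects q in percent (25), while quantile takes a fraction (0.25).

```python
import numpy as np
import pandas as pd

# First column of the question's frame after the last row was set to NaN.
col = pd.Series([33, 46, 71, 90, 71, 42, 80, 60, 33, np.nan])

# Drop the NaN by hand and ask numpy for the 25th percentile.
non_null = col.dropna().to_numpy()
q1_numpy = np.percentile(non_null, 25)

# pandas skips the NaN on its own and should agree with numpy.
q1_pandas = col.quantile(0.25)

print(q1_numpy, q1_pandas)
```

Both values should come out as 42.0, which is the first entry of the Out[115] result in the question.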

From the docstring:

Given a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. A weighted average of the two nearest neighbors is used if the normalized ranking does not match q exactly. The same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.
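
To make the docstring's rule concrete, here is a minimal pure-Python sketch of linear interpolation between the two nearest ranks, next to the "median of the values below the median" approach from the question. The helper names linear_quantile and lower_half_median are hypothetical and not part of pandas or numpy; the sketch assumes the fractional rank is q * (N - 1), which reproduces the numbers shown in the question for the first column.

```python
import math
import statistics


def linear_quantile(values, q):
    """Quantile for 0 <= q <= 1, interpolating linearly between ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)   # fractional rank in the sorted copy
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo                # weight given to the upper neighbor
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def lower_half_median(values):
    """Median of the values strictly below the overall median."""
    med = statistics.median(values)
    return statistics.median([v for v in values if v < med])


# First column of the question's frame, before and after losing the last row.
ten_rows = [33, 46, 71, 90, 71, 42, 80, 60, 33, 40]
nine_rows = ten_rows[:-1]

print(linear_quantile(ten_rows, 0.25))    # rank 2.25, between 40 and 42
print(linear_quantile(nine_rows, 0.25))   # rank 2.0, lands exactly on 42
print(lower_half_median(nine_rows))       # the question's alternative q1
```

With nine values the fractional rank is exactly 2, so no interpolation happens and the result is the third-smallest value, 42. The lower-half-median approach gives 37.5 for the same nine values, so the two results differ because they are two different quartile definitions, not because the NaN row was mishandled.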

answered Sep 18 '22 00:09

TomAugspurger
