NumPy percentile function different from MATLAB's percentile function

When I try to calculate the 75th percentile in MATLAB, I get a different value than I do in NumPy.

MATLAB:

>> x = [ 11.308 ;   7.2896;   7.548 ;  11.325 ;   5.7822;   9.6343;
     7.7117;   7.3341;  10.398 ;   6.9675;  10.607 ;  13.125 ;
     7.819 ;   8.649 ;   8.3106;  12.129 ;  12.406 ;  10.935 ;
    12.544 ;   8.177 ]

>> prctile(x, 75)

ans =

11.3165

Python + NumPy:

>>> import numpy as np

>>> x = np.array([ 11.308 ,   7.2896,   7.548 ,  11.325 ,   5.7822,   9.6343,
     7.7117,   7.3341,  10.398 ,   6.9675,  10.607 ,  13.125 ,
     7.819 ,   8.649 ,   8.3106,  12.129 ,  12.406 ,  10.935 ,
    12.544 ,   8.177 ])

>>> np.percentile(x, 75)
11.312249999999999

I've checked the answer with R too, and I'm getting NumPy's answer.

> x <- c(11.308 ,   7.2896,   7.548 ,  11.325 ,   5.7822,   9.6343,
+          7.7117,   7.3341,  10.398 ,   6.9675,  10.607 ,  13.125 ,
+          7.819 ,   8.649 ,   8.3106,  12.129 ,  12.406 ,  10.935 ,
+         12.544 ,   8.177)
> quantile(x, 0.75)
     75% 
11.31225

What is going on here? And is there any way to make Python & R's behavior mirror MATLAB's?

asked Jul 15 '14 17:07

2 Answers

MATLAB apparently uses midpoint interpolation by default. NumPy and R use linear interpolation by default:

In [182]: np.percentile(x, 75, interpolation='linear')
Out[182]: 11.312249999999999

In [183]: np.percentile(x, 75, interpolation='midpoint')
Out[183]: 11.3165
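A note for readers on newer NumPy: from NumPy 1.22 on, the `interpolation` keyword used above is named `method`. The sketch below is one hedged way to run the same comparison on either side of that rename. The helper name `percentile_compat` is hypothetical and not part of NumPy.

```python
import numpy as np

x = np.array([11.308, 7.2896, 7.548, 11.325, 5.7822, 9.6343,
              7.7117, 7.3341, 10.398, 6.9675, 10.607, 13.125,
              7.819, 8.649, 8.3106, 12.129, 12.406, 10.935,
              12.544, 8.177])

def percentile_compat(a, q, kind="linear"):
    """Hypothetical helper: call np.percentile with whichever keyword exists."""
    try:
        # Keyword name used by NumPy 1.22 and later.
        return np.percentile(a, q, method=kind)
    except TypeError:
        # Older NumPy only knows the original keyword name.
        return np.percentile(a, q, interpolation=kind)

print(percentile_compat(x, 75, "linear"))
print(percentile_compat(x, 75, "midpoint"))
```

The two printed values should correspond to the `linear` and `midpoint` results shown in the session above.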

To understand the difference between linear and midpoint, consider this simple example:

In [187]: np.percentile([0, 100], 75, interpolation='linear')
Out[187]: 75.0

In [188]: np.percentile([0, 100], 75, interpolation='midpoint')
Out[188]: 50.0
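To make the arithmetic behind those two results explicit, here is a minimal sketch that computes both variants by hand. It assumes the fractional-index rule `(n - 1) * p / 100` that NumPy documents for its default method. The function name `linear_and_midpoint` is made up for illustration.

```python
import math
import numpy as np

def linear_and_midpoint(values, p):
    """Return (linear, midpoint) percentiles computed by hand."""
    y = np.sort(np.asarray(values, dtype=float))
    # Fractional position of the requested percentile in the sorted data.
    pos = (len(y) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    # 'linear' weights the two neighbours by the fractional part.
    linear = y[lo] + frac * (y[hi] - y[lo])
    # 'midpoint' ignores the fractional part and averages the neighbours.
    midpoint = (y[lo] + y[hi]) / 2.0
    return linear, midpoint

print(linear_and_midpoint([0, 100], 75))
```

For the 20-element array in the question, the position is 19 * 0.75 = 14.25, so the two neighbours are 11.308 and 11.325. Linear interpolation moves a quarter of the way between them, while midpoint takes their average.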

If your installed NumPy is too old to accept the interpolation parameter, you can compile the latest version of NumPy from source (using Ubuntu):

mkdir -p $HOME/src
cd $HOME/src
git clone https://github.com/numpy/numpy.git
cd numpy
git remote add upstream https://github.com/numpy/numpy.git
# Read ~/src/numpy/INSTALL.txt
sudo apt-get install libatlas-base-dev libatlas3gf-base
python setup.py build --fcompiler=gnu95
python setup.py install

The advantage of using git instead of pip is that it is super easy to upgrade (or downgrade) to other versions of NumPy (and you get the source code too):

cd ~/src/numpy
git fetch upstream
git checkout master # or checkout any other version of NumPy
/bin/rm -rf build
cdsitepackages    # assuming you are using virtualenv; otherwise cd to your local python sitepackages directory
/bin/rm -rf numpy numpy-*-py2.7.egg-info
cd ~/src/numpy
python setup.py build --fcompiler=gnu95
python setup.py install
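After a rebuild like this, it may be worth confirming which NumPy the interpreter actually imports, since a stale copy in site-packages can shadow the new one. A small check, with `describe_numpy` as a hypothetical helper name:

```python
import numpy as np

def describe_numpy():
    """Return the version string and install location of the imported NumPy."""
    return np.__version__, np.__file__

version, location = describe_numpy()
print("NumPy version:", version)
print("Loaded from:  ", location)
```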

answered Oct 05 '22 18:10

unutbu

Since the accepted answer is still incomplete even after @cpaulik's comment, I'm posting what is hopefully a more complete answer (although, for brevity, not a perfect one; see below).

Using np.percentile(x, p, interpolation='midpoint') only gives the same answer as MATLAB for very specific values, namely when p/100 is a multiple of 1/n, where n is the number of elements in the array. In the original question this was indeed the case, since n=20 and p=75, but in general the two functions differ.

A short emulation of Matlab's prctile function is given by:

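import numpy as np

def quantile(x, q):
    # Sorted sample i (0-based) sits at plotting position (i + 0.5) / n.
    n = len(x)
    y = np.sort(x)
    # Float arithmetic, so this also works under Python 2 integer division.
    positions = (np.arange(n) + 0.5) / n
    return np.interp(q, positions, y)

def prctile(x, p):
    # p is given in percent (0 to 100).
    return quantile(x, np.asarray(p) / 100.0)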

Like MATLAB's, this function is piecewise linear in p and spans from min(x) to max(x). NumPy's percentile function with interpolation='midpoint', by contrast, is piecewise constant in p, stepping between the average of the two smallest elements and the average of the two largest ones. Plotting the two functions for the array in the original question gives the picture in this link (sorry, can't embed it). The dashed red line marks the 75th percentile, where the two functions actually coincide.
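Since the plot is not embedded, the comparison can be reproduced numerically. The sketch below is self-contained and uses two illustrative helpers that are not library functions. `matlab_style_prctile` repeats the emulation above, and `midpoint_percentile` computes the midpoint rule by hand so that it does not depend on the NumPy version.

```python
import numpy as np

x = np.array([11.308, 7.2896, 7.548, 11.325, 5.7822, 9.6343,
              7.7117, 7.3341, 10.398, 6.9675, 10.607, 13.125,
              7.819, 8.649, 8.3106, 12.129, 12.406, 10.935,
              12.544, 8.177])

def matlab_style_prctile(values, p):
    """Piecewise-linear rule: sorted sample i sits at (i + 0.5) / n."""
    y = np.sort(np.asarray(values, dtype=float))
    n = len(y)
    positions = (np.arange(n) + 0.5) / n
    return np.interp(np.asarray(p) / 100.0, positions, y)

def midpoint_percentile(values, p):
    """Midpoint rule: average the two samples around index (n - 1) * p / 100."""
    y = np.sort(np.asarray(values, dtype=float))
    pos = (len(y) - 1) * np.asarray(p) / 100.0
    lo = np.floor(pos).astype(int)
    hi = np.ceil(pos).astype(int)
    return (y[lo] + y[hi]) / 2.0

for p in (1, 72, 75, 99):
    print(p, matlab_style_prctile(x, p), midpoint_percentile(x, p))
```

For this array the two rules agree at p=75 (11.3165) but differ at p=72 (about 11.2707 versus 11.1215). If you are on NumPy 1.22 or later, `np.percentile(x, p, method='hazen')` appears to use the same plotting positions as the emulation; treat that as something to verify against MATLAB rather than a guarantee.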

P.S. This function is not fully equivalent to MATLAB's because it only accepts a one-dimensional x and raises an error for higher-dimensional input. MATLAB's prctile accepts a higher-dimensional x and operates along the first non-singleton dimension; implementing that correctly would take a bit longer. However, both this function and MATLAB's should work correctly with higher-dimensional inputs for p / q, since np.interp takes care of that.
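One possible way to extend the emulation to higher-dimensional x is sketched below using `np.apply_along_axis`. This is only an approximation of MATLAB's behaviour under the stated assumption that the default axis is the first one whose length is not 1. The names `prctile_1d` and `prctile_nd` are hypothetical, and edge cases such as empty arrays or NaN values are not handled.

```python
import numpy as np

def prctile_1d(values, p):
    """One-dimensional emulation: sorted sample i sits at (i + 0.5) / n."""
    y = np.sort(np.asarray(values, dtype=float))
    n = len(y)
    positions = (np.arange(n) + 0.5) / n
    return np.interp(np.asarray(p) / 100.0, positions, y)

def prctile_nd(values, p, axis=None):
    """Apply prctile_1d along one axis of an n-dimensional array."""
    a = np.asarray(values, dtype=float)
    if axis is None:
        # Assumption: mimic MATLAB by picking the first non-singleton axis.
        axis = next((i for i, size in enumerate(a.shape) if size != 1), 0)
    return np.apply_along_axis(prctile_1d, axis, a, p)

columns = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])
print(prctile_nd(columns, 50))   # one median per column
```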

answered Oct 05 '22 17:10

Marco Spinaci


Tags:

python

r

numpy

matlab

percentile

James
