I'm trying to replicate some Matlab code in Python. I could not find an exact equivalent to the Matlab function quantile. The closest thing I found is SciPy's mquantiles.
Matlab example:
quantile( [ 8.60789925e-05, 1.98989354e-05 , 1.68308882e-04, 1.69379370e-04], 0.8)
...gives: 0.00016905
Same example in python:
scipy.stats.mstats.mquantiles( [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04], 0.8)
...gives 0.00016912
Does anyone know how to exactly replicate Matlab's quantile function?
In Python, the numpy.quantile() function takes an array and a number q between 0 and 1, and returns the q-th quantile of the array.
Q = quantile( A , p ) returns quantiles of elements in input data A for the cumulative probability or probabilities p in the interval [0,1]. If A is a vector, then Q is a scalar or a vector with the same length as p . Q(i) contains the p(i) quantile.
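The two descriptions above sound identical, but the functions interpolate between data points differently by default. As a minimal sketch, assuming NumPy 1.22 or later (where `np.quantile` accepts a `method` keyword), the `"hazen"` method places the sorted samples at the (k - 0.5)/n positions, which is the same rule Matlab's documentation describes:

```python
import numpy as np

data = [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04]

# Default method ("linear"): sorted samples sit at positions (k - 1)/(n - 1).
q_default = np.quantile(data, 0.8)

# Hazen method: sorted samples sit at positions (k - 0.5)/n.
# Assumes NumPy >= 1.22; older releases do not have the `method` keyword.
q_hazen = np.quantile(data, 0.8, method="hazen")

print(q_default)
print(q_hazen)
```

On the question's data the Hazen result should agree with the Matlab value to within floating-point rounding, while the default method lands on a visibly different number.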
The documentation for quantile (under the More About => Algorithms section) gives the exact algorithm used. Here's some Python code that does it for a single quantile of a flat array, using bottleneck to do the partial sorting:
import numpy as np
import bottleneck as bn

def quantile(a, prob):
    """
    Estimates the prob'th quantile of the values in a data array.
    Uses the algorithm of matlab's quantile(), namely:
    - Remove any nan values
    - Take the sorted data as the (.5/n), (1.5/n), ..., (1-.5/n) quantiles.
    - Use linear interpolation for values between (.5/n) and (1 - .5/n).
    - Use the minimum or maximum for quantiles outside that range.
    See also: scipy.stats.mstats.mquantiles
    """
    a = np.asanyarray(a)
    a = a[np.logical_not(np.isnan(a))].ravel()
    n = a.size
    if prob >= 1 - .5/n:
        return a.max()
    elif prob <= .5 / n:
        return a.min()
    # find the two bounds we're interpolating between:
    # that is, find i such that (i+.5) / n <= prob <= (i+1.5)/n
    t = n * prob - .5
    i = int(np.floor(t))  # must be an int to be usable as an index
    # partial sort so that the ith element is at position i, with bigger ones
    # to the right and smaller to the left
    # (bn.partition in current Bottleneck; older releases called it partsort)
    a = bn.partition(a, i)
    if i == t:  # did we luck out and get an integer index?
        return a[i]
    else:
        # we'll linearly interpolate between this and the next index
        smaller = a[i]
        larger = a[i+1:].min()
        if np.isinf(smaller):
            return smaller  # avoid inf - inf
        return smaller + (larger - smaller) * (t - i)
I only did the single-quantile, 1d case because that's all I needed. If you want several quantiles, it's probably worth just doing the full sort. If you want to do it per-axis and know you don't have any nans, all you should need to do is add an axis argument to the sort and vectorize the linear interpolation bit. Doing it per-axis with nans would be a little trickier.
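For the several-quantiles case, here is a rough sketch of the full-sort approach for a flat array. The name `quantiles_sorted` is made up for illustration, and using `np.interp` for the interpolation step is my own choice rather than anything from Matlab's documentation. `np.interp` clamps to the first and last values outside the given positions, which covers the minimum/maximum rule. Unlike the function above, this sketch does not special-case infinite values.

```python
import numpy as np

def quantiles_sorted(a, probs):
    """Several quantiles of a flat array using the (k - 0.5)/n positions.

    Hypothetical helper: full sort, then linear interpolation.
    """
    a = np.asarray(a, dtype=float).ravel()
    a = np.sort(a[~np.isnan(a)])          # drop nans, then sort everything
    n = a.size
    if n == 0:
        return np.full(np.shape(probs), np.nan)
    # sorted values are treated as the .5/n, 1.5/n, ..., (1 - .5/n) quantiles
    positions = (np.arange(n) + 0.5) / n
    # np.interp returns a[0] below positions[0] and a[-1] above positions[-1]
    return np.interp(probs, positions, a)

data = [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04]
result = quantiles_sorted(data, [0.0, 0.8, 1.0])
print(result)
```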
This code gives:
>>> quantile([ 8.60789925e-05, 1.98989354e-05 , 1.68308882e-04, 1.69379370e-04], 0.8)
0.00016905822360000001
and the Matlab code gave 0.00016905822359999999; the difference is 3e-20 (which is less than machine precision).
Your input vector has only 4 values, which is far too few to get a good approximation of the quantiles of the underlying distribution. The discrepancy is probably the result of Matlab and SciPy using different heuristics to compute quantiles on undersampled distributions.
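To make the "different heuristics" concrete, here is a small pure-NumPy sketch of the plotting-position formula that the mquantiles documentation describes, parameterised by `alphap` and `betap`. The function name `plotting_position_quantile` is hypothetical, and the sketch assumes a flat, nan-free input with at least two values. mquantiles defaults to `alphap = betap = 0.4`, whereas the (k - 0.5)/n rule corresponds to `alphap = betap = 0.5`.

```python
import numpy as np

def plotting_position_quantile(data, prob, alphap, betap):
    """Quantile estimate from the (alphap, betap) plotting-position family.

    Hypothetical helper; assumes a flat, nan-free input with n >= 2.
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    m = alphap + prob * (1.0 - alphap - betap)
    aleph = n * prob + m                          # 1-based fractional position
    k = int(np.floor(np.clip(aleph, 1, n - 1)))   # lower neighbour, 1-based
    gamma = float(np.clip(aleph - k, 0.0, 1.0))   # interpolation weight
    return (1.0 - gamma) * x[k - 1] + gamma * x[k]

data = [8.60789925e-05, 1.98989354e-05, 1.68308882e-04, 1.69379370e-04]

q_04 = plotting_position_quantile(data, 0.8, 0.4, 0.4)  # mquantiles defaults
q_05 = plotting_position_quantile(data, 0.8, 0.5, 0.5)  # (k - 0.5)/n rule

print(q_04)
print(q_05)
```

With only four samples the two conventions interpolate at noticeably different points between the third and fourth sorted values, which accounts for the gap seen in the question. If this reading is right, passing `alphap=0.5, betap=0.5` to mquantiles itself should give the Matlab-style result.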