Eliminating all data over a given percentile

I have a pandas DataFrame called data with a column called ms. I want to eliminate all the rows where data.ms is above the 95th percentile. For now, I'm doing this:

limit = data.ms.describe(90)['95%']
valid_data = data[data['ms'] < limit]

which works, but I want to generalize that to any percentile. What's the best way to do that?

Roy Smith asked Sep 02 '13 20:09



1 Answer

Use the Series.quantile() method:

In [48]: cols = list('abc')

In [49]: df = DataFrame(randn(10, len(cols)), columns=cols)

In [50]: df.a.quantile(0.95)
Out[50]: 1.5776961953820687

To filter out rows of df where df.a is greater than or equal to the 95th percentile do:

In [72]: df[df.a < df.a.quantile(.95)]
Out[72]:
       a      b      c
0 -1.044 -0.247 -1.149
2  0.395  0.591  0.764
3 -0.564 -2.059  0.232
4 -0.707 -0.736 -1.345
5  0.978 -0.099  0.521
6 -0.974  0.272 -0.649
7  1.228  0.619 -0.849
8 -0.170  0.458 -0.515
9  1.465  1.019  0.966
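
To generalize this to any percentile, as the question asks, the same filter can be wrapped in a small helper. This is only a minimal sketch: the function name drop_above_percentile and the sample values are made up for illustration, and it assumes the percentile is given on a 0-100 scale and converted to the 0-1 fraction that Series.quantile() expects.

```python
import pandas as pd


def drop_above_percentile(frame, column, pct):
    """Keep only the rows whose `column` value is below the pct-th percentile.

    Hypothetical helper for illustration; `pct` is on a 0-100 scale.
    """
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    # Series.quantile() takes a fraction in [0, 1], so convert from percent.
    limit = frame[column].quantile(pct / 100)
    # Strict comparison: rows equal to the limit are dropped as well.
    return frame[frame[column] < limit]


# Made-up sample data: the integers 1..100 in a column named "ms".
data = pd.DataFrame({"ms": range(1, 101)})
valid_data = drop_above_percentile(data, "ms", 95)
```

Use <= instead of < in the last line of the helper if rows exactly at the limit should be kept.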
Phillip Cloud answered Sep 20 '22 06:09