For a given dataset in a data frame, when I apply the describe
function, I get the basic stats which include min, max, 25%, 50% etc.
For example:
import pandas as pd
data_1 = pd.DataFrame({'One': [4, 6, 8, 10]}, columns=['One'])
data_1.describe()
The output is:
             One
count   4.000000
mean    7.000000
std     2.581989
min     4.000000
25%     5.500000
50%     7.000000
75%     8.500000
max    10.000000
My question is: what is the mathematical formula used to calculate the 25% value?
1) Based on what I know, it is:
position = percentile * n (where n is the number of values)
In this case:
25/100 * 4 = 1
So the value at the first position is 4, but according to the describe function it is 5.5.
2) Another example says that if you get a whole number, you take the average of 4 and 6, which would be 5. That still does not match the 5.5 given by describe.
3) Another tutorial says to take the difference between the two numbers, multiply it by 25%, and add the result to the lower number:
25/100 * (6-4) = 1/4*2 = 0.5
Adding that to the lower number: 4 + 0.5 = 4.5
Still not getting 5.5.
Can someone please clarify?
Pandas DataFrame describe() method: mean - the average (mean) value; std - the standard deviation; min - the minimum value; 25% - the 25th percentile.
For numeric data, the result's index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For example:
s = pd.Series([1, 2, 3, 1])
s.describe()
will give:
count    4.000000
mean     1.750000
std      0.957427
min      1.000000
25%      1.000000
50%      1.500000
75%      2.250000
max      3.000000
Here 25% is 1.0, which means that at least 25% of your data have a value of 1.0 or below if you inspect the values manually.
In the pandas documentation there is information about the computation of quantiles, where a reference to numpy.percentile is made:
Return value at the given quantile, a la numpy.percentile.
Then, checking the numpy.percentile documentation, we can see that the interpolation method is linear by default:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j
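For your specific case, the values are sorted and indexed from 0, and the 25th percentile sits at index 0.25 * (n - 1) = 0.25 * 3 = 0.75. That index falls between i = 4 (index 0) and j = 6 (index 1), with fraction = 0.75, so:
res_25 = 4 + (6-4)*(3/4) = 5.5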
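For the 75th percentile, the index is 0.75 * (4 - 1) = 2.25, which falls between i = 8 (index 2) and j = 10 (index 3), with fraction = 0.25, so:
res_75 = 8 + (10-8)*(1/4) = 8.5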
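To check the arithmetic above by hand, here is a minimal pure-Python sketch of the linear interpolation rule. The function name linear_percentile and the variable names are made up for illustration; this is not the actual pandas or NumPy implementation, only the formula i + (j - i) * fraction applied to sorted data.

```python
import math

def linear_percentile(values, q):
    """Percentile with linear interpolation; q is given in [0, 100].

    Hypothetical helper for illustration, not a pandas/NumPy function.
    """
    data = sorted(values)
    # Fractional, 0-based position of the percentile in the sorted data.
    pos = (q / 100) * (len(data) - 1)
    lo = math.floor(pos)   # index of i, the value just below the position
    hi = math.ceil(pos)    # index of j, the value just above the position
    fraction = pos - lo    # fractional part of the position
    # i + (j - i) * fraction
    return data[lo] + (data[hi] - data[lo]) * fraction

sample = [4, 6, 8, 10]
q25 = linear_percentile(sample, 25)
q50 = linear_percentile(sample, 50)
q75 = linear_percentile(sample, 75)
print(q25, q50, q75)  # 5.5 7.0 8.5
```

These match the 25%, 50% and 75% rows that describe() printed for data_1.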
If you set the interpolation method to "midpoint", the 25th percentile becomes (4 + 6) / 2 = 5, which is the result you expected from your second approach.
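As a rough way to compare the options, the sketch below calls Series.quantile with its interpolation parameter on the same data as in the question. The dictionary name results is just for illustration, and the values in the comment assume a reasonably recent pandas version.

```python
import pandas as pd

data_1 = pd.DataFrame({'One': [4, 6, 8, 10]})
col = data_1['One']

# 25th percentile under several interpolation options of Series.quantile.
results = {
    method: float(col.quantile(0.25, interpolation=method))
    for method in ('linear', 'lower', 'higher', 'midpoint')
}
print(results)
# {'linear': 5.5, 'lower': 4.0, 'higher': 6.0, 'midpoint': 5.0}
```

The 'linear' entry is what describe() reports, 'lower' corresponds to the first approach in the question, and 'midpoint' to the second.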