A quick one for you, dearest R gurus:
I'm doing an assignment and, in this exercise, I've been asked to get basic statistics out of the infert dataset (it's built in), and specifically one of its columns, infert$age.
For anyone not familiar with the dataset:
> table_ages # Which is just subset(infert, select=c("age"));
age
1 26
2 42
3 39
4 34
5 35
6 36
7 23
8 32
9 21
10 28
11 29
...
246 35
247 29
248 23
I've had to find the median, variance, skewness, and standard deviation of the column, which were all okay, until I was asked to find the column's "percentiles".
I haven't been able to find anything so far, and maybe I've translated it incorrectly from Greek, the language of the assignment. The term was "ποσοστημόρια", and Google Translate gave the English term as "percentiles".
Any tutorials or ideas on finding those "percentiles" of infert$age?
If you order a vector x and find the value that is halfway through the vector, you have found the median, or 50th percentile. The same logic applies to any percentage. Here are two examples.
x <- rnorm(100)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)) # quartiles
quantile(x, probs = seq(0, 1, by = 0.1)) # deciles
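Applied to the column from the question, the same call might look like the sketch below. The variable names (ages, age_quartiles, age_percentiles) are only illustrative, and the choice of the 10th, 50th and 90th percentiles is an assumption about what the assignment wants.

```r
# infert ships with base R (package 'datasets'), so nothing needs installing.
ages <- infert$age

# With no probs argument, quantile() returns the 0%, 25%, 50%, 75% and 100% points.
age_quartiles <- quantile(ages)

# Selected percentiles: here the 10th, 50th and 90th (an illustrative choice).
age_percentiles <- quantile(ages, probs = c(0.10, 0.50, 0.90))

print(age_quartiles)
print(age_percentiles)
```

The 50% entry should agree with median(infert$age), which is a quick sanity check on the result.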
The quantile() function will do much of what you probably want, but since the question was ambiguous, I will provide an alternate answer that does something slightly different from quantile().
ecdf(infert$age)(infert$age)
will generate a vector of the same length as infert$age giving the proportion of infert$age that is at or below each observation. You can read the ecdf documentation, but the basic idea is that ecdf() will give you a function that returns the empirical cumulative distribution. Thus ecdf(X)(Y) is the value of the cumulative distribution of X at the points in Y. If you wanted to know just the probability of being at or below 30 (thus what percentile 30 is in the sample), you could say
ecdf(infert$age)(30)
The main difference between this approach and using the quantile() function is that quantile() requires that you put in the probabilities to get out the levels, whereas ecdf() requires that you put in the levels to get out the probabilities.
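To make the two directions concrete, here is a small sketch that runs both on infert$age. The names p_at_30, p_by_hand and age_at_p30 are made up for illustration. The hand computation with mean() is only there to show what the ECDF value means.

```r
ages <- infert$age

# Level -> probability: share of the sample aged 30 or younger.
p_at_30 <- ecdf(ages)(30)

# The same share computed by hand, to make the definition concrete.
p_by_hand <- mean(ages <= 30)

# Probability -> level: the age sitting at the 30th percentile.
age_at_p30 <- quantile(ages, probs = 0.30)

print(c(ecdf = p_at_30, by_hand = p_by_hand))
print(age_at_p30)
```

Note that the number 30 plays a different role in each call: an age in ecdf(), and a percentage (written as 0.30) in quantile().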
Using {dplyr}:
library(dplyr)
# percentiles
infert %>%
  mutate(PCT = ntile(age, 100))
# quartiles
infert %>%
  mutate(PCT = ntile(age, 4))
# deciles
infert %>%
  mutate(PCT = ntile(age, 10))
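One thing worth knowing is that ntile() labels each row with a bucket number; it does not return the cut-off ages themselves. If the assignment is after the percentile values, a possible sketch (assuming the quartiles are what is wanted) is to call quantile() inside summarise(). The column names p25, p50 and p75 are illustrative.

```r
library(dplyr)

# One row with the cut-off ages, instead of a bucket label per observation.
age_summary <- infert %>%
  summarise(
    p25 = quantile(age, 0.25),
    p50 = quantile(age, 0.50),
    p75 = quantile(age, 0.75)
  )

print(age_summary)
```

Which of the two you want depends on the question: ntile() answers "which group is this row in?", while quantile() answers "where are the group boundaries?".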