I have a set of data such as the following:
annual_exp <- data.frame(annual_exp = c(6000,4200,240001,750,20000,3470,10500,2400,2280,36000,3600,20000,2000,12000,1200,3000,4500,64000))
I want to create a new variable, call it "quintile", which assigns each observation an integer between 1 and 5, inclusive, depending on which quintile of annual expenditure (annual_exp) it falls into. So there should be an equal number of 1s through 5s.
My attempt so far has been to do the following:
test<-quantile(annual_exp$annual_exp, probs= seq(0,1,0.2), na.rm=TRUE)
summary(test)
test
breaks<-c(test[1],test[2],test[3],test[4],test[5],test[6])
quantiles<-cut(annual_exp$annual_exp, breaks, labels=c("1","2","3","4","5"), include.lowest=TRUE, right=TRUE)
quantiles<-as.data.frame(quantiles)
quantiles<-cbind(annual_exp, quantiles)
The problem (which doesn't really show with such a small sample as the one created in this example) is that the number of people falling into each quantile this way varies wildly. This is because I have used the function "quantile" above.
As such, I am looking for an alternative to the "quantile" part of the equation, which will split the sample up into 5 equal groups of quintiles based on their annual expenditure.
Any help on this would be much appreciated!
Here is a solution using the data.table package, which is probably the fastest solution (a big concern if you're dealing with large data sets):
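library(data.table)
# Convert the existing data.frame `data` to a data.table by reference
setDT(data)
# Cut varIncome at its decile boundaries and label the bins 1 to 10
data[, newVarDecile := cut(varIncome,
                           breaks = quantile(varIncome,
                                             probs = seq(0, 1, by = 0.1), na.rm = TRUE),
                           include.lowest = TRUE, labels = 1:10)]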
If you want to compute the bins separately for different subgroups, you just need to include the by argument:
data[, newVarQuintiles := cut(varIncome,
                              breaks = quantile(varIncome,
                                                probs = seq(0, 1, by = 0.2), na.rm = TRUE),
                              include.lowest = TRUE, labels = 1:5),
     by = groupVar]
PS. Note that in this second example we've computed income quintiles by changing the probs and labels arguments.
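One caveat that may be worth checking on your own data: this still relies on the same cut() plus quantile() combination as in the question, so heavily tied values can still produce uneven groups. Below is a minimal base R sketch with made-up values (the vector x is purely illustrative, not from the question) that shows the effect:

```r
# Hypothetical data: 20 values, five of which are tied at 1000
x <- c(100, 200, rep(1000, 5), seq(2000, 14000, by = 1000))

# Same approach as above: cut at the quintile boundaries
breaks <- quantile(x, probs = seq(0, 1, by = 0.2), na.rm = TRUE)
bins <- cut(x, breaks = breaks, include.lowest = TRUE, labels = 1:5)

# Count the observations per bin; tied values always share a bin
counts <- table(bins)
print(counts)
```

Here the first group absorbs all five tied values, so the counts are far from equal. If ties are extreme enough that two quantile boundaries coincide, cut() stops with an error because the breaks are not unique.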
ggplot2 has a nice utility function, cut_number(), which does just what you want.
library(ggplot2)
as.numeric(cut_number(annual_exp[[1]], n = 5))
# [1] 3 3 5 1 4 2 4 2 1 5 3 4 1 4 1 2 3 5
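If you would rather avoid loading a package, a rank-based approach in base R is another option. This is only a sketch: equal_bins() is a hypothetical helper name, not an existing function, and it assumes you are happy for tied values to be split across neighbouring groups so that group sizes differ by at most one.

```r
# Data from the question, with an explicitly named column
annual_exp <- data.frame(annual_exp = c(6000, 4200, 240001, 750, 20000, 3470,
                                        10500, 2400, 2280, 36000, 3600, 20000,
                                        2000, 12000, 1200, 3000, 4500, 64000))

# Hypothetical helper: assign each value to one of n near-equal groups by rank.
# ties.method = "first" breaks ties by order of appearance; NAs stay NA.
equal_bins <- function(x, n = 5) {
  r <- rank(x, ties.method = "first", na.last = "keep")
  as.integer(ceiling(n * r / sum(!is.na(x))))
}

annual_exp$quintile <- equal_bins(annual_exp$annual_exp)

# Group sizes differ by at most one
sizes <- table(annual_exp$quintile)
print(sizes)
```

The trade-off is that identical values can land in different groups (here the two observations of 20000 end up in groups 4 and 5), whereas the quantile-based cut keeps ties together at the cost of uneven group sizes.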