Assigning individuals to an income quantile

Tags:

r

I have a set of data such as the following:

# Name the column explicitly so that annual_exp$annual_exp works below
annual_exp <- data.frame(annual_exp = c(6000, 4200, 240001, 750, 20000, 3470, 10500, 2400, 2280, 36000, 3600, 20000, 2000, 12000, 1200, 3000, 4500, 64000))

I want to create a new variable, call it "quintile", which assigns each observation an integer between 1 and 5, inclusive, depending on which quintile of annual expenditure (annual_exp) it falls into. There should be an equal number of observations in each group, 1 through 5.

My attempt so far has been to do the following:

test<-quantile(annual_exp$annual_exp, probs= seq(0,1,0.2), na.rm=TRUE)
summary(test)
test

breaks<-c(test[1],test[2],test[3],test[4],test[5],test[6])
quantiles<-cut(annual_exp$annual_exp, breaks, labels=c("1","2","3","4","5"), include.lowest=TRUE, right=TRUE)
quantiles<-as.data.frame(quantiles)
quantiles<-cbind(annual_exp, quantiles)

The problem (which doesn't really show with such a small sample as the one created in this example) is that the number of people falling into each quintile varies wildly when I do this. This is because I have used the function "quantile" above.

As such, I am looking for an alternative to the "quantile" part of the code, one that will split the sample into five equal-sized groups (quintiles) based on annual expenditure.

Any help on this would be much appreciated!

Timothy Alston asked Aug 28 '12 15:08


2 Answers

Here is a solution using the data.table package, which is probably the fastest option (a big concern if you're dealing with large data sets):

library(data.table)

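# Convert the data frame to a data.table by reference
setDT(data)

# Cut income at its decile break points and label the bins 1 to 10
data[ , newVarDecile := cut(varIncome,
                            breaks = quantile(varIncome,
                                              probs = seq(0, 1, by = 0.1),
                                              na.rm = TRUE),
                            include.lowest = TRUE,
                            labels = 1:10) ]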

If you want to compute the groups separately for different subgroups, you just need to add the by argument:

data[ , newVarQuintiles := cut(varIncome,
                               breaks = quantile(varIncome,
                                                 probs = seq(0, 1, by = 0.2),
                                                 na.rm = TRUE),
                               include.lowest = TRUE,
                               labels = 1:5),
      by = groupVar ]

P.S. Note that this second example computes income quintiles rather than deciles, which only required changing the probs and labels arguments.
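
Since data, varIncome and groupVar above are placeholders, here is a minimal sketch of the same cut() plus quantile() pattern applied to the numbers from the question. The names dt and quintile are my own choices for illustration, not something the question or the package prescribes.

```r
library(data.table)

# The question's data, stored in a column named annual_exp
dt <- data.table(annual_exp = c(6000, 4200, 240001, 750, 20000, 3470, 10500,
                                2400, 2280, 36000, 3600, 20000, 2000, 12000,
                                1200, 3000, 4500, 64000))

# Cut expenditure at its quintile break points and label the bins 1 to 5
dt[, quintile := cut(annual_exp,
                     breaks = quantile(annual_exp,
                                       probs = seq(0, 1, by = 0.2),
                                       na.rm = TRUE),
                     include.lowest = TRUE,
                     labels = 1:5)]

# Number of observations in each quintile
dt[, .N, keyby = quintile]
```

The new column is a factor; if a plain integer is needed, wrapping it in as.integer() should work here because the labels 1 to 5 are in the same order as the factor levels.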

rafa.pereira answered Sep 21 '22 16:09


ggplot2 has a nice utility function, cut_number(), which does just what you want.

library(ggplot2)
as.numeric(cut_number(annual_exp[[1]], n = 5))
# [1] 3 3 5 1 4 2 4 2 1 5 3 4 1 4 1 2 3 5
Josh O'Brien answered Sep 21 '22 16:09
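
One possible reason for wildly uneven groups, which the question does not confirm, is ties: when many observations share the same value, cutting at quantile break points cannot separate them. If strictly equal group sizes matter more than keeping tied values together, a rank-based split is one alternative. The following is a rough base R sketch under that assumption; it is a different technique from the two answers above, and it assumes there are no missing values.

```r
x <- c(6000, 4200, 240001, 750, 20000, 3470, 10500, 2400, 2280,
       36000, 3600, 20000, 2000, 12000, 1200, 3000, 4500, 64000)

# Rank every observation, breaking ties by order of appearance,
# then scale the ranks 1..n onto the groups 1..5
quintile <- ceiling(rank(x, ties.method = "first") * 5 / length(x))

# Group sizes differ by at most one observation
table(quintile)
```

The trade-off is that tied values can be split across adjacent groups, as happens with the two observations of 20000 here, so this only makes sense when that is acceptable.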