R: creating a categorical variable from a numerical variable and custom/open-ended/single-valued intervals

Tags: r, dplyr, binning, intervals, categorical-data

I often find myself trying to create a categorical variable from a numerical variable + a user-provided set of ranges.

For instance, say that I have a data.frame with a numeric variable df$V and would like to create a new variable df$VCAT such that:

df$VCAT = 0 if df$V is equal to 0
df$VCAT = 1 if df$V is between 0 and 10 (i.e. (0,10))
df$VCAT = 2 if df$V is equal to 10 (i.e. [10,10])
df$VCAT = 3 if df$V is between 10 and 20 (i.e. (10,20))
df$VCAT = 4 if df$V is greater than or equal to 20 (i.e. [20,Inf))

I am currently doing this by hard coding the "scoring function" myself by doing something like:

library(dplyr)
df = data.frame(V = seq(1,100))
df = df %>% mutate(VCAT = (V>0) + (V==10) + 2*(V>10) + (V>=20))

I am wondering if there is an easier way to do this in R, preferably using dplyr (so that I can chain commands). Ideally, I am looking for a short function that can be used in mutate and that takes the variable V and a vector describing the ranges, such as buckets. Note that buckets may not be the best way to describe the ranges, since it is not clear to me how it would let users customize the endpoints of the ranges.
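For illustration, the hard-coded expression above can be read as counting how many thresholds V has passed, since (V==10) + 2*(V>10) equals (V>=10) + (V>10). The following is a minimal sketch of a reusable helper under that reading. The name count_thresholds and its arguments cuts and strict are hypothetical, not part of dplyr or base R.

```r
# Hypothetical helper: counts how many thresholds each value has passed.
# cuts[i] is a threshold value; strict[i] is TRUE for the test "v > cuts[i]"
# and FALSE for the test "v >= cuts[i]".
count_thresholds <- function(v, cuts, strict) {
  stopifnot(length(cuts) == length(strict))
  code <- integer(length(v))
  for (i in seq_along(cuts)) {
    passed <- if (strict[i]) v > cuts[i] else v >= cuts[i]
    code <- code + passed  # logical is added as 0/1
  }
  code
}

# Thresholds matching the question: >0, >=10, >10, >=20
V <- c(0, 5, 10, 15, 20, 100)
VCAT <- count_thresholds(V,
                         cuts   = c(0, 10, 10, 20),
                         strict = c(TRUE, FALSE, TRUE, FALSE))
print(VCAT)
```

Because the helper is vectorised over v, it should also work inside mutate, e.g. mutate(VCAT = count_thresholds(V, cuts, strict)). Listing a value twice (once non-strict, once strict) is what gives it a single-valued category of its own.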

asked Feb 04 '16 14:02

Berk U.

1 Answer

One way I bin numbers is to remove the remainder using the modulus operator, %%. E.g. to bin into groups of 20:

#load dplyr for mutate
library(dplyr)

#create raw data
unbinned<-c(1.1,1.53,5,8.3,33.5,49.22,55,57.9,79.6,81,95,201,213)
rawdata<-as.data.frame(unbinned)

#bin the data into groups of 20
binneddata<-mutate(rawdata,binned=unbinned-unbinned %% 20)

#print the data
binneddata

This produces the output:

   unbinned binned
1      1.10      0
2      1.53      0
3      5.00      0
4      8.30      0
5     33.50     20
6     49.22     40
7     55.00     40
8     57.90     40
9     79.60     60
10    81.00     80
11    95.00     80
12   201.00    200
13   213.00    200

So 0 represents 0-<20, 20 represents 20-<40, 40 represents 40-<60, etc. (Of course, divide the binned value by 20 to get group numbers like those in the original question.)
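As a small sketch of that division step, integer division with %/% gives the bin number directly. The variable names width, bin_start and bin_index are only illustrative.

```r
# Same raw values as in the answer
unbinned <- c(1.1, 1.53, 5, 8.3, 33.5, 49.22, 55, 57.9, 79.6, 81, 95, 201, 213)
width <- 20  # bin width used in the answer

bin_start <- unbinned - unbinned %% width  # lower bound of each bin
bin_index <- unbinned %/% width            # bin number: 0 for 0-<20, 1 for 20-<40, ...

print(data.frame(unbinned, bin_start, bin_index))
```

Note that the bin numbers skip values when a bin is empty: 201 and 213 land in bin 10, not in the next unused number.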

Bonus

If you convert the binned values into strings to use them as categorical variables in ggplot etc., they will sort strangely: 200 will come before 40, because '2' sorts before '4'. To get around this, use the sprintf function to add leading zeros (the 3 in %03d should be the number of digits in the longest number you expect):

#convert the data into strings with leading zeros
binnedstring<-mutate(binneddata,bin_as_character=sprintf('%03d',binned))

#print the data
binnedstring

giving the output:

   unbinned binned bin_as_character
1      1.10      0              000
2      1.53      0              000
3      5.00      0              000
4      8.30      0              000
5     33.50     20              020
etc.

If you want labels such as 000-<020, create the upper bound using arithmetic and concatenate using the paste function:

#make human readable bin value
binnedstringband<-mutate(
    binnedstring,
    nextband=binned+20,
    human_readable=paste(bin_as_character,'-<',sprintf('%03d',nextband),sep='')
)

#print the data
binnedstringband

Giving:

   unbinned binned bin_as_character nextband     human_readable
1      1.10      0              000       20           000-<020
2      1.53      0              000       20           000-<020
3      5.00      0              000       20           000-<020
4      8.30      0              000       20           000-<020
5     33.50     20              020       40           020-<040
etc.
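The binning, zero-padding and pasting steps above could be wrapped into one function. This is only a sketch in base R: the name band_label and its arguments width and digits are hypothetical, and it assumes non-negative inputs.

```r
# Hypothetical helper: label each value with its fixed-width band, e.g. "020-<040".
# width  = size of each band
# digits = number of digits to zero-pad to
band_label <- function(x, width = 20, digits = 3) {
  lower <- x - x %% width          # lower bound of the band
  upper <- lower + width           # upper bound (exclusive)
  fmt <- paste0("%0", digits, "d") # e.g. "%03d"
  # round() guards against tiny floating-point error before converting to integer
  paste0(sprintf(fmt, as.integer(round(lower))),
         "-<",
         sprintf(fmt, as.integer(round(upper))))
}

labels <- band_label(c(1.1, 33.5, 213))
print(labels)
```

Since the function is vectorised, it should be usable inside mutate as well, e.g. mutate(rawdata, human_readable = band_label(unbinned)).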

answered Nov 07 '22 02:11

sean
