
Categorize numeric variable into group/ bins/ breaks

I am trying to categorize a numeric variable (age) into groups defined by intervals so it will not be continuous. I have this code:

    data$agegrp(data$age >= 40 & data$age <= 49) <- 3
    data$agegrp(data$age >= 30 & data$age <= 39) <- 2
    data$agegrp(data$age >= 20 & data$age <= 29) <- 1

The code above does not work under the survival package. It gives me:

invalid function in complex assignment 

Can you point out where the error is? data is the data frame I am using.

leian asked Oct 19 '12 17:10




2 Answers

I would use findInterval() here:

First, make up some sample data:

    set.seed(1)
    ages <- floor(runif(20, min = 20, max = 50))
    ages
    # [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43

Use findInterval() to categorize your "ages" vector.

    findInterval(ages, c(20, 30, 40))
    # [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
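To attach the result to a data frame like the one in the question, the same call can be assigned to a new column. The following is a minimal sketch on a made-up data frame (the ages are illustrative, not the asker's data). Note that findInterval() returns 0 for values below the first break and keeps assigning the last group to anything above it, so this sketch adds a break at 50 and turns out-of-range codes into NA:

```r
# Toy data frame standing in for the asker's `data` (values are made up)
data <- data.frame(age = c(19, 20, 29, 30, 39, 40, 49, 50))

# Breaks at 20, 30, 40, 50 give codes 0 (<20), 1, 2, 3 and 4 (>=50)
grp <- findInterval(data$age, c(20, 30, 40, 50))

# Keep only the three groups from the question; everything else becomes NA
grp[grp < 1 | grp > 3] <- NA
data$agegrp <- grp

data$agegrp
# [1] NA  1  1  2  2  3  3 NA
```

Whether out-of-range ages should become NA or be folded into the nearest group is a choice that depends on the analysis; the NA handling here is only one option.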

Alternatively, as recommended in the comments, cut() is also useful here:

    cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
    cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)
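The first call returns a factor whose levels are the interval labels, and the second returns plain integer codes. If readable group names are wanted, labels also accepts a character vector. A small sketch, where the label strings are illustrative choices rather than anything from the question:

```r
set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))

# right = FALSE makes the intervals [20,30), [30,40), [40,50)
agegrp <- cut(ages, breaks = c(20, 30, 40, 50), right = FALSE,
              labels = c("20-29", "30-39", "40-49"))

levels(agegrp)
# [1] "20-29" "30-39" "40-49"

# Number of ages falling in each group
table(agegrp)
```

With these breaks the underlying integer codes of the factor match the findInterval() result above; values outside 20 to 49 would come back as NA.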
A5C1D2H2I1M1N2O1R2T1 answered Sep 25 '22 14:09


We can use dplyr:

    library(dplyr)

    data <- data %>%
      mutate(agegroup = case_when(age >= 40 & age <= 49 ~ '3',
                                  age >= 30 & age <= 39 ~ '2',
                                  age >= 20 & age <= 29 ~ '1')) # end function
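To see what this produces, here is a small self-contained run on a made-up data frame (the ages are illustrative, not from the question). Rows that match none of the conditions get NA, and the groups come back as character strings because the right-hand sides are quoted:

```r
library(dplyr)

# Made-up ages, including two outside the 20-49 range
data <- data.frame(age = c(18, 25, 33, 41, 55))

data <- data %>%
  mutate(agegroup = case_when(age >= 40 & age <= 49 ~ '3',
                              age >= 30 & age <= 39 ~ '2',
                              age >= 20 & age <= 29 ~ '1'))

data$agegroup
# [1] NA  "1" "2" "3" NA
```

If a factor or a number is needed later (for example as a model covariate), the column can be converted afterwards with factor() or as.integer().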

Compared with the other approaches, I find the case_when() version easier to write and read, because each condition sits right next to the group it produces.
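For comparison, the base R code from the question can be made to work by indexing with square brackets instead of parentheses. Written as data$agegrp(...) <- 3, R treats the left-hand side as a function call, which is what triggers "invalid function in complex assignment". A minimal sketch, assuming a made-up data frame and initialising the column to NA first so that unmatched rows stay missing:

```r
# Made-up ages, including two outside the 20-49 range
data <- data.frame(age = c(18, 25, 33, 41, 55))

# Create the column first, then fill it group by group with [ ] indexing
data$agegrp <- NA
data$agegrp[data$age >= 40 & data$age <= 49] <- 3
data$agegrp[data$age >= 30 & data$age <= 39] <- 2
data$agegrp[data$age >= 20 & data$age <= 29] <- 1

data$agegrp
# [1] NA  1  2  3 NA
```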

TYL answered Sep 22 '22 14:09