I am trying to categorize a numeric variable (age) into groups defined by intervals so it will not be continuous. I have this code:
data$agegrp(data$age >= 40 & data$age <= 49) <- 3
data$agegrp(data$age >= 30 & data$age <= 39) <- 2
data$agegrp(data$age >= 20 & data$age <= 29) <- 1
The above code is not working under the survival package. It gives me this error:
invalid function in complex assignment
Can you point me to where the error is? Here, data is the data frame I am using.
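The error message is consistent with R reading data$agegrp(...) as a function call, because the conditions are wrapped in parentheses rather than square brackets. A minimal sketch of the bracket version follows; the small data frame here is made up for illustration and stands in for the asker's data.

```r
# Hypothetical sample data; the real data frame is not shown in the question
data <- data.frame(age = c(22, 35, 41, 29, 30, 49))

# Create the column first so every row has a value (NA until assigned)
data$agegrp <- NA_real_

# Square brackets subset the column; parentheses would be parsed as a function call
data$agegrp[data$age >= 40 & data$age <= 49] <- 3
data$agegrp[data$age >= 30 & data$age <= 39] <- 2
data$agegrp[data$age >= 20 & data$age <= 29] <- 1

data$agegrp
```

Rows whose age falls outside 20 to 49 keep the initial NA in this sketch.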
A median split is one method for turning a continuous variable into a categorical one. The idea is to find the median of the continuous variable: any value below the median is put in the category “Low” and every value above it is labeled “High.”
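A median split can be sketched in a couple of lines of base R. The vector x is made-up data, and putting values exactly equal to the median into "High" is an assumption of this sketch, since the description above does not say where ties go.

```r
# Hypothetical continuous variable
x <- c(3, 8, 1, 9, 5, 7)

med <- median(x)

# Values below the median become "Low"; everything else (including ties) becomes "High"
split_group <- ifelse(x < med, "Low", "High")
split_group
```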
Variables may be classified into two main categories: categorical and numeric. Each category is then classified in two subcategories: nominal or ordinal for categorical variables, discrete or continuous for numeric variables.
You can use the cut() function in R to create a categorical variable from a continuous one. Note that breaks specifies the values to split the continuous variable on and labels specifies the label to give to the values of the new categorical variable.
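As a small illustration of breaks and labels, the sketch below bins a made-up ages vector into three groups. The label strings are hypothetical names chosen for this example, and right = FALSE is used so that each interval includes its lower bound.

```r
# Hypothetical ages to categorize
ages <- c(23, 35, 47, 30, 29)

# Intervals are [20,30), [30,40), [40,50) because right = FALSE
age_group <- cut(ages,
                 breaks = c(20, 30, 40, 50),
                 labels = c("20-29", "30-39", "40-49"),
                 right = FALSE)
age_group
```

The result is a factor, so it can be passed directly to table() or used as a grouping variable in a model.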
Binning or discretization is the process of transforming numerical variables into categorical counterparts. An example is to bin values for Age into categories such as 20-39, 40-59, and 60-79. Numerical variables are usually discretized in the modeling methods based on frequency tables (e.g., decision trees).
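The Age example above, combined with a frequency table, might look like the following sketch. The ages are invented for illustration, and treating the bins as 20 to 39, 40 to 59, and 60 to 79 inclusive is an assumption based on the category names.

```r
# Hypothetical ages
ages <- c(25, 38, 41, 59, 60, 79, 33)

# Bin into [20,40), [40,60), [60,80) and label the bins
bins <- cut(ages,
            breaks = c(20, 40, 60, 80),
            labels = c("20-39", "40-59", "60-79"),
            right = FALSE)

# Frequency table of the binned variable
counts <- table(bins)
counts
```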
I would use findInterval() here:
First, make up some sample data:
set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43
Use findInterval() to categorize your "ages" vector.
findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
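Since findInterval() returns plain integer codes, one possible follow-up is to use them to index a vector of labels. The label strings and the short ages vector below are hypothetical and only serve this sketch.

```r
# Hypothetical ages and group labels
ages <- c(27, 31, 47, 39, 40)
group_labels <- c("20-29", "30-39", "40-49")

# Integer codes: 1 for [20,30), 2 for [30,40), 3 for 40 and above
idx <- findInterval(ages, c(20, 30, 40))

# Map the codes to labels and store them as a factor
age_group <- factor(group_labels[idx], levels = group_labels)
age_group
```

Note that the last interval has no upper bound here, so an age of 50 or more would also get code 3, and an age below 20 would get code 0, which this label lookup does not handle.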
Alternatively, as recommended in the comments, cut() is also useful here:
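# Returns a factor whose levels are interval labels such as [20,30)
cut(ages, breaks = c(20, 30, 40, 50), right = FALSE)
# With labels = FALSE, returns integer codes instead of a factor
cut(ages, breaks = c(20, 30, 40, 50), right = FALSE, labels = FALSE)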
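The right argument decides which end of each interval is closed, which matters for ages that sit exactly on a break. The sketch below compares the two settings on boundary values picked for illustration.

```r
brks <- c(20, 30, 40, 50)

# right = FALSE: intervals are [20,30), [30,40), [40,50), so 30 falls in the second one
left_closed <- as.character(cut(30, breaks = brks, right = FALSE))

# Default right = TRUE: intervals are (20,30], (30,40], (40,50], so 30 falls in the first one
right_closed <- as.character(cut(30, breaks = brks))

# With the default, the lowest break itself is excluded unless include.lowest = TRUE
lowest_default <- cut(20, breaks = brks)

left_closed
right_closed
is.na(lowest_default)
```

For age groups like 20 to 29 and 30 to 39, right = FALSE matches the intended grouping.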
We can use dplyr:
library(dplyr)

data <- data %>%
  mutate(agegroup = case_when(
    age >= 40 & age <= 49 ~ '3',
    age >= 30 & age <= 39 ~ '2',
    age >= 20 & age <= 29 ~ '1'
  ))  # ages that match no condition get NA
Compared to the other approaches, the dplyr version spells out each condition explicitly, which can make it easier to write and interpret.