
How to collapse categories or recategorize variables?

In R, I have 600,000 categorical variables, each of which is classified as "0", "1", or "2".

What I would like to do is collapse "1" and "2" and leave "0" by itself, such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1". In the end I only want "0" and "1" as categories for each of the variables.

Also, if possible, I would rather not create 600,000 new variables; if I can replace the existing variables with the new values, that would be great!

What would be the best way to do this?

CCA asked Jul 16 '10 17:07



2 Answers

I find this is even more generic using factor(new.levels[x]):

> x <- factor(sample(c("0","1","2"), 10, replace=TRUE)) 
> x
 [1] 0 2 2 2 1 2 2 0 2 1
Levels: 0 1 2
> new.levels<-c(0,1,1)
> x <- factor(new.levels[x])
> x
 [1] 0 1 1 1 1 1 1 0 1 1
Levels: 0 1

The new levels vector must be the same length as the number of levels in x, so you can also do more complicated recodes, for example using strings and NAs (applied here to the original three-level x):

> x <- factor(c("old", "new", NA)[x])
> x
 [1] old    <NA>   <NA>   <NA>   new    <NA>   <NA>   old   
 [9] <NA>   new    
Levels: new old
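
To connect this with the many variables in the question, here is a minimal sketch that applies the same lookup to every column of a data frame in place. The data frame `dat` and its columns `v1` and `v2` are made up for illustration, and the sketch assumes every column is a factor whose levels are exactly 0, 1, 2 in that order.

```r
# Hypothetical data: two factor columns, each with levels "0", "1", "2"
dat <- data.frame(
  v1 = factor(c("0", "2", "1", "2"), levels = c("0", "1", "2")),
  v2 = factor(c("1", "0", "0", "2"), levels = c("0", "1", "2"))
)

new.levels <- c(0, 1, 1)  # one entry per old level, in level order

# Indexing by a factor uses its integer codes, so each value is looked up
# by level position. dat[] <- overwrites the existing columns in place.
dat[] <- lapply(dat, function(col) factor(new.levels[col]))

str(dat)
```

If some column is missing a level (say it never contains "1"), the positions shift and the lookup would be wrong, which is why the levels are set explicitly here.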
maja zaloznik answered Nov 02 '22 14:11



recode()'s a little overkill for this. Your case depends on how it's currently coded. Let's say your variable is x.

If it's numeric

x <- ifelse(x>1, 1, x)

If it's character

x <- ifelse(x=='2', '1', x)

If it's a factor with levels 0, 1, 2

levels(x) <- c(0,1,1)
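
For the factor case, assigning to levels() merges any levels that end up with the same new name. Here is a small sketch on a made-up vector `f`. The second, list-based form names each new level and the old levels it absorbs, so it does not depend on the order of the existing levels.

```r
f <- factor(c("0", "2", "1", "2", "0"))  # made-up example data

g <- f
levels(g) <- c(0, 1, 1)  # positional: old levels 0, 1, 2 become 0, 1, 1

h <- f
levels(h) <- list("0" = "0", "1" = c("1", "2"))  # new level = old level(s)

print(g)
print(h)
```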

Any of those can be applied across a data frame dta to the variable x in place. For example...

 dta$x <- ifelse(dta$x > 1, 1, dta$x)

Or, for multiple columns of a data frame

 df[,c('col1','col2')] <- sapply(df[,c('col1','col2')], FUN = function(x) ifelse(x==0, x, 1))
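
If every column needs the same recode, something along these lines overwrites them all without naming each one. This is only a sketch: the data frame `wide` and its column names are hypothetical, and it assumes all columns are numeric and coded 0/1/2.

```r
# Hypothetical numeric data coded 0/1/2
wide <- data.frame(a = c(0, 1, 2, 2), b = c(2, 0, 0, 1), d = c(1, 1, 0, 2))

# x > 0 gives TRUE/FALSE; as.integer() turns that into 1/0 (NA stays NA).
# wide[] <- assigns back into the existing columns instead of creating new ones.
wide[] <- lapply(wide, function(x) as.integer(x > 0))

print(wide)
```

Using lapply() with `wide[] <-` keeps the result a data frame, whereas sapply() builds a matrix first.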
John answered Nov 02 '22 14:11
