In R, I have 600,000 categorical variables, each of which is classified as "0", "1", or "2".
What I would like to do is collapse "1" and "2" and leave "0" by itself, such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1". In the end I only want "0" and "1" as categories for each of the variables.
Also, if possible, I would rather not create 600,000 new variables; if I can replace the existing variables with the new values, that would be great!
What would be the best way to do this?
I find this is even more generic using factor(new.levels[x]):
> x <- factor(sample(c("0","1","2"), 10, replace=TRUE))
> x
[1] 0 2 2 2 1 2 2 0 2 1
Levels: 0 1 2
> new.levels<-c(0,1,1)
> x <- factor(new.levels[x])
> x
[1] 0 1 1 1 1 1 1 0 1 1
Levels: 0 1
The new levels vector must be the same length as the number of levels in x, so you can do more complicated recodes as well, using strings and NAs for example. Applied to the original three-level x:
> x <- factor(c("old", "new", NA)[x])
> x
[1] old <NA> <NA> <NA> new <NA> <NA> old
[9] <NA> new
Levels: new old
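Since new.levels[x] indexes by the factor's integer codes, it depends on the order of the levels. As a rough sketch of an order-independent variant, you could look values up by level name instead. The named vector `map` and the sample data here are made up for illustration, not part of the answer above:

```r
# Example factor with the three original categories
x <- factor(c("0", "2", "2", "1", "0"), levels = c("0", "1", "2"))

# Hypothetical lookup table: names are the old levels, values are the new ones
map <- c("0" = "0", "1" = "1", "2" = "1")

# Index by the level *name* rather than by the integer code,
# then drop the names and rebuild the factor
y <- factor(unname(map[as.character(x)]))
```

This gives the same kind of result as the positional version, but it keeps working if the levels of x happen to be stored in a different order.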
recode() is a little overkill for this. The right approach depends on how the variable is currently coded. Let's say your variable is x.
If it's numeric:
x <- ifelse(x>1, 1, x)
If it's character:
x <- ifelse(x=='2', '1', x)
If it's a factor with levels 0,1,2:
levels(x) <- c(0,1,1)
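To see the three cases side by side, here is a small sketch on made-up data. The names x_num, x_chr and x_fac are just illustrative; each line is the corresponding one-liner from above:

```r
# The same five observations stored three different ways
x_num <- c(0, 1, 2, 2, 0)
x_chr <- as.character(x_num)
x_fac <- factor(x_chr, levels = c("0", "1", "2"))

# Numeric: anything above 1 becomes 1
x_num <- ifelse(x_num > 1, 1, x_num)

# Character: "2" becomes "1"
x_chr <- ifelse(x_chr == "2", "1", x_chr)

# Factor: assigning a duplicated level merges levels "1" and "2"
levels(x_fac) <- c("0", "1", "1")
```

All three end up holding 0 1 1 1 0, and the factor is left with only the levels 0 and 1.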
Any of those can be applied in place to the variable x in a data frame dta. For example:
dta$x <- ifelse(dta$x > 1, 1, dta$x)
Or, across multiple (non-factor) columns of a data frame:
df[, c('col1','col2')] <- sapply(df[, c('col1','col2')], FUN = function(x) ifelse(x == 0, x, 1))
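If the columns are factors, ifelse() is not a good fit because it returns the integer codes rather than the labels. A minimal sketch for that case, assuming every column of dta is a factor with exactly the levels 0, 1, 2 in that order (the two-column dta below is made up for illustration):

```r
# Toy data frame standing in for the real one with many columns
dta <- data.frame(
  v1 = factor(c("0", "1", "2"), levels = c("0", "1", "2")),
  v2 = factor(c("2", "2", "0"), levels = c("0", "1", "2"))
)

# dta[] <- ... overwrites the existing columns and keeps the data frame
# structure, so no new variables are created
dta[] <- lapply(dta, function(col) {
  levels(col) <- c("0", "1", "1")  # merge "1" and "2" into "1"
  col
})
```

A column whose levels differ from 0, 1, 2 would need its own replacement vector, so it may be worth checking levels() on a few columns before running this over the whole frame.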