
R: Why does "ifelse" coerce factor into integer? [duplicate]

Tags:

r

I'm attempting to change values of a variable into NA values if they're not in a vector:

sample <- factor(c('01', '014', '1', '14', '24'))
df <- data.frame(var1 = 1:6, var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))
df$var2 <- ifelse(df$var2 %in% sample, df$var2, NA)

For some reason R does not preserve the original values of the factor variable but turns them into a sequence of integers:

> sample <- factor(c('01', '014', '1', '14', '24'))
> df <- data.frame(var1 = 1:6, 
                   var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))
> class(df$var2)
[1] "factor"
> df
  var1    var2
1    1      01
2    2      24
3    3    none
4    4       1
5    5 unknown
6    6      24
> df$var2 <- ifelse(df$var2 %in% sample, df$var2, NA)
> class(df$var2)
[1] "integer"
> df
  var1 var2
1    1    1
2    2    3
3    3   NA
4    4    2
5    5   NA
6    6    3

Why does this happen and what would be the correct way of achieving what I'm trying to here?

(I need to use factors rather than integers in order not to confuse "01" and "1" and my original data set is large, so using factors rather than characters should save me some memory)

lillemets asked Nov 10 '16 08:11


1 Answer

I think one way to achieve what you are trying to do is to change the levels of your factor:

levels(df$var2)[!levels(df$var2) %in% sample] <- NA

Setting the unwanted levels to NA removes them from the factor, so every value that carried one of those levels becomes NA. The result will be:

> df
  var1 var2
1    1   01
2    2   24
3    3 <NA>
4    4    1
5    5 <NA>
6    6   24

> df$var2
[1] 01   24   <NA> 1    <NA> 24  
Levels: 01 1 24

The unknown and none values are no longer among the factor levels. If you would rather keep unknown and none as levels and only set the non-matching values to NA, you could try this:

df$var2[!df$var2 %in% sample] <- NA

> df
  var1 var2
1    1   01
2    2   24
3    3 <NA>
4    4    1
5    5 <NA>
6    6   24


> df$var2
[1] 01   24   <NA> 1    <NA> 24  
Levels: 01 1 24 none unknown
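
The two approaches differ only in which levels remain afterwards. As a minimal sketch (the name trimmed is just an illustrative variable, not something from the question), base R's droplevels() can turn the second result into the first by discarding levels that no longer occur in the data:

```r
sample <- factor(c('01', '014', '1', '14', '24'))
df <- data.frame(var1 = 1:6,
                 var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))

# Second approach: set the non-matching values to NA, leave the level set alone.
df$var2[!df$var2 %in% sample] <- NA
levels(df$var2)   # still contains "none" and "unknown"

# droplevels() removes levels that are no longer used by any value.
trimmed <- droplevels(df$var2)
levels(trimmed)
```

Keeping the unused levels can be useful if the column is later combined with data in which none or unknown are meaningful; otherwise dropping them keeps summaries and tables tidy.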

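ifelse changes the class of your data because it does not keep the attributes of its yes and no arguments: the result takes its shape and attributes from the test vector and is only filled with values from yes or no. For a factor, that means the underlying integer codes are carried over while the levels and the class are lost. See the second answer here: How to prevent ifelse() from turning Date objects into numeric objects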
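
A small illustration of this behaviour, using a toy factor x and a logical vector keep that are made up for the example. It also sketches one possible base R workaround: run ifelse on the character labels and rebuild the factor afterwards.

```r
# Toy factor; its levels are sorted as "01", "1", "24".
x <- factor(c("01", "24", "1"))
keep <- c(TRUE, FALSE, TRUE)

# ifelse() starts from the logical test vector and copies values in from
# yes/no, so only the factor's underlying integer codes survive.
res <- ifelse(keep, x, NA)
class(res)   # "integer"

# Possible workaround: apply ifelse() to the labels, then rebuild the factor
# with the original level set.
fixed <- factor(ifelse(keep, as.character(x), NA), levels = levels(x))
class(fixed)
```

The workaround briefly creates a character vector, so for a very large column the in-place assignments shown earlier are likely the cheaper choice.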

A final option, as @tchakravarty mentioned in the comments, is to use if_else from dplyr, which is stricter about types than ifelse and keeps the factor class.
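
A hedged sketch of that option. It assumes dplyr 1.1.0 or later is installed, since the example relies on a bare NA being accepted for the false argument, which may not hold in older releases; the guard simply skips the call otherwise.

```r
sample <- factor(c('01', '014', '1', '14', '24'))
df <- data.frame(var1 = 1:6,
                 var2 = factor(c('01', '24', 'none', '1', 'unknown', '24')))

# Assumption: dplyr >= 1.1.0. Skip the call if the package is missing or older.
has_dplyr <- requireNamespace("dplyr", quietly = TRUE) &&
  utils::packageVersion("dplyr") >= "1.1.0"

if (has_dplyr) {
  # if_else() returns the common type of its true/false arguments,
  # so the column stays a factor instead of collapsing to integer codes.
  df$var2 <- dplyr::if_else(df$var2 %in% sample, df$var2, NA)
}

class(df$var2)
```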

User2321 answered Oct 01 '22 23:10