Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Replace multiple values in multiple columns of dataframes with NA

I am trying to achieve something similar to this question but with multiple values that must be replaced by NA, and in large dataset.

df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = rep(1:9), var2 = rep(3:5, each = 3))

which generates this dataframe:

df
  name foo var1 var2
1    a   1    1    3
2    a   2    2    3
3    a   3    3    3
4    b   4    4    4
5    b   5    5    4
6    b   6    6    4
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

I would like to replace all occurrences of, say, 3 and 4 by NA, but only in the columns that start with "var".

I know that I can use a combination of [] operators to achieve the result I want:

df[,grep("^var[:alnum:]?",colnames(df))][ 
        df[,grep("^var[:alnum:]?",colnames(df))] == 3 |
        df[,grep("^var[:alnum:]?",colnames(df))] == 4
   ] <- NA

df
  name foo var1 var2
1    a   1    1    NA
2    a   2    2    NA
3    a   3    NA   NA
4    b   4    NA   NA
5    b   5    5    NA
6    b   6    6    NA
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

Now my questions are the following:

  1. Is there a way to do this in an efficient way, given that my actual dataset has about 100.000 lines, and 400 out of 500 variables start with "var". It seems (subjectively) slow on my computer when I use the double brackets technique.
  2. How would I approach the problem if instead of 2 values (3 and 4) to be replaced by NA, I had a long list of, say, 100 various values? Is there a way to specify multiple values with having to do a clumsy series of conditions separated by | operator?
like image 735
Peutch Avatar asked Sep 10 '14 14:09

Peutch


Video Answer


3 Answers

You can also do this using replace:

sel <- grepl("var",names(df))
df[sel] <- lapply(df[sel], function(x) replace(x,x %in% 3:4, NA) )
df

#  name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5

Some quick benchmarking using a million row sample of data suggests this is quicker than the other answers.

like image 105
thelatemail Avatar answered Sep 29 '22 15:09

thelatemail


You could also do:

col_idx <- grep("^var", names(df))
values <- c(3, 4)
m1 <- as.matrix(df[,col_idx])
m1[m1 %in% values] <- NA
df[col_idx]  <- m1
df
#   name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5
like image 37
akrun Avatar answered Sep 29 '22 17:09

akrun


Here's an approach:

# the values that should be replaced by NA
values <- c(3, 4)

# index of columns
col_idx <- grep("^var", names(df))
# [1] 3 4

# index of values (within these columns)
val_idx <- sapply(df[col_idx], "%in%", table = values)
#        var1  var2
#  [1,] FALSE  TRUE
#  [2,] FALSE  TRUE
#  [3,]  TRUE  TRUE
#  [4,]  TRUE  TRUE
#  [5,] FALSE  TRUE
#  [6,] FALSE  TRUE
#  [7,] FALSE FALSE
#  [8,] FALSE FALSE
#  [9,] FALSE FALSE

# replace with NA
is.na(df[col_idx]) <- val_idx

df
#   name foo var1 var2
# 1    a   1    1   NA
# 2    a   2    2   NA
# 3    a   3   NA   NA
# 4    b   4   NA   NA
# 5    b   5    5   NA
# 6    b   6    6   NA
# 7    c   7    7    5
# 8    c   8    8    5
# 9    c   9    9    5
like image 4
Sven Hohenstein Avatar answered Sep 29 '22 16:09

Sven Hohenstein