R: Replace multiple values in multiple columns of dataframes with NA

Question

I am trying to achieve something similar to this question but with multiple values that must be replaced by NA, and in large dataset.

df <- data.frame(name = rep(letters[1:3], each = 3), foo=rep(1:9),var1 = rep(1:9), var2 = rep(3:5, each = 3))

which generates this dataframe:

df
  name foo var1 var2
1    a   1    1    3
2    a   2    2    3
3    a   3    3    3
4    b   4    4    4
5    b   5    5    4
6    b   6    6    4
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

I would like to replace all occurrences of, say, 3 and 4 by NA, but only in the columns that start with "var".

I know that I can use a combination of [] operators to achieve the result I want:

df[,grep("^var[:alnum:]?",colnames(df))][ 
        df[,grep("^var[:alnum:]?",colnames(df))] == 3 |
        df[,grep("^var[:alnum:]?",colnames(df))] == 4
   ] <- NA

df
  name foo var1 var2
1    a   1    1    NA
2    a   2    2    NA
3    a   3    NA   NA
4    b   4    NA   NA
5    b   5    5    NA
6    b   6    6    NA
7    c   7    7    5
8    c   8    8    5
9    c   9    9    5

Now my questions are the following:

Is there a way to do this in an efficient way, given that my actual dataset has about 100.000 lines, and 400 out of 500 variables start with "var". It seems (subjectively) slow on my computer when I use the double brackets technique.
How would I approach the problem if instead of 2 values (3 and 4) to be replaced by NA, I had a long list of, say, 100 various values? Is there a way to specify multiple values with having to do a clumsy series of conditions separated by | operator?

thelatemail · Accepted Answer

You can also do this using replace:

sel <- grepl("var",names(df))
df[sel] <- lapply(df[sel], function(x) replace(x,x %in% 3:4, NA) )
df

#  name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5

Some quick benchmarking using a million row sample of data suggests this is quicker than the other answers.

akrun · Answer

You could also do:

col_idx <- grep("^var", names(df))
values <- c(3, 4)
m1 <- as.matrix(df[,col_idx])
m1[m1 %in% values] <- NA
df[col_idx]  <- m1
df
#   name foo var1 var2
#1    a   1    1   NA
#2    a   2    2   NA
#3    a   3   NA   NA
#4    b   4   NA   NA
#5    b   5    5   NA
#6    b   6    6   NA
#7    c   7    7    5
#8    c   8    8    5
#9    c   9    9    5

Sven Hohenstein · Answer

Here's an approach:

# the values that should be replaced by NA
values <- c(3, 4)

# index of columns
col_idx <- grep("^var", names(df))
# [1] 3 4

# index of values (within these columns)
val_idx <- sapply(df[col_idx], "%in%", table = values)
#        var1  var2
#  [1,] FALSE  TRUE
#  [2,] FALSE  TRUE
#  [3,]  TRUE  TRUE
#  [4,]  TRUE  TRUE
#  [5,] FALSE  TRUE
#  [6,] FALSE  TRUE
#  [7,] FALSE FALSE
#  [8,] FALSE FALSE
#  [9,] FALSE FALSE

# replace with NA
is.na(df[col_idx]) <- val_idx

df
#   name foo var1 var2
# 1    a   1    1   NA
# 2    a   2    2   NA
# 3    a   3   NA   NA
# 4    b   4   NA   NA
# 5    b   5    5   NA
# 6    b   6    6   NA
# 7    c   7    7    5
# 8    c   8    8    5
# 9    c   9    9    5

R: Replace multiple values in multiple columns of dataframes with NA

Tags:

replace

dataframe

r

multiple-columns

Peutch

Video Answer

3 Answers

thelatemail

akrun

Sven Hohenstein

Recent Activity

Donate For Us

R: Replace multiple values in multiple columns of dataframes with NA

Tags:

replace

dataframe

r

multiple-columns

Peutch

Video Answer

3 Answers

thelatemail

akrun

Sven Hohenstein

Related questions

Recent Activity

Donate For Us