R: Randomly Replace Values with NA

Question

I am working with the R programming language. I am trying to select 10% of the elements in my dataset (excluding elements in the first column) and replace them with NA. I tried to do this with the following code:

library(longitudinalData)
data(artificialLongData)

second_dataset = artificialLongData
second_dataset[sample(nrow(second_dataset),0.1*nrow(second_dataset ))]<- NA

This produces the following error:

Error in `[<-.data.frame`(`*tmp*`, sample(nrow(second_dataset), 0.1 *  : 
  new columns would leave holes after existing columns

Can someone please show me how to fix this problem?

Thanks!

Note: The final result should look something like this:

  id    t0    t1    t2    t3    t4    t5    t6    t7    t8    t9   t10
1 s1    NA    NA -1.85 -2.05  1.01  1.56    NA  0.52 -0.06 -1.09  0.44
2 s2 -4.88 -2.95 -2.38  3.73 -2.77  1.72 -0.99 -0.70    NA  2.38 -0.72
3 s3    NA -0.86    NA -2.04 -1.18  4.89    NA  0.50  4.90 -0.52    NA

jay.sf · Accepted Answer

You could replace random elements column by column with lapply.

set.seed(42)
r1 <- as.data.frame(lapply(dat, \(x) replace(x, sample(length(x), .1*length(x)), NA)))

r1
#    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1  NA  7 NA 10  3 11  4  4 NA   7
# 2   6  6  8  8  4 11 NA  8 10   9
# 3   1 12  4  5 12  3 10  3 11   1
# 4   3 10  6  2 11 NA  3 11  2  11
# 5   8 NA 10 12  5  7  2  9  4  10
# 6  12  4  9 12  9  2  7  9  8   8
# 7   7  5  9  4  2 12 12  3  4   4
# 8  12  5  3  1  6  1  4  7  6  NA
# 9   4  6 12 NA  5  8  4  4  6   7
# 10  3  2 11  3 NA  5  4 NA  2   4

mean(is.na(r1))
# [1] 0.1
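
To leave the first column untouched, as the question asks, the same per-column replacement can be restricted to the remaining columns. This is only a minimal sketch on a made-up data frame `toy` with an `id` column (an assumption for illustration, not the asker's data):

```r
set.seed(42)

# Hypothetical toy data: an id column plus three numeric columns
toy <- data.frame(
  id = paste0("s", 1:10),
  t0 = rnorm(10),
  t1 = rnorm(10),
  t2 = rnorm(10)
)

# Number of values to blank out per column (10% of the rows)
n_na <- round(0.1 * nrow(toy))

# Replace values in every column except the first one
toy[-1] <- lapply(toy[-1], function(x) replace(x, sample(length(x), n_na), NA))

# NA count per column: 0 for id, n_na for every other column
colSums(is.na(toy))
```

Indexing with `toy[-1]` on both sides keeps the result a data frame and leaves `id` as it was.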

However, this replaces exactly .1 of the values in each column with NA. If we instead want each cell to be replaced with NA independently with a probability of .1, we could use apply with MARGIN=1:2.

set.seed(42)
p <- .1
r2 <- as.data.frame(apply(dat, 1:2, \(x) sample(c(x, NA), 1, prob=c((1 - p), p))))

r2
#    X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1  NA  7 NA 10  3 11  4  4 12   7
# 2  NA  6  8  8  4 11 NA  8 10   9
# 3   1 NA NA  5 12  3 10  3 11   1
# 4   3 10 NA  2 NA  9  3 11  2  NA
# 5   8 12 10 12  5  7  2  9  4  NA
# 6  12 NA  9 12 NA  2  7  9  8   8
# 7   7 NA  9  4  2 12 12  3  4   4
# 8  12  5 NA  1  6  1  4  7  6  12
# 9   4  6 12 NA NA  8  4  4  6   7
# 10  3  2 11  3  3  5  4  8  2   4
mean(is.na(r2))
# [1] 0.16
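
On larger data, calling sample once per cell can be slow. A possible vectorized alternative, sketched here on a made-up all-numeric data frame `toy` (an assumption, not the data used above), draws one uniform number per cell and blanks the cells where the draw falls below p. As with r2, the realized share of NA is random and matches p only on average.

```r
set.seed(42)
p <- 0.1

# Hypothetical toy data: 20 rows, 10 numeric columns
toy <- as.data.frame(matrix(rnorm(200), nrow = 20))

# Logical mask with the same shape as the data: TRUE with probability p
mask <- matrix(runif(nrow(toy) * ncol(toy)) < p, nrow = nrow(toy))

# A logical matrix index sets the flagged cells to NA, column by column
toy_na <- toy
toy_na[mask] <- NA

# Realized share of NA: near p, but in general not exactly p
mean(is.na(toy_na))
```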

If the data can be coerced with as.matrix, you could treat it like a vector

set.seed(42)
m <- as.matrix(dat)
m[sample(seq_along(m), .1*length(m))] <- NA

m
#       X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#  [1,]  6  7  1 10  3 11  4 NA 12   7
#  [2,]  6  6  8  8  4 11 10  8 10   9
#  [3,]  1 12  4  5 12  3 10  3 11   1
#  [4,]  3 10 NA  2 11  9  3 NA  2  11
#  [5,]  8 12 NA 12  5  7 NA  9  4  10
#  [6,] 12  4  9 12  9  2  7  9  8   8
#  [7,]  7  5  9  4 NA 12 12  3  4   4
#  [8,] 12 NA  3  1  6  1  4  7  6  12
#  [9,]  4  6 12  3 NA  8  4  4 NA   7
# [10,]  3  2 11  3  3  5  4  8  2  NA

mean(is.na(m))
# [1] 0.1

and coerce back to "data.frame".

dat_na <- as.data.frame(m) |> type.convert(as.is=TRUE)

type.convert takes care of restoring classes like "numeric" and "character", since a matrix can only have one mode. Note that you may lose attributes in the process.
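
When the columns have mixed types, or the first column has to stay untouched as in the question, the same idea might be applied without converting to a matrix at all. The sketch below uses a made-up data frame `toy` (an assumption, not the asker's data) and marks exactly 10% of the cells outside the first column through a logical mask.

```r
set.seed(42)

# Hypothetical toy data: an id column plus numeric and character columns
toy <- data.frame(
  id  = paste0("s", 1:10),
  t0  = rnorm(10),
  t1  = rnorm(10),
  grp = sample(letters[1:3], 10, replace = TRUE),
  t2  = rnorm(10)
)

vals <- toy[-1]  # everything except the first column

# Mark exactly 10% of the cells in `vals` for replacement
mask <- matrix(FALSE, nrow = nrow(vals), ncol = ncol(vals))
mask[sample(length(mask), round(0.1 * length(mask)))] <- TRUE

vals[mask] <- NA  # assignment happens per column, so each column keeps its type
toy[-1] <- vals

sum(is.na(toy))     # 4, i.e. 10% of the 40 non-id cells
sapply(toy, class)  # column classes are unchanged
```

Because no matrix coercion takes place, type.convert is not needed afterwards.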

Data:

dat <- structure(list(X1 = c(6L, 6L, 1L, 3L, 8L, 12L, 7L, 12L, 4L, 3L
), X2 = c(7L, 6L, 12L, 10L, 12L, 4L, 5L, 5L, 6L, 2L), X3 = c(1L, 
8L, 4L, 6L, 10L, 9L, 9L, 3L, 12L, 11L), X4 = c(10L, 8L, 5L, 2L, 
12L, 12L, 4L, 1L, 3L, 3L), X5 = c(3L, 4L, 12L, 11L, 5L, 9L, 2L, 
6L, 5L, 3L), X6 = c(11L, 11L, 3L, 9L, 7L, 2L, 12L, 1L, 8L, 5L
), X7 = c(4L, 10L, 10L, 3L, 2L, 7L, 12L, 4L, 4L, 4L), X8 = c(4L, 
8L, 3L, 11L, 9L, 9L, 3L, 7L, 4L, 8L), X9 = c(12L, 10L, 11L, 2L, 
4L, 8L, 4L, 6L, 6L, 2L), X10 = c(7L, 9L, 1L, 11L, 10L, 8L, 4L, 
12L, 7L, 4L)), class = "data.frame", row.names = c(NA, -10L))

augustusgrant · Answer

Using Tidyverse:

library(tidyverse)
second_dataset <- artificialLongData  %>%
  mutate(across(-id,~ifelse(runif(length(.))>0.1,., NA)))

For each column except id, it creates a logical vector that is TRUE with probability 0.9 and FALSE with probability 0.1. Elements where the vector is TRUE are retained; elements where it is FALSE are replaced with NA.

To modify just one column, e.g. t1, we would use:

artificialLongData %>%
  mutate(t1 = ifelse(runif(length(t1)) > 0.1, t1, NA))

across() applies the function to all specified columns. -id means all columns except id.
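
A quick way to check this behaviour is to run it on a small made-up data frame. The `toy` data below is only a stand-in for artificialLongData (an assumption for illustration), and the seed is there just to make the run repeatable.

```r
library(dplyr)

set.seed(42)

# Hypothetical toy data standing in for artificialLongData
toy <- data.frame(
  id = paste0("s", 1:100),
  t0 = rnorm(100),
  t1 = rnorm(100)
)

# Each value in every column except id becomes NA with probability 0.1
toy_na <- toy %>%
  mutate(across(-id, ~ ifelse(runif(length(.x)) > 0.1, .x, NA)))

anyNA(toy_na$id)             # FALSE: id is never touched
colMeans(is.na(toy_na[-1]))  # share of NA per column, roughly 0.1
```

Since every cell is drawn independently, the share of NA differs from column to column and from run to run.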

Tags: r, data-manipulation

stats_noob
