I am working with the R programming language. I am trying to select 10% of the elements in my dataset (excluding elements in the first column) and replace them with NA. I tried to do this with the following code:
library(longitudinalData)
data(artificialLongData)
second_dataset = artificialLongData
second_dataset[sample(nrow(second_dataset),0.1*nrow(second_dataset ))]<- NA
This produces the following error:
Error in `[<-.data.frame`(`*tmp*`, sample(nrow(second_dataset), 0.1 * :
new columns would leave holes after existing columns
Can someone please show me how to fix this problem?
Thanks!
Note: The final result should look something like this:
id t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
1 s1 NA NA -1.85 -2.05 1.01 1.56 NA 0.52 -0.06 -1.09 0.44
2 s2 -4.88 -2.95 -2.38 3.73 -2.77 1.72 -0.99 -0.70 NA 2.38 -0.72
3 s3 NA -0.86 NA -2.04 -1.18 4.89 NA 0.50 4.90 -0.52 NA
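The error occurs because a single index on a data frame selects columns, not rows or cells, so sampled row numbers larger than ncol(second_dataset) refer to columns that do not exist. Instead, you could replace random elements column by column using lapply.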
set.seed(42)
r1 <- as.data.frame(lapply(dat, \(x) replace(x, sample(length(x), .1*length(x)), NA)))
r1
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 NA 7 NA 10 3 11 4 4 NA 7
# 2 6 6 8 8 4 11 NA 8 10 9
# 3 1 12 4 5 12 3 10 3 11 1
# 4 3 10 6 2 11 NA 3 11 2 11
# 5 8 NA 10 12 5 7 2 9 4 10
# 6 12 4 9 12 9 2 7 9 8 8
# 7 7 5 9 4 2 12 12 3 4 4
# 8 12 5 3 1 6 1 4 7 6 NA
# 9 4 6 12 NA 5 8 4 4 6 7
# 10 3 2 11 3 NA 5 4 NA 2 4
mean(is.na(r1))
# [1] 0.1
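To match the layout in the question, where the first column holds an id, the same idea can be restricted to every column except the first. This is only a minimal sketch: the data frame `toy` and its column names are made up for illustration and are not from the original post.

```r
set.seed(42)

# Hypothetical toy data: an id column followed by three numeric columns
toy <- data.frame(
  id = paste0("s", 1:20),
  t0 = rnorm(20),
  t1 = rnorm(20),
  t2 = rnorm(20)
)

res <- toy
# Replace 10% of the values in each column except the first with NA;
# round() guards against a fractional sample size
res[-1] <- lapply(toy[-1], function(x) {
  replace(x, sample(length(x), round(0.1 * length(x))), NA)
})

colSums(is.na(res))
```

Because the sampling happens per column, every non-id column ends up with the same number of NA values, while `id` is left untouched.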
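However, this replaces exactly .1 of the values in each column with NA. If instead each cell should be replaced with NA independently with a probability of .1, we could use apply with MARGIN=1:2. Note that the overall share of NA then varies from run to run and will usually not be exactly .1.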
set.seed(42)
p <- .1
r2 <- as.data.frame(apply(dat, 1:2, \(x) sample(c(x, NA), 1, prob=c((1 - p), p))))
r2
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 NA 7 NA 10 3 11 4 4 12 7
# 2 NA 6 8 8 4 11 NA 8 10 9
# 3 1 NA NA 5 12 3 10 3 11 1
# 4 3 10 NA 2 NA 9 3 11 2 NA
# 5 8 12 10 12 5 7 2 9 4 NA
# 6 12 NA 9 12 NA 2 7 9 8 8
# 7 7 NA 9 4 2 12 12 3 4 4
# 8 12 5 NA 1 6 1 4 7 6 12
# 9 4 6 12 NA NA 8 4 4 6 7
# 10 3 2 11 3 3 5 4 8 2 4
mean(is.na(r2))
# [1] 0.16
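The cell-wise version can also be written without apply by building a logical mask in one step. The following is a sketch under assumptions, not the answerer's method: `toy`, `vals`, and `mask` are hypothetical names, and the first column is treated as an id that must stay intact.

```r
set.seed(42)
p <- 0.1

# Hypothetical toy data with an id column in first position
toy <- data.frame(
  id = paste0("s", 1:50),
  t0 = rnorm(50),
  t1 = rnorm(50)
)

vals <- toy[-1]  # everything except the id column

# One uniform draw per cell; TRUE marks the cells that become NA
mask <- matrix(runif(nrow(vals) * ncol(vals)) < p, nrow = nrow(vals))
vals[mask] <- NA

res <- cbind(toy[1], vals)
mean(is.na(res[-1]))  # close to p on average, but random
```

Since each cell is drawn independently, the resulting NA share fluctuates around `p` rather than matching it exactly.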
If the data can be coerced with as.matrix, you could treat it like a vector and sample exactly 10% of all cells:
set.seed(42)
m <- as.matrix(dat)
m[sample(seq_along(m), .1*length(m))] <- NA
m
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# [1,] 6 7 1 10 3 11 4 NA 12 7
# [2,] 6 6 8 8 4 11 10 8 10 9
# [3,] 1 12 4 5 12 3 10 3 11 1
# [4,] 3 10 NA 2 11 9 3 NA 2 11
# [5,] 8 12 NA 12 5 7 NA 9 4 10
# [6,] 12 4 9 12 9 2 7 9 8 8
# [7,] 7 5 9 4 NA 12 12 3 4 4
# [8,] 12 NA 3 1 6 1 4 7 6 12
# [9,] 4 6 12 3 NA 8 4 4 NA 7
# [10,] 3 2 11 3 3 5 4 8 2 NA
mean(is.na(m))
# [1] 0.1
Then coerce back to "data.frame":
dat_na <- as.data.frame(m) |> type.convert(as.is=TRUE)
type.convert takes care of restoring classes like "numeric" and "character", since a matrix can hold only one mode. Note that you may lose attributes in the process.
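When the first column is an id, as in the question, one option is to convert only the remaining columns to a matrix, so the matrix stays numeric and type.convert is not needed. This is a hedged sketch with a made-up data frame `toy`; the names `n_na` and `res` are likewise only illustrative.

```r
set.seed(42)

# Hypothetical toy data: id column plus three numeric columns (30 cells)
toy <- data.frame(
  id = paste0("s", 1:10),
  t0 = rnorm(10),
  t1 = rnorm(10),
  t2 = rnorm(10)
)

m <- as.matrix(toy[-1])          # numeric matrix without the id column
n_na <- round(0.1 * length(m))   # number of cells to blank out
m[sample(length(m), n_na)] <- NA # linear indexing over all cells

# Reattach the untouched id column
res <- cbind(toy[1], as.data.frame(m))
mean(is.na(res[-1]))
```

This keeps the overall share of NA across the non-id cells fixed at 10%, although individual columns may receive more or fewer of them.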
Data:
dat <- structure(list(X1 = c(6L, 6L, 1L, 3L, 8L, 12L, 7L, 12L, 4L, 3L
), X2 = c(7L, 6L, 12L, 10L, 12L, 4L, 5L, 5L, 6L, 2L), X3 = c(1L,
8L, 4L, 6L, 10L, 9L, 9L, 3L, 12L, 11L), X4 = c(10L, 8L, 5L, 2L,
12L, 12L, 4L, 1L, 3L, 3L), X5 = c(3L, 4L, 12L, 11L, 5L, 9L, 2L,
6L, 5L, 3L), X6 = c(11L, 11L, 3L, 9L, 7L, 2L, 12L, 1L, 8L, 5L
), X7 = c(4L, 10L, 10L, 3L, 2L, 7L, 12L, 4L, 4L, 4L), X8 = c(4L,
8L, 3L, 11L, 9L, 9L, 3L, 7L, 4L, 8L), X9 = c(12L, 10L, 11L, 2L,
4L, 8L, 4L, 6L, 6L, 2L), X10 = c(7L, 9L, 1L, 11L, 10L, 8L, 4L,
12L, 7L, 4L)), class = "data.frame", row.names = c(NA, -10L))
Using Tidyverse:
library(tidyverse)
second_dataset <- artificialLongData %>%
mutate(across(-id,~ifelse(runif(length(.))>0.1,., NA)))
For each column except id, runif draws one uniform random number per element, and comparing it with 0.1 yields a vector of TRUE and FALSE. Each element where the comparison is TRUE (probability 0.9) is retained; each FALSE element (probability 0.1) is replaced with NA.
To modify just one column, e.g. t1, we would use:
artificialLongData %>%
  mutate(t1 = ifelse(runif(length(t1)) > 0.1, t1, NA))
across() applies the function to all specified columns. -id means all columns except id.