How do I add random `NA`s into a data frame

Tags: dataframe, r, apply

I created a data frame with random values:

n <- 50
df <- data.frame(id = seq_len(n),
                 age = sample(20:90, n, replace = TRUE),
                 sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

and would like to introduce a few NA values to simulate real-world data. I am trying to use apply but cannot get it to work. The line

apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})

will retrieve random values alright, but

apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA}) 

will not set them to NA. I have tried with and within, too.

Brute force works:

for (i in seq_len(floor(n/10))) {
  df[sample(seq_len(n), 1), sample(2:ncol(df), 1)] <- NA
}

But I'd prefer to use the apply family.

K Owen - Reinstate Monica asked Jan 01 '14 20:01


3 Answers

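The value of an assignment is the assigned value, so your function returns NA rather than the modified column. Return x within your function: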

> df <- apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )
> tail(df)
      id   age  sex
[45,] "45" "41" NA 
[46,] "46" NA   "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA 
[50,] "50" "74" "f"
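
To see why the explicit return matters, here is a minimal sketch on a plain vector. The names `no_return` and `with_return` and the seed are made up for illustration; they are not part of the question.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
x <- c(10, 20, 30, 40, 50)

# The body ends with an assignment, so the function returns the assigned value (NA)
no_return <- function(x) { x[sample(length(x), 2)] <- NA }

# Ending the body with `x` returns the modified copy of the vector
with_return <- function(x) { x[sample(length(x), 2)] <- NA; x }

res_no  <- no_return(x)
res_yes <- with_return(x)

print(res_no)   # a single NA
print(res_yes)  # the original vector with two entries set to NA
```

In both cases the caller's `x` is untouched, because R modifies a local copy inside the function. Only the returned value carries the change.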
lukeA answered Oct 17 '22 00:10


Using dplyr [1] you could arrive at the desired solution using the following compact syntax:

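set.seed(123)
library("tidyverse")
n <- 50
df <- data.frame(
  id = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)
# row_number(.x) ranks the values of each column (ties broken by position),
# so the ranks are a permutation of 1:n(); matching them against a random
# sample of that range sets that many entries to NA.
mutate(.data = as_tibble(df),
       across(
         .cols = all_of(c("age", "sex")),
         .fns = ~ ifelse(row_number(.x) %in% sample(1:n(), size = (10 * n() / 100)),
                         NA, .x)
       ))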

Results

Approximately 10% of the values in each column are replaced with NA. This follows from sample(1:n(), size = (10 * n() / 100))

count(.Last.value, sex)
#   A tibble: 3 x 2
#   sex       n
#   <chr> <int>
# 1 f        21
# 2 m        24
# 3 NA        5

#  A tibble: 50 x 3
#      id   age sex  
#   <int> <int> <chr>
# 1     1    50 m    
# 2     2    70 m  
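
As a possible variant of the same idea, the sketch below uses base R's `replace()` inside `across()` to set a random 10% of positions in each column to NA. It assumes dplyr 1.0 or later is installed. The name `df_na` is made up for illustration, and the rows that end up NA will differ from the output shown above.

```r
library(dplyr)

set.seed(123)
n <- 50
df <- data.frame(
  id  = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

df_na <- df %>%
  mutate(across(
    c(age, sex),
    # replace() returns a copy of the column with the sampled positions set to NA;
    # a fresh sample is drawn for each column
    ~ replace(.x, sample(n(), floor(n() / 10)), NA)
  ))

# Number of NA values per modified column
summarise(df_na, across(c(age, sex), ~ sum(is.na(.x))))
```

Because `replace()` assigns into the existing vector, each column keeps its original type.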

[1] I'm loading tidyverse as replace_na is available via tidyr.

Konrad answered Oct 17 '22 00:10


apply coerces the data frame to a matrix and returns an array, so all columns are converted to the same type (here, character). You could use this instead:

df[,-1] <- do.call(cbind.data.frame, 
                   lapply(df[,-1], function(x) {
                     x[sample(c(1:n),floor(n/10))]<-NA
                     x
                   })
                   )

Or use a for loop:

for (i in seq_along(df[,-1])+1) {
  is.na(df[sample(seq_len(n), floor(n/10)),i]) <- TRUE
}
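
The column-wise approach can also be wrapped in a small function. This is only a sketch: `add_random_na`, its `frac` argument, and the seed are made-up names and values, not part of the original answer.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 50
df <- data.frame(
  id  = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

# Hypothetical helper: set a fraction `frac` of each selected column to NA
add_random_na <- function(data, cols, frac = 0.1) {
  k <- floor(nrow(data) * frac)
  # Assigning a list back into data[cols] keeps each column's own type
  data[cols] <- lapply(data[cols], function(x) {
    x[sample(length(x), k)] <- NA
    x
  })
  data
}

df_na <- add_random_na(df, c("age", "sex"))

colSums(is.na(df_na))  # id: 0, age: 5, sex: 5
sapply(df_na, class)   # column classes are unchanged
```

Unlike apply, this never passes through a matrix, so age stays integer and id is left alone.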
Roland answered Oct 16 '22 22:10