
Replace NA in a data frame with random numbers within a range

Tags:

dataframe

r

I have the following data frame named cars:

Brand      year     mpg        reputation      Luxury
Honda      2010     30            8.5            0.5
Honda      2011     28            8.5            0.6
Dodge      2010     20            6.5            0.6
Dodge      2011     23            7.0            0.7
Mercedes   2010     22            9.5            NA
Mercedes   2011     25            9.0            NA

I want to replace the NA values with randomly generated real numbers between 0.9 and 1.0.

I tried the following, but it replaces every NA with the number 0.9:

cars[is.na(cars)] <-  sample(0.9:1, sum(is.na(cars)),replace=TRUE)

The desired result would look something like this:

Brand      year     mpg        reputation      Luxury
Honda      2010     30            8.5            0.5
Honda      2011     28            8.5            0.6
Dodge      2010     20            6.5            0.6
Dodge      2011     23            7.0            0.7
Mercedes   2010     22            9.5           *0.91*
Mercedes   2011     25            9.0           *0.97*

Code for data structure:

cars <- structure(list(Brand = c("Honda", "Honda", "Dodge", "Dodge", "Mercedes", "Mercedes"),
   year = c(2010L, 2011L, 2010L, 2011L, 2010L, 2011L),
   mpg = c(30L, 28L, 20L, 23L, 22L, 25L), reputation = c(8.5, 8.5, 6.5, 7, 9.5, 9),
   # Luxury is NA for the two Mercedes rows, matching the table above
   Luxury = c(0.5, 0.6, 0.6, 0.7, NA, NA)),
  class = "data.frame", row.names = c(NA, -6L))

JodeCharger100 asked Jan 01 '23 01:01


2 Answers

Use runif instead of sample. runif draws real numbers from a continuous uniform distribution on the given interval:

cars[is.na(cars)] <-  runif(sum(is.na(cars)), min = 0.9, max = 1)
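
As a minimal, reproducible sketch of the same idea: the data frame below is rebuilt from the table in the question. The `set.seed()` call and the restriction to the `Luxury` column are assumptions added here, not part of the original answer. Limiting the fill to one column leaves NAs in any other column untouched.

```r
# Example data, rebuilt from the table in the question
cars <- data.frame(
  Brand = c("Honda", "Honda", "Dodge", "Dodge", "Mercedes", "Mercedes"),
  year = c(2010L, 2011L, 2010L, 2011L, 2010L, 2011L),
  mpg = c(30L, 28L, 20L, 23L, 22L, 25L),
  reputation = c(8.5, 8.5, 6.5, 7, 9.5, 9),
  Luxury = c(0.5, 0.6, 0.6, 0.7, NA, NA)
)

set.seed(42)  # assumption: fixed seed so the draw can be repeated

# Only touch the Luxury column; NAs in other columns are left alone
missing <- is.na(cars$Luxury)
cars$Luxury[missing] <- runif(sum(missing), min = 0.9, max = 1)

cars
```

Each NA gets its own independent draw, because `runif()` is asked for exactly as many values as there are missing entries.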

Cettt answered Jan 08 '23 00:01


That is because 0.9:1 produces only one number, 0.9. The colon operator counts up from 0.9 in steps of 1, and the next value (1.9) would already exceed 1. Try:

0.9:1
#[1] 0.9

Hence, every NA is replaced by 0.9.
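
To make this concrete, here is a small sketch of what `sample()` actually receives in the original attempt. The variable names `x` and `draws` are only illustrative.

```r
# `from:to` counts up from `from` in steps of 1 and stops before passing `to`
x <- 0.9:1
x          # a single value, 0.9
length(x)  # 1

# With only one candidate value below 1, sample() can only return that value
draws <- sample(x, 5, replace = TRUE)
draws      # 0.9 repeated five times
```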

Suppose instead you want to draw from a sequence of values in steps of 0.01:

vals <- seq(0.9, 1, 0.01)
vals
#[1] 0.90 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00

Now we can sample from this sequence:

df[is.na(df)] <- sample(vals, sum(is.na(df)), replace = TRUE)

df
#     Brand year mpg reputation Luxury
#1    Honda 2010  30        8.5   5.00
#2    Honda 2011  28        8.5   5.50
#3    Dodge 2010  20        6.5   6.00
#4    Dodge 2011  23        7.0   6.50
#5 Mercedes 2010  22        9.5   0.91
#6 Mercedes 2011  25        9.0   0.92
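
If values on a 0.01 grid are wanted without building the sequence first, one possible alternative (a sketch only, not part of the answer above) is to round continuous draws from `runif()`. Be aware that rounding makes the two endpoints, 0.90 and 1.00, roughly half as likely as the interior values, whereas `sample(vals, ...)` treats all eleven values equally. The seed and sample size below are arbitrary assumptions.

```r
set.seed(1)  # assumption: seed chosen only for repeatability

n <- 1000
rounded <- round(runif(n, min = 0.9, max = 1), 2)

# All values lie on the 0.01 grid between 0.90 and 1.00
range(rounded)

# Endpoint counts tend to be roughly half of the interior counts
table(rounded)
```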

data

df <- structure(list(Brand = structure(c(2L, 2L, 1L, 1L, 3L, 3L), 
.Label = c("Dodge", 
"Honda", "Mercedes"), class = "factor"), year = c(2010L, 2011L, 
2010L, 2011L, 2010L, 2011L), mpg = c(30L, 28L, 20L, 23L, 22L, 
25L), reputation = c(8.5, 8.5, 6.5, 7, 9.5, 9), Luxury = c(5, 
5.5, 6, 6.5, NA, NA)), class = "data.frame", row.names = c(NA, -6L))

Ronak Shah answered Jan 08 '23 02:01