Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reasons for using the set.seed function

Tags:

random

r

The need is the possible desire for reproducible results, which may for example come from trying to debug your program, or of course from trying to redo what it does:

These two results we will "never" reproduce as I just asked for something "random":

R> sample(LETTERS, 5)
[1] "K" "N" "R" "Z" "G"
R> sample(LETTERS, 5)
[1] "L" "P" "J" "E" "D"

These two, however, are identical because I set the seed:

R> set.seed(42); sample(LETTERS, 5)
[1] "X" "Z" "G" "T" "O"
R> set.seed(42); sample(LETTERS, 5)
[1] "X" "Z" "G" "T" "O"
R> 

There is vast literature on all that; Wikipedia is a good start. In essence, these RNGs are called Pseudo Random Number Generators because they are in fact fully algorithmic: given the same seed, you get the same sequence. And that is a feature and not a bug.


You have to set seed every time you want to get a reproducible random result.

set.seed(1)
rnorm(4)
set.seed(1)
rnorm(4)

Just adding some addition aspects. Need for setting seed: In the academic world, if one claims that his algorithm achieves, say 98.05% performance in one simulation, others need to be able to reproduce it.

?set.seed

Going through the help file of this function, these are some interesting facts:

(1) set.seed() returns NULL, invisible

(2) "Initially, there is no seed; a new one is created from the current time and the process ID when one is required. Hence different sessions will give different simulation results, by default. However, the seed might be restored from a previous session if a previously saved workspace is restored.", this is why you would want to call set.seed() with same integer values the next time you want a same sequence of random sequence.


Fixing the seed is essential when we try to optimize a function that involves randomly generated numbers (e.g. in simulation based estimation). Loosely speaking, if we do not fix the seed, the variation due to drawing different random numbers will likely cause the optimization algorithm to fail.

Suppose that, for some reason, you want to estimate the standard deviation (sd) of a mean-zero normal distribution by simulation, given a sample. This can be achieved by running a numerical optimization around steps

  1. (Setting the seed)
  2. Given a value for sd, generate normally distributed data
  3. Evaluate the likelihood of your data given the simulated distributions

The following functions do this, once without step 1., once including it:

# without fixing the seed
simllh <- function(sd, y, Ns){
  simdist <- density(rnorm(Ns, mean = 0, sd = sd))
  llh <- sapply(y, function(x){ simdist$y[which.min((x - simdist$x)^2)] })
  return(-sum(log(llh)))
}
# same function with fixed seed
simllh.fix.seed <- function(sd,y,Ns){
  set.seed(48)
  simdist <- density(rnorm(Ns,mean=0,sd=sd))
  llh <- sapply(y,function(x){simdist$y[which.min((x-simdist$x)^2)]})
  return(-sum(log(llh)))
}

We can check the relative performance of the two functions in discovering the true parameter value with a short Monte Carlo study:

N <- 20; sd <- 2 # features of simulated data
est1 <- rep(NA,1000); est2 <- rep(NA,1000) # initialize the estimate stores
for (i in 1:1000) {
  as.numeric(Sys.time())-> t; set.seed((t - floor(t)) * 1e8 -> seed) # set the seed to random seed
  y <- rnorm(N, sd = sd) # generate the data
  est1[i] <- optim(1, simllh, y = y, Ns = 1000, lower = 0.01)$par
  est2[i] <- optim(1, simllh.fix.seed, y = y, Ns = 1000, lower = 0.01)$par
}
hist(est1)
hist(est2)

The resulting distributions of the parameter estimates are:

Histogram of parameter estimates without fixing the seedHistogram of parameter estimates fixing the seed

When we fix the seed, the numerical search ends up close to the true parameter value of 2 far more often.


basically set.seed() function will help to reuse the same set of random variables , which we may need in future to again evaluate particular task again with same random varibales

we just need to declare it before using any random numbers generating function.