
Weird behavior of using set.seed multiple times

I came across a strange result while doing my homework in R. Can anyone explain what's going on?

The instructions told me to set the seed to 1 for consistency.

At first, I called set.seed(1) twice:

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
set.seed(1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon -1
plot(x,y,main = "Scatter plot between X and Y", xlab = "X", ylab = "Y")

I get a scatter plot like this: (image: the plot with two set.seed calls)

When I call set.seed only once, the code is:

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon -1
plot(x,y,main = "Scatter plot between X and Y", xlab = "X", ylab = "Y")

The plot became reasonable: (image: the plot with one set.seed call)

Can anyone explain why adding an extra "set.seed(1)" makes the two results different?

SamCXLG asked Dec 05 '22 14:12


2 Answers

set.seed() determines the random numbers that will be generated afterwards. It is generally used to create reproducible examples, so that if we both run the same code, we get the same results. To illustrate:

set.seed(1234)
runif(3)
[1] 0.1137034 0.6222994 0.6092747

set.seed(1234)
runif(3)
[1] 0.1137034 0.6222994 0.6092747

set.seed(12345)
runif(3)
[1] 0.7209039 0.8757732 0.7609823

So as you can see, when you call set.seed(x) twice with the same number, you generate the same random numbers from that point on. (This holds for variables with the same distribution. For others, see the elaboration below.) So the reason you are getting a straight line in the first plot is that

y <- 0.5 * x + epsilon -1

actually becomes

y <- 0.5 * x + 0.25 * x -1

because epsilon is built from the same sequence of random numbers as x, only scaled by its sd of 0.25 (see the elaboration below). That reduces to

y <- 0.75 * x -1

And that is a simple linear equation.

So in general, you should call set.seed(x) only once, at the beginning of your script.
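As a rough check of that advice, the sketch below compares the two situations from the question. The variable names epsilon_fresh and epsilon_reseeded are made up for this illustration; only set.seed, rnorm and cor from base R are used.

```r
# Draw x, then draw epsilon by simply continuing the same random stream.
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
epsilon_fresh <- rnorm(100, mean = 0, sd = 0.25)

# Restart the stream before drawing epsilon, as in the first version of the script.
set.seed(1)
epsilon_reseeded <- rnorm(100, mean = 0, sd = 0.25)

# Reseeding makes epsilon a rescaled copy of x, so the correlation is 1
# (up to floating-point error).
cor(x, epsilon_reseeded)

# Without reseeding, epsilon is drawn from later numbers in the stream,
# so the correlation with x should be far from 1.
cor(x, epsilon_fresh)
```

With a single set.seed call the script is still reproducible from run to run, but x and epsilon no longer share the same underlying draws.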


Elaboration on the comment: "But I generated the Epsilon with different sd, why would that still be the same x, although the plot seems to agree with the explanation?"

That's actually a really interesting question. Random numbers with distribution ~N(mean,sd) are usually generated as follows:

  1. Random uniform numbers are generated.
  2. A transformation is applied to these numbers to make them standard normal; let's call the results X. (R's default method is inversion of the normal CDF; the Box-Muller transformation is another well-known choice.)
  3. These numbers are transformed once more by applying the transformation sd * X + mean.

When you run this twice with the same seed but a different mean and sd, the first two steps will create exactly the same results, since the random numbers generated are the same, and the mean and sd are not used yet. Only in the third step do the mean and sd come into play. We can easily verify this:

set.seed(1)
rnorm(4, mean = 0, sd = 1)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808
set.seed(1)
rnorm(4, mean = 0, sd = 0.25)
[1] -0.15661345  0.04591083 -0.20890715  0.39882020

Indeed, the random numbers generated the second time are exactly 0.25 times the numbers generated the first time.
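The three steps can also be sketched by hand. my_rnorm below is a hypothetical helper written for this illustration, not how rnorm is actually implemented, so its values need not match those of rnorm for the same seed. It only assumes the inverse-CDF approach for step 2, using base R's runif and qnorm.

```r
# Hypothetical helper: a simplified stand-in for the three steps, NOT R's real rnorm.
my_rnorm <- function(n, mean = 0, sd = 1) {
  u <- runif(n)     # step 1: uniform random numbers on (0, 1)
  z <- qnorm(u)     # step 2: map them to standard normal values (inverse CDF)
  sd * z + mean     # step 3: scale by sd and shift by mean
}

set.seed(1)
a <- my_rnorm(4, mean = 0, sd = 1)
set.seed(1)
b <- my_rnorm(4, mean = 0, sd = 0.25)

# Same seed means the same u and z, so b is just a scaled by 0.25.
all.equal(b, 0.25 * a)  # TRUE
```

The mean and sd only enter in the last line of the helper, which is why changing them cannot change the underlying draws.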

So in my explanation above, epsilon is actually 0.25*x, and your resulting function is y <- 0.75 * x - 1, which is still just a linear function.
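This can be checked directly on the code from the question. The sketch below reruns the two-seed version and compares the results with all.equal, which allows for floating-point rounding.

```r
# The two-seed version from the question.
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
set.seed(1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon - 1

# epsilon is x scaled by 0.25 ...
all.equal(epsilon, 0.25 * x)   # TRUE

# ... so y collapses to a straight line in x.
all.equal(y, 0.75 * x - 1)     # TRUE
```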

Florian answered Dec 21 '22 15:12


Why the results were different: when set.seed is called once and runif is run twice, the second call returns new numbers -

set.seed(123)
runif(3)
[1] 0.2875775 0.7883051 0.4089769
runif(3)
[1] 0.8830174 0.9404673 0.0455565

Whereas when set.seed is called again with the same value and six numbers are drawn in one call, the results are the same six numbers -

set.seed(123)
runif(6)
[1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565

So, when the seed is set only once, each call continues where the previous one left off and uses the next available numbers in the sequence. Setting the seed again restarts the sequence from the beginning.
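A small sketch that checks both behaviours with identical (the variable names are only for illustration):

```r
# One seed, two calls: the second call continues the sequence.
set.seed(123)
two_calls <- c(runif(3), runif(3))

# Same seed, one call of six: the same six numbers.
set.seed(123)
one_call <- runif(6)
identical(two_calls, one_call)  # TRUE

# Resetting the seed restarts the sequence.
set.seed(123)
first <- runif(3)
set.seed(123)
again <- runif(3)
identical(first, again)         # TRUE
```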

vinay answered Dec 21 '22 16:12