I got a strange result while doing my homework in R. Can anyone explain what's going on?
The instructions told me to set the seed to 1 for consistency.
At first, I called set.seed(1) twice:
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
set.seed(1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon -1
plot(x,y,main = "Scatter plot between X and Y", xlab = "X", ylab = "Y")
I get a scatter plot like this: [image: the plot with two set.seed calls]
When I call set.seed only once, the code is:
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon -1
plot(x,y,main = "Scatter plot between X and Y", xlab = "X", ylab = "Y")
The plot became reasonable: [image: the plot with one set.seed call]
Can anyone explain why adding an extra set.seed(1) makes the two results different?
set.seed() determines the random numbers that will be generated afterwards. It is generally used to create reproducible examples, so that if we both run the same code, we get the same results. To illustrate:
set.seed(1234)
runif(3)
[1] 0.1137034 0.6222994 0.6092747
set.seed(1234)
runif(3)
[1] 0.1137034 0.6222994 0.6092747
set.seed(12345)
runif(3)
[1] 0.7209039 0.8757732 0.7609823
So as you can see, when you call set.seed(x) twice with the same number, you generate the same random numbers from that point on. (This holds for variables with the same distribution; for others, see the elaboration below.) So the reason you are getting a straight line in the first plot is that
y <- 0.5 * x + epsilon -1
actually becomes
y <- 0.5 * x + x -1
because you are using the same sequence of random numbers two times. That reduces to
y <- 1.5 * x -1
And that is a simple linear equation.
So in general, you should call set.seed(x) only once, at the beginning of your script.
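As a minimal sketch of that advice, reusing the variable names from the question: the seed is set once, both vectors are drawn from the same stream, and two checks confirm that epsilon is no longer a rescaled copy of x while the script as a whole stays reproducible. The names rescaled_copy and x_again are introduced here only for illustration.

```r
# Seed once, then draw both vectors from the same random stream.
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)

# epsilon is built from the values that follow x in the stream,
# so it should not be a rescaled copy of x.
rescaled_copy <- isTRUE(all.equal(epsilon, 0.25 * x))
print(rescaled_copy)

# Starting again from the same seed reproduces x exactly.
set.seed(1)
x_again <- rnorm(100, mean = 0, sd = 1)
print(identical(x, x_again))
```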
Elaboration on the comment: "But I generated the Epsilon with different sd, why would that still be the same x, although the plot seems to agree with the explanation?"
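That's actually a really interesting question. Random numbers with distribution N(mean, sd) are usually generated roughly as follows:
1. Generate uniform random numbers from the underlying generator.
2. Transform them into a standard normal value X ~ N(0, 1).
3. Compute sd * X + mean.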
When you run this twice with the same seed but a different mean and sd, the first two steps will create exactly the same results, since the random numbers generated are the same, and the mean and sd are not used yet. Only in the third step do the mean and sd come into play. We can easily verify this:
set.seed(1)
rnorm(4, mean = 0, sd = 1)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808
set.seed(1)
rnorm(4, mean = 0, sd = 0.25)
[1] -0.15661345 0.04591083 -0.20890715 0.39882020
Indeed, the random numbers generated the second time are exactly 0.25 times the numbers generated the first time.
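The same idea can be sketched with a hand-written generator. This is only an illustration under stated assumptions: my_rnorm is a hypothetical helper, not how rnorm() is implemented internally, and it will not reproduce rnorm()'s exact values. It uses the inverse-CDF transform (qnorm) to show that the seed fixes the underlying draws and that mean and sd are applied only at the end.

```r
# Hypothetical helper for illustration; NOT the internals of rnorm().
my_rnorm <- function(n, mean = 0, sd = 1) {
  u <- runif(n)   # uniform draws: depend only on the seed
  z <- qnorm(u)   # standard normal values via the inverse CDF
  sd * z + mean   # mean and sd enter only here
}

set.seed(1)
a <- my_rnorm(4, mean = 0, sd = 1)

set.seed(1)
b <- my_rnorm(4, mean = 0, sd = 0.25)

# Same seed, different sd: b is just a rescaled version of a.
print(isTRUE(all.equal(b, 0.25 * a)))
```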
So in my explanation above, epsilon is actually 0.25*x, and your resulting function is y <- 0.75 * x - 1, which is still just a linear function.
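A quick check of this, using the code from the question with the seed set twice. The object fit is a name introduced here for illustration; the fitted coefficients should come out as an intercept of -1 and a slope of 0.75, up to floating-point error.

```r
# Reproduce the question's first script: the seed is reset before epsilon.
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
set.seed(1)
epsilon <- rnorm(100, mean = 0, sd = 0.25)
y <- 0.5 * x + epsilon - 1

# epsilon is a rescaled copy of x, so y is an exact linear function of x.
print(isTRUE(all.equal(epsilon, 0.25 * x)))
print(isTRUE(all.equal(y, 0.75 * x - 1)))

# A linear fit recovers the intercept and slope of that line.
fit <- lm(y ~ x)
print(coef(fit))
```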
Why the results were different: when set.seed is called once and runif is run twice, the second call continues from where the first one stopped:
set.seed(123)
runif(3)
[1] 0.2875775 0.7883051 0.4089769
runif(3)
[1] 0.8830174 0.9404673 0.0455565
Whereas when the seed is set again and six numbers are drawn in a single call, the results are:
set.seed(123)
runif(6)
[1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565
So when the seed is set only once, each call continues with the next numbers in the same sequence rather than starting over.
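This can be confirmed directly. In the sketch below, first, second and all_six are names introduced only for illustration; the check compares two consecutive draws of three numbers against a single draw of six from the same seed.

```r
# Seed once, then draw twice: the second call continues the stream.
set.seed(123)
first <- runif(3)
second <- runif(3)

# Reset the seed and draw all six numbers in one call.
set.seed(123)
all_six <- runif(6)

# The two consecutive draws together match the single draw of six.
print(identical(c(first, second), all_six))
```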