I am trying to generate a column in a tbl_df that holds a random integer, either 0 or 1. This is the code I am using:
library(dplyr)
set.seed(0)
#Dummy data.frame to test
df <- tbl_df(data.frame(x = rep(1:3, each = 4)))
#Generate the random integer column
df_test = df %>%
  mutate(pop=sample(0:1, 1, replace=TRUE))
But this does not work the way I expected: every value in the generated column is zero. Is this because the statement within mutate is evaluated in parallel and hence ends up using the same seed for the first random draw?
df_test
Source: local data frame [12 x 2]
x pop
1 1 0
2 1 0
3 1 0
4 1 0
5 2 0
6 2 0
7 2 0
8 2 0
9 3 0
10 3 0
11 3 0
12 3 0
I have been racking my brain over this for the past few hours. Any idea what the flaw in my script is?
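The way your code is written, sample(0:1, 1, replace = TRUE) is evaluated once and returns a single value, and mutate() then assigns that one value to every row of the column (this is called "vector recycling"). It is not caused by parallel evaluation or by the seed.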
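To see the recycling in isolation, here is a minimal base R sketch. The names one_draw and d are only illustrative and do not come from the question:

```r
set.seed(0)

# A single draw: sample() returns a vector of length 1
one_draw <- sample(0:1, 1, replace = TRUE)
length(one_draw)

# Assigning a length-1 value to a column repeats it for every row
d <- data.frame(x = rep(1:3, each = 4))
d$pop <- one_draw

# Only one distinct value ends up in the column
length(unique(d$pop))
```

Whatever the single draw happens to be, all 12 rows receive it, which is the same thing that happens inside mutate().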
The best solution in this case is Steven Beaupré's answer, creating a randomized vector the length of your data.frame:
df %>%
  mutate(pop = sample(0:1, n(), replace = TRUE))
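As a quick sanity check, the sketch below confirms that n() yields one draw per row. The rbinom() variant is an alternative I am adding on the assumption that 0 and 1 should be equally likely; it is not part of the original answer:

```r
library(dplyr)

set.seed(0)
df <- data.frame(x = rep(1:3, each = 4))

# n() is the number of rows in the current group (here, the whole table),
# so sample() returns one draw per row
df_test <- df %>%
  mutate(pop = sample(0:1, n(), replace = TRUE))

# Tabulate the draws to see how many 0s and 1s came out
table(df_test$pop)

# Assumed alternative: one Bernoulli(0.5) draw per row
df_alt <- df %>%
  mutate(pop = rbinom(n(), size = 1, prob = 0.5))
```

On a grouped data frame, n() would be the size of each group instead, so each group would still get one draw per row.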
Generally, if you want to apply a function row-by-row in dplyr (as you thought would happen here), you can use rowwise(), though in this example it's not required.
Here's an example of rowwise():
df2 <- data.frame(a = c(1,3,6), b = c(2,4,5))
df2 %>%
mutate(m = max(a,b))
a b m
1 1 2 6
2 3 4 6
3 6 5 6
df2 %>%
rowwise() %>%
mutate(m = max(a,b))
a b m
1 1 2 2
2 3 4 4
3 6 5 6
Since rowwise() groups the data by each row, operations are potentially slower than without any grouping. Therefore, it's usually better to use vectorized functions whenever possible instead of operating row-by-row.
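For the max() example above, a vectorized equivalent could look like the following sketch, which uses base R's pmax() (element-wise maximum) in place of rowwise(). The name res is only illustrative:

```r
library(dplyr)

df2 <- data.frame(a = c(1, 3, 6), b = c(2, 4, 5))

# pmax() compares a and b element by element, so no rowwise() is needed
res <- df2 %>%
  mutate(m = pmax(a, b))

res$m  # 2 4 6
```

This gives the same column m as the rowwise() version, without creating one group per row.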
Benchmarking:
In this benchmark, the approach with rowwise() is roughly 30 to 40 times slower:
library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
df <- tbl_df(data.frame(x = rep(1:1000, each = 4)))
bench <- microbenchmark(
vectorized = df2 <- df %>% mutate(pop = sample(0:1, n(), replace = TRUE)),
rowwise = df2 <- df %>% rowwise() %>% mutate(pop = sample(0:1, 1, replace = TRUE)),
times = 1000
)
options(microbenchmark.unit="relative")
print(bench)
autoplot(bench)
Unit: relative
expr min lq mean median uq max neval
vectorized 1.00000 1.00000 1.00000 1.00000 1.00000 1.0000 1000
rowwise 42.53169 42.29486 36.94876 33.70456 34.92621 71.7682 1000