
dplyr: Integer sampling within mutate

Tags: r, dplyr

I am trying to generate a column in a tbl_df that is a random integer of 0 or 1. This is the code I am using:

library(dplyr)
set.seed(0)

#Dummy data.frame to test
df <- tbl_df(data.frame(x = rep(1:3, each = 4)))

#Generate the random integer column
df_test = df %>% 
  mutate(pop=sample(0:1, 1, replace=TRUE))

But this does not seem to work the way I expected. The field I generated seems to be all zeros. Is this because the statement within mutate is evaluated in parallel and hence ends up using the same seed for the first random draw?

df_test 
Source: local data frame [12 x 2]

   x pop
1  1   0
2  1   0
3  1   0
4  1   0
5  2   0
6  2   0
7  2   0
8  2   0
9  3   0
10 3   0
11 3   0
12 3   0

I have been racking my brain over this for the past few hours. Any idea what the flaw in my script is?

asked Apr 26 '15 22:04 by sriramn


1 Answer

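The way your code is written, sample(0:1, 1, replace = TRUE) is evaluated once and returns a single value, which mutate() then assigns to every row of the column (this is called "vector recycling"). It has nothing to do with parallel evaluation or the seed.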
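To see the recycling directly, here is a minimal sketch (the names one_draw and recycled are just illustrative, and the dummy data mirrors the question's):

```r
library(dplyr)

df <- data.frame(x = rep(1:3, each = 4))

# sample() with size 1 returns a length-1 vector ...
one_draw <- sample(0:1, 1, replace = TRUE)
length(one_draw)

# ... and mutate() recycles that single value to all 12 rows,
# so the new column can only ever contain one distinct value.
recycled <- df %>% mutate(pop = sample(0:1, 1, replace = TRUE))
n_distinct(recycled$pop)
```

Whether that one value is 0 or 1 depends on the seed, but every row gets the same one.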

The best solution in this case is Steven Beaupré's answer, creating a randomized vector the length of your data.frame:

df %>% 
  mutate(pop = sample(0:1, n(), replace = TRUE))
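
As a side note, n() is the number of rows in the current group, so the same pattern should carry over to grouped data. A small sketch, assuming you wanted the draws made per group of x (the name grouped is illustrative):

```r
library(dplyr)

df <- data.frame(x = rep(1:3, each = 4))

# Under group_by(), n() is the size of the current group (4 here),
# so each group gets its own 4 independent draws.
grouped <- df %>%
  group_by(x) %>%
  mutate(pop = sample(0:1, n(), replace = TRUE)) %>%
  ungroup()

nrow(grouped)  # one draw per row, so the row count is unchanged
```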

Generally, if you want to apply a function row-by-row in dplyr - as you thought would happen here - you can use rowwise(), though in this example it's not required.

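Here's an example of rowwise(). Without it, max(a, b) returns the single maximum over both whole columns; with it, the maximum is computed separately for each row: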

df2 <- data.frame(a = c(1,3,6), b = c(2,4,5))

df2 %>%
  mutate(m = max(a,b))

  a b m
1 1 2 6
2 3 4 6
3 6 5 6

df2 %>%
  rowwise() %>%
  mutate(m = max(a,b))

  a b m
1 1 2 2
2 3 4 4
3 6 5 6

Since rowwise() groups the data by each row, operations are potentially slower than without any grouping. Therefore, it's usually better to use vectorized functions whenever possible instead of operating row-by-row.
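
For the max() example above, base R's pmax() is one such vectorized function: it takes the element-wise maximum of its arguments, so no rowwise() is needed. A minimal sketch (the name res is illustrative):

```r
library(dplyr)

df2 <- data.frame(a = c(1, 3, 6), b = c(2, 4, 5))

# pmax() compares a and b element by element, giving one maximum per row
res <- df2 %>%
  mutate(m = pmax(a, b))

res$m  # 2 4 6, the same values as the rowwise() version
```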


Benchmarking:

In this benchmark, the approach with rowwise() is roughly 35x slower (median about 34x, mean about 37x):

library(dplyr)
library(ggplot2)        # provides the autoplot() generic used below
library(microbenchmark)

# Larger test table: 4000 rows
df <- tbl_df(data.frame(x = rep(1:1000, each = 4)))

# Report timings relative to the fastest expression
options(microbenchmark.unit = "relative")

bench <- microbenchmark(
  vectorized = df %>% mutate(pop = sample(0:1, n(), replace = TRUE)),
  rowwise = df %>% rowwise() %>% mutate(pop = sample(0:1, 1, replace = TRUE)),
  times = 1000
)

print(bench)
autoplot(bench)

Unit: relative
       expr      min       lq     mean   median       uq     max neval
 vectorized  1.00000  1.00000  1.00000  1.00000  1.00000  1.0000  1000
    rowwise 42.53169 42.29486 36.94876 33.70456 34.92621 71.7682  1000
answered Sep 22 '22 21:09 (7 revs, 2 users 84%)