Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining dplyr::mutate with lubridate::ymd_hms in R randomly causes segfault

I've searched as best I could for something related to this but haven't found anything on SO or dplyr github; possibly a new issue as the code below has worked fine prior to today?

The problem is conceptually simple: calling my_data %>% mutate(x = ymd_hms(x)) sometimes, but not always (i.e. randomly), causes R to crash with a caught segfault. I've reduced the problem to the simplest form I can (also here: https://gist.github.com/john-sandall/05c3abb24fc738ddc2ad):

require(lubridate)
require(dplyr)

set.seed(42)
make_some_random_datetimes = function(n) ymd("2015-01-01") + seconds(runif(n, min=0, max=60*60*24*365))

d = data.frame(
  col1 = make_some_random_datetimes(5000),
  col2 = make_some_random_datetimes(5000)
)

do_it = function() {
  d %>% mutate(
    col1 = ymd_hms(col1),
    col2 = ymd_hms(col2)  # for some reason, it only crashes when evaluating 2+ cols, if we removed this line it'd be fine
  )
  return(TRUE)
}

do_it()  # doesn't crash every time...it fails every nth time where n is randomly distributed with mean of roughly 7.7

do_it_lots_of_times = function(n) for (i in 1:n) do_it()

do_it_lots_of_times(50)  # almost guaranteed to fail on my machine

So at some point, running do_it() above causes a segfault, output running R in Terminal is

*** caught segfault ***
address 0x0, cause 'unknown'

I upgraded to R version 3.2.1 this morning, although rolling back to 3.2.0 and re-installing libraries doesn't help. I then tried uninstalling/re-installing R (using brew install r with fully updated/upgraded homebrew), and then re-installed all of the required packages above. Here's the output of sessionInfo():

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin14.3.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.2     lubridate_1.3.3

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 R6_2.0.1        assertthat_0.1  magrittr_1.5    plyr_1.8.3      parallel_3.2.1 
[7] DBI_0.3.1       tools_3.2.1     memoise_0.2.1   Rcpp_0.11.6     stringi_0.5-2   digest_0.6.8   
[13] stringr_1.0.0  

Being a statistician and running out of ideas, I decided to look at the distribution of the failure rate, see if that might help shed some light. If running do_it() above crashes on the nth time, and I wrote down where n falls for 50 crashes (e.g. the 3rd try, then the 7th try), I get this sequence:

3, 7, 9, 20, 9, 9, 9, 7, 4, 23, 6, 3, 3, 3, 7, 7, 3, 9, 6, 6, 7, 10, 13, 7, 3, 7, 4, 7, 9, 6, 7, 7, 6, 6, 7, 7, 7, 9, 6, 12, 7, 7, 5, 9, 18, 6, 7, 9, 9, 7

Which gives me this distribution:

I don't know if that's relevant or helps at all, although the other thing I noticed is increasing the number of rows in the dataframe d from 5000 to 10000 seems to increase the mean of n from ~8 to ~20.

Any help on this would be extremely welcome!

like image 869
John Sandall Avatar asked Jun 23 '15 18:06

John Sandall


1 Answers

90% sure this is a bug in latest version of dplyr (0.4.2), see this issue here: https://github.com/hadley/dplyr/issues/1231

Downgrading my version of dplyr to 0.4.1 as follows fixes the issue:

packageurl = "http://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_0.4.1.tar.gz"
install.packages(packageurl, repos=NULL, type="source", dependencies = TRUE)
like image 133
John Sandall Avatar answered Sep 27 '22 17:09

John Sandall