Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

segfault in R using reshape2 package and dcast

RStudio was crashing when I tried to reshape a particular data frame using dcast (from the reshape2 package). I discovered that the crash was actually happening in R itself, so I ran my casting code in R.app and got the type of error that gives this site its name: Error: segfault from C stack overflow. With the help of Google and SO, I learned that this is a memory access error.

Okay, I got that far, but I don't know where to go from here. I can't provide a true reproducible example, because my data frame is about 558,000 rows and the problem doesn't occur on small toy examples. For example, even if I take, say, a 50,000-row subset of the data, dcast works just fine. Could there be a particular row of data that's causing a problem? If so, can anyone suggest what feature(s) to look for that could be causing the type of error I'm getting?

Here is a subset of the data frame I'm casting from (with fake values for some variables), followed by the casting function I'm using. I've also included this small snippet of data in a dput function below, in case it would be helpful to play around with it. The real data set has about 700 values of prog, 15 values of prog1, and 5 values of fa.type.

  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707

fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)

fa = structure(list(id = c(1, 1, 2, 2, 3, 3), term = structure(c(7L, 
8L, 7L, 8L, 1L, 1L), .Label = c("Fall 2007", "Spring 2008", "Summer 2008", 
"Fall 2008", "Spring 2009", "Summer 2009", "Fall 2009", "Spring 2010", 
"Summer 2010", "Fall 2010", "Spring 2011", "Summer 2011", "Fall 2011", 
"Spring 2012", "Summer 2012", "Fall 2012", "Spring 2013"), class = c("ordered", 
"factor")), yr = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L), 
    nslds = structure(c(7L, 7L, 7L, 7L, 7L, 7L), .Label = c("1st Year, Never Attended", 
    "1st Year, Previously Attended", "2nd Year", "3rd Year", 
    "4th Year", "5th Year+", "Graduate"), class = c("ordered", 
    "factor")), acad.lev = structure(c(6L, 6L, 6L, 6L, 6L, 6L
    ), .Label = c("Freshman", "Sophomore", "Junior", "Senior", 
    "PB Undergrad", "Graduate"), class = c("ordered", "factor"
    )), prog = c("loan 1", "loan 1", "loan 2", "loan 2", "loan 3", 
    "grant 1"), prog1 = c("Other Loans", "Other Loans", "Stafford Loan", 
    "Stafford Loan", "Stafford Loan", "University Grant"), fa.type = structure(c(3L, 
    3L, 3L, 3L, 3L, 2L), .Label = c("Athletic", "Grant", "Loan", 
    "Scholarship", "Waiver", "Work/Study"), class = "factor"), 
    amount = c(5000, 5000, 8781, 8781, 4250, 1707)), .Names = c("id", 
"term", "yr", "nslds", "acad.lev", "prog", "prog1", "fa.type", 
"amount"), row.names = c(NA, 6L), class = "data.frame")
like image 902
eipi10 Avatar asked Mar 05 '13 18:03

eipi10


2 Answers

This isn't an answer, but a simple (non-sensical) reproducible example that wouldn't fit in the comments. You can recreate this error with this simple example (on my MacBookPro).

require(reshape2)
n = 1448
df <- data.frame( Student = rep( 1:n , each = 2 ) , Grade = sample( 100 , n*2 , repl = TRUE ) )
df2 <- dcast( df , Student ~ Student , value.var = "Grade" , sum )
Error: segfault from C stack overflow

The error occurs at the boundary n = 1448, i.e. it doesn't occur when n=1447 and below. It seems that the error is coming from split_indices in split-numeric.c from the package plyr. It could have to do with the fact that the number of grouping levels is assigned to an (unsigned?) integer value, and if the number of groups goes over 32767 it causes a memory access error, but TBH I'm clutching at straws now.

My sessionInfo() in case anyone can't recreate this error is:

R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.2

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2

Interestingly, if I run the df2 <- command again after getting the first error, R crashes out completely and I get some OS generated error report. I include the relevant portion of the crash log here:

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_PROTECTION_FAILURE at 0x00007fff5f3ff120

VM Regions Near 0x7fff5f3ff120:
    JS JIT generated code  00004d431a401000-00004d431a402000 [    4K] ---/rwx SM=NUL  
--> STACK GUARD            00007fff5bc00000-00007fff5f400000 [ 56.0M] ---/rwx SM=NUL  stack guard for thread 0
    Stack                  00007fff5f400000-00007fff5fc00000 [ 8192K] rw-/rwx SM=COW  thread 0

Application Specific Information:
objc[57147]: garbage collection is OFF

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_c.dylib               0x00007fff897c4632 small_free_scan_madvise_free + 41
1   libsystem_c.dylib               0x00007fff897c5f06 szone_free_definite_size + 4186
2   libsystem_c.dylib               0x00007fff897fe789 free + 194
3   libR.dylib                      0x0000000100222dbf R_gc_internal + 7327 (memory.c:952)
4   libR.dylib                      0x0000000100224919 Rf_allocVector + 841 (memory.c:2356)
5   plyr.so                         0x000000010144bd2c split_indices + 204 (split-numeric.c:23)
6   libR.dylib                      0x00000001001b4cc7 do_dotcall + 16311 (dotcode.c:593)
7   libR.dylib                      0x00000001001e4448 Rf_eval + 1672 (eval.c:494)
8   libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
9   libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
10  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
11  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
12  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
13  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
14  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
15  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
16  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
17  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
18  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
19  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
20  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
21  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
22  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
23  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
24  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
25  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
26  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
27  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
28  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
29  libR.dylib                      0x00000001001e5edd do_begin + 141 (eval.c:1415)
30  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
31  libR.dylib                      0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
32  libR.dylib                      0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
33  libR.dylib                      0x00000001001e74e5 do_set + 709 (eval.c:1717)
34  libR.dylib                      0x00000001001e429c Rf_eval + 1244 (eval.c:468)
35  libR.dylib                      0x000000010021c761 R_ReplDLLdo1 + 481 (main.c:362)
36  org.R-project.R                 0x0000000100022c24 run_REngineRmainloop + 196
37  org.R-project.R                 0x00000001000159b7 -[REngine runREPL] + 119
38  org.R-project.R                 0x0000000100001f24 main + 852
39  org.R-project.R                 0x0000000100001914 start + 52
like image 132
Simon O'Hanlon Avatar answered Nov 04 '22 07:11

Simon O'Hanlon


I'm having a same problem in pivoting a long table to wide one using dcast in package reshape2. I found solution in this post plyr split_indices function crashes for long vectors. Specifically, you could download the split_numeric.c and loop-apply.c in this page https://github.com/hadley/plyr/tree/master/src. Uninstall the package plyr from R console, and finally reinstall the package locally: install.packages('/path/to/source', repos=NULL, type='source').

This solves my problem, hope it helps.

like image 43
X.X Avatar answered Nov 04 '22 06:11

X.X