RStudio was crashing when I tried to reshape a particular data frame using dcast (from the reshape2 package). I discovered that the crash was actually happening in R itself, so I ran my casting code in R.app and got the type of error that gives this site its name: Error: segfault from C stack overflow. With the help of Google and SO, I learned that this is a memory access error.
Okay, I got that far, but I don't know where to go from here. I can't provide a true reproducible example, because my data frame is about 558,000 rows and the problem doesn't occur on small toy examples. For example, even if I take, say, a 50,000-row subset of the data, dcast works just fine. Could there be a particular row of data that's causing a problem? If so, can anyone suggest what feature(s) to look for that could be causing the type of error I'm getting?
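Since the crash seems to depend on size rather than on any one small subset, one thing that might be worth checking is how many output cells the cast would have to build (distinct row combinations times distinct column combinations). The following is only a rough sketch on a made-up stand-in data frame; the names toy, row_vars, and col_vars are illustrative and not part of the real data or of reshape2.

```r
# Toy stand-in for the real data frame (values are made up for illustration).
toy <- data.frame(
  id      = c(1, 1, 2, 2, 3, 3),
  term    = c("Fall 2009", "Spring 2010", "Fall 2009",
              "Spring 2010", "Fall 2007", "Fall 2007"),
  prog1   = c("Other Loans", "Other Loans", "Stafford Loan",
              "Stafford Loan", "Stafford Loan", "University Grant"),
  fa.type = c("Loan", "Loan", "Loan", "Loan", "Loan", "Grant"),
  amount  = c(5000, 5000, 8781, 8781, 4250, 1707)
)

row_vars <- c("id", "term")         # variables on the left of the cast formula
col_vars <- c("prog1", "fa.type")   # variables on the right of the cast formula

# Distinct combinations that would become output rows and output columns.
n_row_combos <- nrow(unique(toy[row_vars]))
n_col_combos <- nrow(unique(toy[col_vars]))

# Upper bound on the number of cells in the wide result.
n_cells <- n_row_combos * n_col_combos
print(c(rows = n_row_combos, cols = n_col_combos, cells = n_cells))
```

Running the same counts on the full data frame and on the 50,000-row subset would show whether the failing case is simply much larger than the ones that work.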
Here is a subset of the data frame I'm casting from (with fake values for some variables), followed by the casting function I'm using. I've also included this small snippet of data in a dput function below, in case it would be helpful to play around with it. The real data set has about 700 values of prog, 15 values of prog1, and 5 values of fa.type.
  id        term   yr    nslds acad.lev    prog            prog1 fa.type amount
1  1   Fall 2009 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
2  1 Spring 2010 2010 Graduate Graduate  loan 1      Other Loans    Loan   5000
3  2   Fall 2009 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
4  2 Spring 2010 2010 Graduate Graduate  loan 2    Stafford Loan    Loan   8781
5  3   Fall 2007 2008 Graduate Graduate  loan 3    Stafford Loan    Loan   4250
6  3   Fall 2007 2008 Graduate Graduate grant 1 University Grant   Grant   1707
fa.wide = dcast(id + term + yr + nslds + acad.lev ~ prog1 + fa.type , data=fa, value.var="amount", fun.aggregate=sum)
fa = structure(list(id = c(1, 1, 2, 2, 3, 3), term = structure(c(7L,
8L, 7L, 8L, 1L, 1L), .Label = c("Fall 2007", "Spring 2008", "Summer 2008",
"Fall 2008", "Spring 2009", "Summer 2009", "Fall 2009", "Spring 2010",
"Summer 2010", "Fall 2010", "Spring 2011", "Summer 2011", "Fall 2011",
"Spring 2012", "Summer 2012", "Fall 2012", "Spring 2013"), class = c("ordered",
"factor")), yr = c(2010L, 2010L, 2010L, 2010L, 2008L, 2008L),
nslds = structure(c(7L, 7L, 7L, 7L, 7L, 7L), .Label = c("1st Year, Never Attended",
"1st Year, Previously Attended", "2nd Year", "3rd Year",
"4th Year", "5th Year+", "Graduate"), class = c("ordered",
"factor")), acad.lev = structure(c(6L, 6L, 6L, 6L, 6L, 6L
), .Label = c("Freshman", "Sophomore", "Junior", "Senior",
"PB Undergrad", "Graduate"), class = c("ordered", "factor"
)), prog = c("loan 1", "loan 1", "loan 2", "loan 2", "loan 3",
"grant 1"), prog1 = c("Other Loans", "Other Loans", "Stafford Loan",
"Stafford Loan", "Stafford Loan", "University Grant"), fa.type = structure(c(3L,
3L, 3L, 3L, 3L, 2L), .Label = c("Athletic", "Grant", "Loan",
"Scholarship", "Waiver", "Work/Study"), class = "factor"),
amount = c(5000, 5000, 8781, 8781, 4250, 1707)), .Names = c("id",
"term", "yr", "nslds", "acad.lev", "prog", "prog1", "fa.type",
"amount"), row.names = c(NA, 6L), class = "data.frame")
This isn't an answer, but a simple (nonsensical) reproducible example that wouldn't fit in the comments. You can recreate the error with the following code (tested on my MacBook Pro).
library(reshape2)
n <- 1448
# Two grades per student, so every cell of the cast has something to sum
df <- data.frame(Student = rep(1:n, each = 2), Grade = sample(100, n * 2, replace = TRUE))
df2 <- dcast(df, Student ~ Student, value.var = "Grade", fun.aggregate = sum)
Error: segfault from C stack overflow
The error occurs at the boundary n = 1448, i.e. it doesn't occur when n = 1447 or below. It seems that the error is coming from split_indices in split-numeric.c from the plyr package. It could have to do with the number of grouping levels being assigned to an (unsigned?) integer value, so that a memory access error occurs once the number of groups goes over 32767, but TBH I'm clutching at straws now.
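Some back-of-the-envelope arithmetic on that boundary may help narrow things down. This is only a sketch resting on two assumptions that are not confirmed here: that the cast builds one group per Student-by-Student cell (n^2 groups), and that split_indices keeps one 4-byte int per group on the C stack. The 8192K figure is the size of the "Stack" region in the crash log in this answer; all variable names are illustrative.

```r
n_fail <- 1448          # smallest n that crashes
n_ok   <- 1447          # largest n that works

bytes_per_int <- 4            # assumed size of one per-group counter
stack_bytes   <- 8192 * 1024  # "Stack ... [ 8192K]" from the crash log

# Assumed number of groups: one per cell of the n x n result.
groups_fail <- n_fail^2
groups_ok   <- n_ok^2

# Stack space such a counter array would need, and what would be left over.
need_fail <- groups_fail * bytes_per_int
need_ok   <- groups_ok * bytes_per_int
headroom_fail <- stack_bytes - need_fail
headroom_ok   <- stack_bytes - need_ok

print(c(groups_ok = groups_ok, groups_fail = groups_fail))
print(c(headroom_ok = headroom_ok, headroom_fail = headroom_fail))
```

Under these assumptions the working case at n = 1447 already has about 2.09 million groups, far more than 32767, so a 16-bit limit would not explain the boundary. The array size at n = 1448, on the other hand, comes within a couple of kilobytes of the 8192K stack, which would at least be consistent with a stack overflow, though it does not prove that this is the mechanism.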
My sessionInfo(), in case anyone can't recreate this error, is:
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] reshape2_1.2.2
loaded via a namespace (and not attached):
[1] plyr_1.8 stringr_0.6.2
Interestingly, if I run the df2 <- command again after getting the first error, R crashes out completely and I get an OS-generated error report. I include the relevant portion of the crash log here:
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_PROTECTION_FAILURE at 0x00007fff5f3ff120
VM Regions Near 0x7fff5f3ff120:
JS JIT generated code 00004d431a401000-00004d431a402000 [ 4K] ---/rwx SM=NUL
--> STACK GUARD 00007fff5bc00000-00007fff5f400000 [ 56.0M] ---/rwx SM=NUL stack guard for thread 0
Stack 00007fff5f400000-00007fff5fc00000 [ 8192K] rw-/rwx SM=COW thread 0
Application Specific Information:
objc[57147]: garbage collection is OFF
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_c.dylib 0x00007fff897c4632 small_free_scan_madvise_free + 41
1 libsystem_c.dylib 0x00007fff897c5f06 szone_free_definite_size + 4186
2 libsystem_c.dylib 0x00007fff897fe789 free + 194
3 libR.dylib 0x0000000100222dbf R_gc_internal + 7327 (memory.c:952)
4 libR.dylib 0x0000000100224919 Rf_allocVector + 841 (memory.c:2356)
5 plyr.so 0x000000010144bd2c split_indices + 204 (split-numeric.c:23)
6 libR.dylib 0x00000001001b4cc7 do_dotcall + 16311 (dotcode.c:593)
7 libR.dylib 0x00000001001e4448 Rf_eval + 1672 (eval.c:494)
8 libR.dylib 0x00000001001e5edd do_begin + 141 (eval.c:1415)
9 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
10 libR.dylib 0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
11 libR.dylib 0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
12 libR.dylib 0x00000001001e74e5 do_set + 709 (eval.c:1717)
13 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
14 libR.dylib 0x00000001001e5edd do_begin + 141 (eval.c:1415)
15 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
16 libR.dylib 0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
17 libR.dylib 0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
18 libR.dylib 0x00000001001e74e5 do_set + 709 (eval.c:1717)
19 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
20 libR.dylib 0x00000001001e5edd do_begin + 141 (eval.c:1415)
21 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
22 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
23 libR.dylib 0x00000001001e5edd do_begin + 141 (eval.c:1415)
24 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
25 libR.dylib 0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
26 libR.dylib 0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
27 libR.dylib 0x00000001001e74e5 do_set + 709 (eval.c:1717)
28 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
29 libR.dylib 0x00000001001e5edd do_begin + 141 (eval.c:1415)
30 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
31 libR.dylib 0x00000001001e93b1 Rf_applyClosure + 849 (eval.c:861)
32 libR.dylib 0x00000001001e41b2 Rf_eval + 1010 (eval.c:512)
33 libR.dylib 0x00000001001e74e5 do_set + 709 (eval.c:1717)
34 libR.dylib 0x00000001001e429c Rf_eval + 1244 (eval.c:468)
35 libR.dylib 0x000000010021c761 R_ReplDLLdo1 + 481 (main.c:362)
36 org.R-project.R 0x0000000100022c24 run_REngineRmainloop + 196
37 org.R-project.R 0x00000001000159b7 -[REngine runREPL] + 119
38 org.R-project.R 0x0000000100001f24 main + 852
39 org.R-project.R 0x0000000100001914 start + 52
I had the same problem when pivoting a long table to a wide one using dcast from the reshape2 package. I found a solution in this post: "plyr split_indices function crashes for long vectors". Specifically, download split-numeric.c and loop-apply.c from https://github.com/hadley/plyr/tree/master/src into your copy of the plyr source, uninstall the plyr package from the R console, and then reinstall it from the local source: install.packages('/path/to/source', repos=NULL, type='source').
This solved my problem; I hope it helps.