I have a problem running some R scripts on our cluster. The problem appeared suddenly: all the scripts had been working fine, but one day they started failing with a "caught segfault" error. I cannot provide reproducible code because I can't reproduce the error on my own computer; it only happens on the cluster. I use the same code for two data sets: the small one runs fine, while the other, which works with larger data frames (about 10 million rows), crashes at certain points. I only use packages from the CRAN repository, and R and all the packages should be up to date. The error shows up during totally unrelated operations; see the examples below:
Session info:
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Writing variable to NetCDF file
# code snippet
library(ncdf4)
library(reshape2)
input <- read.csv("input_file.csv")
species <- "no2"
dimX <- ncdim_def(name="x", units = "m", vals = unique(input$x), unlim = FALSE)
dimY <- ncdim_def(name="y", units = "m", vals = unique(input$y), unlim = FALSE)
dimTime <- ncdim_def(name = "time", units = "hours", unlim = TRUE)
varOutput <- ncvar_def(name = species, units = "ug/m3",
                       dim = list(dimX, dimY, dimTime), missval = -9999, longname = species)
nc_file <- nc_create(filename = "outFile.nc", vars = list(varOutput), force_v4 = TRUE)
ncvar_put(nc = nc_file, varid = species, vals = acast(input, x~y), start = c(1,1,1),
          count = c(length(unique(input$x)), length(unique(input$y)), 1))
At this point, I get the following error:
*** caught segfault ***
address 0x2b607cac2000, cause 'memory not mapped'
Traceback:
1: id(rev(ids), drop = FALSE)
2: cast(data, formula, fun.aggregate, ..., subset = subset, fill = fill, drop = drop, value.var = value.var)
3: acast(result, x ~ y)
4: ncvar_put(nc = nc_file, varid = species, vals = acast(result, x ~ y), start = c(1, 1), count = c(length(unique(result$x)), length(unique(result$y))))
An irrecoverable exception occurred. R is aborting now ...
/opt/sge/default/spool/node10/job_scripts/122270: line 3: 13959 Segmentation fault (core dumped)
Complex code with parallel computation
*** caught segfault ***
address 0x330d39b40, cause 'memory not mapped'
Traceback:
1: .Call(gstat_fit_variogram, as.integer(fit.method), as.integer(fit.sills), as.integer(fit.ranges))
2: fit.variogram(experimental_variogram, model = vgm(psill = psill, model = model, range = range, nugget = nugget, kappa = kappa), fit.ranges = c(fit_range), fit.sills = c(fit_nugget, fit_sill), debug.level = 0)
3: doTryCatch(return(expr), name, parentenv, handler)
4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
5: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
7: try(fit.variogram(experimental_variogram, model = vgm(psill = psill, model = model, range = range, nugget = nugget, kappa = kappa), fit.ranges = c(fit_range), fit.sills = c(fit_nugget, fit_sill), debug.level = 0), TRUE)
8: getModel(initial_sill - initial_nugget, m, initial_range, k, initial_nugget, fit_range, fit_sill, fit_nugget, verbose = verbose)
9: autofitVariogram(lmResids ~ 1, obsDf, model = "Mat", kappa = c(0.05, seq(0.2, 2, 0.1), 3, 5, 10, 15), fix.values = c(NA, NA, NA), start_vals = c(NA, NA, NA), verbose = F)
10: main_us(obsDf[obsDf$class == "rural" | obsDf$class == "rural-nearcity" | obsDf$class == "rural-regional" | obsDf$class == "rural-remote", ], grd_alt, grd_pop, lm_us, fitvar_us, logTransform, plots, "RuralSt", period, preds)
11: doTryCatch(return(expr), name, parentenv, handler)
12: tryCatchOne(expr, names, parentenv, handlers[[1L]])
13: tryCatchList(expr, classes, parentenv, handlers)
14: tryCatch(main_us(obsDf[obsDf$class == "rural" | obsDf$class == "rural-nearcity" | obsDf$class == "rural-regional" | obsDf$class == "rural-remote", ], grd_alt, grd_pop, lm_us, fitvar_us, logTransform, plots, "RuralSt", period, preds), error = function(e) e)
15: eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv)
16: eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv)
17: doTryCatch(return(expr), name, parentenv, handler)
18: tryCatchOne(expr, names, parentenv, handlers[[1L]])
19: tryCatchList(expr, classes, parentenv, handlers)
20: tryCatch(eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv), error = function(e) e)
21: (function (args) { lapply(names(args), function(n) assign(n, args[[n]], pos = .doSnowGlobals$exportenv)) tryCatch(eval(.doSnowGlobals$expr, envir = .doSnowGlobals$exportenv), error = function(e) e)})(quote(list(timeIndex = 255L)))
22: do.call(msg$data$fun, msg$data$args, quote = TRUE)
23: doTryCatch(return(expr), name, parentenv, handler)
24: tryCatchOne(expr, names, parentenv, handlers[[1L]])
25: tryCatchList(expr, classes, parentenv, handlers)
26: tryCatch(do.call(msg$data$fun, msg$data$args, quote = TRUE), error = handler)
27: doTryCatch(return(expr), name, parentenv, handler)
28: tryCatchOne(expr, names, parentenv, handlers[[1L]])
29: tryCatchList(expr, classes, parentenv, handlers)
30: tryCatch({ msg <- recvData(master) if (msg$type == "DONE") { closeNode(master) break } else if (msg$type == "EXEC") { success <- TRUE handler <- function(e) { success <<- FALSE structure(conditionMessage(e), class = c("snow-try-error", "try-error")) } t1 <- proc.time() value <- tryCatch(do.call(msg$data$fun, msg$data$args, quote = TRUE), error = handler) t2 <- proc.time() value <- list(type = "VALUE", value = value, success = success, time = t2 - t1, tag = msg$data$tag) msg <- NULL sendData(master, value) value <- NULL }}, interrupt = function(e) NULL)
31: slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
32: parallel:::.slaveRSOCK()
An irrecoverable exception occurred. R is aborting now ...
Is it likely that the issue is with the cluster rather than with the code (or R)? I don't know whether it is related, but for some time now we have been getting error messages like these:
Message from syslogd@master1 at Mar 8 13:51:37 ...
kernel:[Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
Message from syslogd@master1 at Mar 8 13:51:37 ...
kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@master1 at Mar 8 13:51:37 ...
kernel:[Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c08400067080a13
Message from syslogd@master1 at Mar 8 13:51:37 ...
kernel:[Hardware Error]: MC4_ADDR: 0x000000048f32b490
Message from syslogd@master1 at Mar 8 13:51:37 ...
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
I have tried to uninstall and reinstall packages based on this question but it didn't help.
The problem is a mismatch between the currently installed shared libraries and the libraries that R or its packages were built against.
I got this error for the first time today (see below). I have solved it and can explain the situation.
This is an Ubuntu system that was recently upgraded from 17.10 to 18.04, running R-3.4.4. A lot of C and C++ libraries were replaced, but not all programs were. I immediately noticed that lots of programs were getting segmentation faults. Anything that touched the tidyverse failed. The stringi package could not find the shared libraries with which it was compiled.
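One way to confirm this kind of breakage on Linux is to run ldd on a package's compiled shared object and look for entries marked "not found". The sketch below is a minimal illustration, not a complete diagnostic: missing_libs is a hypothetical helper name, and the sample library name in the comments is made up.

```r
# Hypothetical helper: extract the names of unresolved libraries from the
# lines printed by `ldd` (unresolved entries look like "libfoo.so.1 => not found").
missing_libs <- function(ldd_lines) {
  hits <- grep("=> not found", ldd_lines, fixed = TRUE, value = TRUE)
  trimws(sub("=>.*$", "", hits))
}

# Usage on a Linux machine (not run here; assumes the package is installed):
# so_files <- list.files(system.file("libs", package = "stringi"),
#                        pattern = "\\.so$", full.names = TRUE)
# missing_libs(system2("ldd", so_files, stdout = TRUE))
```

An empty result only means the dynamic loader can resolve every dependency; it does not rule out a library that resolves but has an incompatible ABI.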
The example here is a bit interesting because it happens when running the "R CMD check" for a package, which, at least in theory, should be safe. I found the fix was to remove the packages "RCurl" and "url" and rebuild them.
Here's the symptom, anyway
* checking for file ‘kutils.gitex/DESCRIPTION’ ... OK
* preparing ‘kutils’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* looking to see if a ‘data/datalist’ file should be added
* re-saving image files
* building ‘kutils_1.40.tar.gz’
Warning: invalid uid value replaced by that for user 'nobody'
Warning: invalid gid value replaced by that for user 'nobody'
Run check: OK? (y or n)y
* using log directory ‘/home/pauljohn/GIT/CRMDA/software/kutils/package/kutils.Rcheck’
* using R version 3.4.4 (2018-03-15)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* using option ‘--as-cran’
* checking for file ‘kutils/DESCRIPTION’ ... OK
* checking extension type ... Package
* this is package ‘kutils’ version ‘1.40’
* checking CRAN incoming feasibility ...
*** caught segfault ***
address 0x68456, cause 'memory not mapped'
Traceback:
1: curlGetHeaders(u)
2: doTryCatch(return(expr), name, parentenv, handler)
3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
4: tryCatchList(expr, classes, parentenv, handlers)
5: tryCatch(curlGetHeaders(u), error = identity)
6: .fetch(u)
7: .check_http_A(u)
8: FUN(X[[i]], ...)
9: lapply(urls[pos], .check_http)
10: do.call(rbind, lapply(urls[pos], .check_http))
11: check_url_db(url_db_from_package_sources(dir), remote = !localOnly)
12: doTryCatch(return(expr), name, parentenv, handler)
13: tryCatchOne(expr, names, parentenv, handlers[[1L]])
14: tryCatchList(expr, classes, parentenv, handlers)
15: tryCatch(check_url_db(url_db_from_package_sources(dir), remote = !localOnly), error = identity)
16: .check_package_CRAN_incoming(pkgdir, localOnly)
17: check_CRAN_incoming(!check_incoming_remote)
18: tools:::.check_packages()
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault
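The rebuild step described above can be narrowed down a little. The sketch below is only a rough proxy: it flags packages whose recorded "Built" field belongs to a different R minor series than the running R, which by itself does not detect a replaced system library. stale_packages and minor_series are hypothetical helper names; installed.packages() and update.packages() are standard utils functions.

```r
# Hypothetical helper: reduce a version string such as "3.4.4" to "3.4".
minor_series <- function(v) sub("^([0-9]+\\.[0-9]+).*$", "\\1", v)

# Hypothetical helper: names of packages built under a different R minor series.
# `pkgs` is a matrix shaped like the result of installed.packages().
stale_packages <- function(pkgs = installed.packages(),
                           running = as.character(getRversion())) {
  stale <- minor_series(pkgs[, "Built"]) != minor_series(running)
  unname(pkgs[stale, "Package"])
}

# Possible follow-up (not run here, needs a CRAN mirror):
# update.packages(checkBuilt = TRUE, ask = FALSE)  # rebuild packages built under older R
# install.packages(c("RCurl"))                     # force a rebuild of a specific package
```

If R itself was not upgraded, as in an OS upgrade that keeps the same R version, the "Built" check will likely report nothing, and an explicit reinstall of the affected packages is the more direct route.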
This is not really an explanation of the problem or a satisfactory answer, but I examined the code more closely and found that in the first example the problem appears when using acast from the reshape2 package. I removed it in this case because I realized it is not actually needed there, but it can be replaced with reshape from the stats package that ships with base R (as shown in another question): reshape(input, idvar="x", timevar="y", direction="wide")[-1].
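As a small illustration of that replacement, the sketch below runs reshape(), as it ships with base R's stats package, on a toy data frame. The data values and the column name no2 are made up for the example; the real input has many more rows.

```r
# Toy stand-in for the real input (values are made up)
input <- data.frame(x   = c(1, 1, 2, 2),
                    y   = c(10, 20, 10, 20),
                    no2 = c(5, 6, 7, 8))

# Wide layout: one row per x, one column per y; [-1] drops the x column
wide <- reshape(input, idvar = "x", timevar = "y", direction = "wide")[-1]

# Plain numeric matrix (rows = x, columns = y) that can be passed as `vals`
grid <- unname(as.matrix(wide))
dim(grid)
```

As far as I can tell, reshape() keeps rows and columns in order of first appearance rather than sorting them, so it may be worth sorting the input by x and y first if the grid order matters.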
As for the second example, it is not easy to find the exact cause of the problem, but what helped in my case, as a workaround, was reducing the number of cores used for parallel computation. The cluster has 48 cores; I was using only 15, since even before this issue R ran out of memory when the code was run on all 48. When I reduced the number of cores to 10, it suddenly started working like before.
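The worker count can also be derived from a memory budget instead of being picked by trial and error. This is a minimal sketch under stated assumptions: pick_workers is a hypothetical helper, and the memory figures in the usage comment are invented, not measurements from the cluster in question.

```r
library(parallel)

# Hypothetical helper: cap the number of workers by an estimated
# per-worker memory footprint, never exceeding the available cores.
pick_workers <- function(total_mem_gb, per_worker_gb, max_cores = detectCores()) {
  by_memory <- floor(total_mem_gb / per_worker_gb)
  max(1L, as.integer(min(by_memory, max_cores)))
}

# Usage (figures are assumptions): 64 GB of RAM, roughly 6 GB per worker, 48 cores
# n  <- pick_workers(64, 6, 48)
# cl <- makeCluster(n)
# ...
# stopCluster(cl)
```

Each socket worker holds its own copy of the exported data, so with 10 million rows the per-worker estimate should include those copies.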
To add to @pauljohn32's response, this can also happen if you are using sourceCpp (from Rcpp) to source C++ code, say A.cpp, that relies on other C++ code, say B.cpp and C.cpp, that was compiled against an older or different library.
An easy solution on Linux is to remove the B.o and C.o files before running sourceCpp("A.cpp"). This also seems to recompile the dependent files automatically, assuming the headers are included in A.cpp.
EDIT with more details in response to Matt Nolan: Regarding the original question, there the problem is most likely similar, with shared libraries having been compiled for an older version of the OS or a different system. What I am saying here is that even if you have written and compiled the entire project yourself, this could still happen if you forget to clean up outdated files.
To give an analogy relevant to the question:
Digging into the source code of the ncdf4 package referenced in the question, we find the following snippet in src\ncdf.c:
#include <stdio.h>
#include <netcdf.h>
#include <string.h>
#include <stdlib.h>
#include <Rdefines.h>
#include <R_ext/Rdynload.h>
Let's say the file R_ext/Rdynload.h is part of the microsoft-r-open project. This is a header file, and the corresponding Rdynload.c can be found here.
Suppose ncdf4 and microsoft-r-open were all part of a single project and you have already compiled the files in open/blob/master/source/src/main, which would have produced the object file Rdynload.o there, among other things. Then, before compiling src\ncdf.c, you upgrade the operating system (not sure if this would necessarily cause a problem) or copy the entire source code, including the object files created so far, to a different machine. This can happen inadvertently.
For example, you have automatic sync running and the directory is synced with a different machine. On this different machine you then try to compile and link src\ncdf.c. The compiler/linker does not recompile Rdynload.c since the object file Rdynload.o is already there. It compiles src\ncdf.c to produce src\ncdf.o and then links it with Rdynload.o to build the final executable.
I am not an expert here, but since Rdynload is perhaps a dynamically linked library, the linking goes through with no errors. At runtime, however, you get the segmentation fault due to a version mismatch between the object code of the compiled library Rdynload and the object code of ncdf (?). Someone with better knowledge of low-level machine execution can correct me here.
The solution is to purge all the object files, i.e., the files with the extension *.o, in all the source directories and let the compiler recompile everything from scratch. The *.o extension assumes you are on a Linux machine; other operating systems may use a different extension.
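The purge can be scripted from R itself. The following is a minimal sketch, assuming object files end in .o and shared objects in .so as on Linux; purge_objects is a hypothetical helper name, and it defaults to a dry run so nothing is deleted by accident.

```r
# Hypothetical helper: find (and optionally delete) compiled artifacts under
# src_dir so that the next build recompiles everything from source.
purge_objects <- function(src_dir, dry_run = TRUE) {
  objs <- list.files(src_dir, pattern = "\\.(o|so)$",
                     recursive = TRUE, full.names = TRUE)
  if (!dry_run) file.remove(objs)
  invisible(objs)  # return the matched paths either way
}

# Usage: inspect first, then delete
# print(purge_objects("path/to/project/src"))
# purge_objects("path/to/project/src", dry_run = FALSE)
```

Source files such as A.cpp are left untouched because only the compiled extensions match the pattern.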
For a project you don't own, the solution is perhaps to reinstall the relevant libraries (assuming that they are not precompiled and get recompiled on the new machine at installation).