[This is kind of multiple bug-reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R; so apologies if I'm not following best practices in my code below. I'm trying.]
rbindlist crash on 6 * 8GB files (I have 128GB RAM)
First I want to report that using rbindlist to append large data.tables causes R to segfault (ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from within R from CRAN). The machine has 128 GiB of RAM; so, I shouldn't be running out of memory given the size of the data.
My code:
append.tables <- function(files) {
moves.by.year <- lapply(files, fread)
move <- rbindlist(moves.by.year)
rm(moves.by.year)
move[,week_end := as.Date(as.character(week_end), format="%Y%m%d")]
return(move)
}
Crash message:
append.tables crashes with this:
> system.time(move <- append.tables(files))
*** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'
Traceback:
1: rbindlist(moves.by.year)
2: append.tables(files)
3: system.time(move <- append.tables(files))
There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.
Could fread accept multiple file names?
In any case, I think a better approach here would be to allow fread to take files as a vector of file names:
files <- c("my", "files", "to be", "appended")
dt <- fread(files)
Presumably fread could be much more memory efficient under the hood this way, since it would not have to keep all of the intermediate objects around at the same time, as appears to be necessary for a user of R.
colClasses gives an error message
My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:
dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
Yes, in the case of dates, a simple:
dt[,date := as.Date(as.character(date), format="%Y%m%d")]
works.
However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
Now, I can get around this with some system() calls to append the files and pipe them through some sed magic (simplified here) (where tfile is another temporary file):
if (has_header) {
tfile2 <- tempfile()
system(paste("echo fakeline >>", tfile2))
system(paste("head -q -n1", files[[1]], ">>", tfile2))
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=wait)
unlink(tfile2)
} else {
system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}
but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)
fread thinks named pipes are empty files
I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe (system(paste("mkfifo", tfile))), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999
In any case, this is a bit of a hack.
Ideally, I would be able to do something like this:
setClass("Int_Price")
setAs("character", "Int_Price",
function (from) {
return(as.integer(gsub("\\.", "", from)))
}
)
dt <- fread(files, colClasses=list(price="Int_Price"))
And then I'd have a nice long data.table with properly coerced data.
o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.
As mentioned in comments, you're supposed to ask separate questions separately. But since they're such good points and linked together into the wish at the end, ok, will answer in one go.
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
Please run again changing the line :
moves.by.year <- lapply(files, fread)
to
moves.by.year <- lapply(files, fread, verbose=TRUE)
and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with long vector support in R3 yet, otherwise > 2^31 rows would be fine, too).
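The arithmetic above can be checked in a few lines of base R. This is only a sketch of the reasoning: the file sizes and row counts are the approximate figures from the question, and the variable names are made up for illustration.

```r
# Approximate figures taken from the question.
rows_per_file <- 100e6   # about 100 million rows per file
n_files       <- 6
gb_per_file   <- 8
ram_gb        <- 128

# Total rows, held as a double so the multiplication itself cannot overflow.
total_rows <- rows_per_file * n_files

# Largest row count addressable without long vector support: 2^31 - 1.
int_max <- .Machine$integer.max

# Peak memory: the list of six tables plus the single combined table.
peak_gb <- 2 * n_files * gb_per_file

total_rows < int_max   # row count fits
peak_gb < ram_gb       # both copies fit in RAM
```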
2. Could fread accept multiple file names?
Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.
I'll file this as a feature request and post the link here.
3. colClasses gives an error message
When colClasses is of type list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is to be easier than colClasses in read.csv, which only accepts a vector; to save repeating "character" over and over. I could have sworn this was better documented in ?fread but it seems not. I'll take a look at that.
So, instead of
fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
the correct syntax is
fread(tfile, colClasses=list(myDate="date"))
Given what you go on to say in the question, iiuc, you actually want :
fread(tfile, colClasses=list(character="date")) # just fread accepts list
or
fread(tfile, colClasses=c("date"="character")) # both read.csv and fread
Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, then I've still to implement that coercion automatically. You mentioned precision of numeric so just to remind that integer64 can be read directly by fread too.
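Putting that together for the price use case in the question, a minimal sketch might look like the following. The sample file and its two columns are hypothetical stand-ins for the real data, and the decimal-point stripping assumes every price is written with the same number of decimal places.

```r
library(data.table)

# Hypothetical tab-separated sample standing in for one of the real files.
tfile <- tempfile(fileext = ".tsv")
writeLines(c("date\tprice", "20130105\t19.99", "20130112\t0.07"), tfile)

# List form of colClasses: type on the left of '=', column names on the right.
dt <- fread(tfile, colClasses = list(character = c("date", "price")))

# Coerce after loading. The price goes from text straight to integer,
# so it never passes through a double.
dt[, date := as.Date(date, format = "%Y%m%d")]
dt[, price := as.integer(gsub(".", "", price, fixed = TRUE))]

unlink(tfile)
```

Here price ends up as an integer count of the smallest unit (cents, in this made-up sample).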
4. fread thinks named pipes are empty files
Hopefully this goes away now assuming the previous point is resolved? fread works by memory mapping its input. It can accept non-files such as http addresses and connections (tbc) and what it does first for convenience is to write the complete input to ramdisk so it can map the input from there. The reason fread is fast is hand in hand with seeing the entire input first.
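If the sed preprocessing from the question is still wanted, one alternative to a named pipe is to hand fread the shell command itself. This is a sketch only: it assumes a Unix-like shell with sed on the path and a data.table version that provides the cmd argument, and the sample file is hypothetical. fread still materialises the command's complete output before parsing it, so this saves managing the temporary file by hand rather than removing the intermediate copy.

```r
library(data.table)

# Hypothetical sample file standing in for one of the real inputs.
f <- tempfile(fileext = ".tsv")
writeLines(c("id\tprice", "1\t19.99", "2\t5.25"), f)

# sed removes the first '.' on each line; fread then parses the command's output.
dt <- fread(cmd = paste("sed 's/\\.//'", shQuote(f)))

unlink(f)
```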