
Append multiple large data.tables; custom data coercion using colClasses and fread; named pipes

[This is kind of multiple bug-reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R; so apologies if I'm not following best practices in my code below. I'm trying.]

1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)

First, I want to report that using rbindlist to append large data.tables causes R to segfault (Ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from CRAN from within R). The machine has 128 GiB of RAM, so I shouldn't be running out of memory given the size of the data.

My code:

append.tables <- function(files) {
    moves.by.year <- lapply(files, fread)
    move <- rbindlist(moves.by.year)
    rm(moves.by.year)
    move[,week_end := as.Date(as.character(week_end), format="%Y%m%d")]
    return(move)
}

Crash message:
> system.time(move <- append.tables(files))
 *** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'

Traceback:
 1: rbindlist(moves.by.year)
 2: append.tables(files)
 3: system.time(move <- append.tables(files))

There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.

2. Could fread accept multiple file names?

In any case, I think a better approach here would be to allow fread to take files as a vector of file names:

files <- c("my", "files", "to be", "appended")
dt <- fread(files)

Presumably fread could be much more memory efficient under the hood, because it would not have to keep all of these intermediate objects around at the same time, which appears to be unavoidable for a user working at the R level.

3. colClasses gives an error message

My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:

dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) : 
  Column name 'myDate' in colClasses not found in data

Yes, in the case of dates, a simple:

    dt[,date := as.Date(as.character(date), format="%Y%m%d")]

works.

However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
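To make the precision concern concrete, here is a minimal base R sketch. The sample value "0.29" is made up for illustration, and the text-based route assumes every value in the column carries the same number of decimal places.

```r
# Hypothetical sample price, as it would appear in the raw text file.
raw <- "0.29"

# Route 1: parse as double, scale to cents, then truncate to integer.
# 0.29 has no exact binary representation, so the product falls just below 29.
via_double <- as.integer(as.numeric(raw) * 100)

# Route 2: strip the decimal point from the text, then parse as integer.
via_text <- as.integer(gsub(".", "", raw, fixed = TRUE))

via_double  # 28
via_text    # 29
```

Rounding instead of truncating would rescue this particular value, but working from the original text avoids relying on floating-point behaviour at all.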

Now, I can get around this with some system() calls to append the files and pipe them through some sed magic (simplified here) (where tfile is another temporary file):

if (has_header) {
    tfile2 <- tempfile()
    system(paste("echo fakeline >>", tfile2))
    system(paste("head -q -n1", files[[1]], ">>", tfile2))
    system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
                 " | sed 's/\\.//' >>", tfile), wait=wait)
    unlink(tfile2)
} else {
    system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}

but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)

4. fread thinks named pipes are empty files

I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe with system(paste("mkfifo", tfile)), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:

system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
             " | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999

In any case, this is a bit of a hack.

Simplified Code if I had my wish list

Ideally, I would be able to do something like this:

setClass("Int_Price")
setAs("character", "Int_Price",
    function (from) {
        return(as.integer(gsub("\\.", "", from)))
    }
)

dt <- fread(files, colClasses=list(price="Int_Price"))

And then I'd have a nice long data.table with properly coerced data.

James asked Jan 19 '14 17:01


1 Answer

Update: The rbindlist bug has been fixed in commit 1100 of v1.8.11. From NEWS:

o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.


As mentioned in comments, you're supposed to ask separate questions separately. But since they're such good points and linked together into the wish at the end, ok, will answer in one go.

1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)

Please run again, changing the line

moves.by.year <- lapply(files, fread)

to

moves.by.year <- lapply(files, fread, verbose=TRUE)

and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with long vector support in R3 yet, otherwise > 2^31 rows would be fine, too).
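The row-count arithmetic can be checked directly in R. This is only a sketch: the figures come from the question, and the 8 bytes per element (as for a double column) is an assumption. Whether a byte count like this is the overflow described in the NEWS entry quoted at the top of this answer is likewise an assumption, not something the entry states.

```r
rows_per_file <- 100e6   # about 100 million lines per file, from the question
n_files       <- 6
total_rows    <- rows_per_file * n_files

# R's largest ordinary integer, 2^31 - 1.
int_max <- .Machine$integer.max

# The row count itself fits comfortably.
total_rows < int_max

# Assumption: 8 bytes per element, as for a double column.
# A byte count of this size would not fit in a 32-bit signed integer.
bytes_per_column <- total_rows * 8
bytes_per_column > int_max

# Row count at which 8-byte elements reach the 32-bit limit (about 268 million).
int_max / 8
```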

2. Could fread accept multiple file names?

Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.

I'll file this as a feature request and post the link here.

3. colClasses gives an error message

When colClasses is a list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is to be easier than colClasses in read.csv, which only accepts a vector, and to save repeating "character" over and over. I could have sworn this was better documented in ?fread, but it seems not. I'll take a look at that.

So, instead of

fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) : 
    Column name 'myDate' in colClasses not found in data

the correct syntax is

fread(tfile, colClasses=list(myDate="date"))

Given what you go on to say in the question, if I understand correctly, you actually want:

fread(tfile, colClasses=list(character="date"))  # just fread accepts list

or

fread(tfile, colClasses=c("date"="character"))   # both read.csv and fread

Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, I have yet to implement that coercion automatically. You mentioned the precision of numerics, so a reminder that integer64 can be read directly by fread too.
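As a rough sketch of that read-as-character-then-coerce pattern applied to the price use case from the question: the file contents and the column names id and price below are invented for illustration, and the conversion assumes every price has the same number of decimal places.

```r
library(data.table)

# Hypothetical two-column, tab-separated sample written to a temporary file.
tfile <- tempfile(fileext = ".tsv")
writeLines(c("id\tprice", "1\t12.50", "2\t3.99"), tfile)

# Read "price" as character so the original digits are preserved.
dt <- fread(tfile, sep = "\t", colClasses = list(character = "price"))

# Strip the decimal point, then convert to integer (here: cents).
dt[, price := as.integer(gsub(".", "", price, fixed = TRUE))]

unlink(tfile)
```

This keeps the coercion inside R and avoids the extra sed pass, at the cost of holding the column as character until it is converted.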

4. fread thinks named pipes are empty files

Hopefully this goes away now, assuming the previous point is resolved? fread works by memory mapping its input. It can accept non-files such as http addresses and connections (tbc), and what it does first, for convenience, is write the complete input to ramdisk so it can map the input from there. fread's speed goes hand in hand with seeing the entire input first.
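For readers on later data.table releases, fread also has a cmd argument that runs a shell command and reads its output, which covers the sed preprocessing without a named pipe. The sketch below assumes a Unix-like system with sed available, and the sample file is invented. It does not stream: as described above, fread still needs the complete input before it parses.

```r
library(data.table)

# Hypothetical sample file; in the question this would be one of the large inputs.
tfile <- tempfile(fileext = ".tsv")
writeLines(c("id\tprice", "1\t12.50", "2\t3.99"), tfile)

# Let fread run the shell command and read its output.
# The sed expression removes the first "." on each line, as in the question.
cmd <- paste("sed 's/\\.//'", shQuote(tfile))
dt <- fread(cmd = cmd, sep = "\t")

unlink(tfile)
```

Because the decimal point is already gone by the time fread sees the data, the price column is detected as integer directly.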

Matt Dowle answered Sep 17 '22 23:09