Can readLines be executed in parallel within R

Is it possible to iterate over a single text file on a single multi-core machine in parallel with R? For context, the text file is between 250 and 400 MB of JSON output.

EDIT:

Here are some code samples I have been playing around with. To my surprise, the parallel approaches did not win; plain lapply was fastest, though this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.

## test on the first 100 rows of one Twitter JSON file
## (FILE is the path to that file, defined elsewhere)
library(rjson)
library(parallel)
library(foreach)
library(plyr)
library(rbenchmark)

N <- 100
mc.cores <- detectCores()

benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON, 
                   mc.cores=mc.cores),
          foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
          replications=100)

Here is a second code sample, which parses each line and writes each valid tweet out as an RDS file:

parseData <- function(x) {
  x <- tryCatch(fromJSON(x), 
                error=function(e) return(list())
                )
  ## check whether the parse produced valid data; if so, save the tweet out to a file
  if (!is.null(x$id_str)) {
    x$created_at <- strptime(x$created_at,"%a %b %e %H:%M:%S %z %Y")
    fname <- paste("rdata/",
                   format(x$created_at, "%m"),
                   format(x$created_at, "%d"),
                   format(x$created_at, "%Y"),
                   "_",
                   x$id_str,
                   sep="")
    saveRDS(x, fname)
    rm(x, fname)
    gc(verbose=FALSE)
  }
}

t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
Asked Nov 26 '12 by Btibert3

1 Answer

The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.

Reading in parallel

You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:

result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)

Registering a backend can be done using the parallel package, which is now integrated into base R, or with the doSNOW package; see this post on my blog for details.
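
The readJSON in the one-liner above is just a placeholder. A minimal sketch of the whole recipe could look like the following, assuming the big file has already been split into several line-delimited *.json files and that every line is a complete tweet with id_str and created_at fields. doParallel is used here to register the backend (doSNOW works the same way), and read_json_file is a hypothetical helper, not part of any package:

library(rjson)
library(plyr)
library(doParallel)

## hypothetical helper: parse one line-delimited JSON file
## and keep two flat fields so the results rbind cleanly
read_json_file <- function(path) {
  ldply(readLines(path, warn = FALSE), function(line) {
    x <- fromJSON(line)
    data.frame(id_str     = x$id_str,
               created_at = x$created_at,
               stringsAsFactors = FALSE)
  })
}

## register a parallel backend for plyr's .parallel = TRUE
cl <- makeCluster(detectCores())
registerDoParallel(cl)

result <- ldply(list.files(pattern = "\\.json$"), read_json_file,
                .parallel = TRUE,
                .paropts  = list(.packages = c("plyr", "rjson")))

stopCluster(cl)

Because each worker handles a whole file, the per-task overhead is amortized over thousands of lines instead of being paid once per tweet.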

Processing in parallel

In this scenario your best bet is to read the entire dataset into a character vector, split the data into chunks, and then process the chunks using a parallel backend combined with, e.g., the plyr functions, as in the sketch below.
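
A minimal sketch of that approach, again assuming one tweet per line and a hypothetical file name tweets.json (this uses mclapply from the parallel package rather than plyr, but the idea is the same):

library(rjson)
library(parallel)

## read the whole file into a character vector, one JSON object per element
lines <- readLines("tweets.json", warn = FALSE)

## split the vector into one chunk per core, so each worker parses a big
## block of lines instead of a single line at a time
n_cores <- detectCores()
chunks  <- split(lines, cut(seq_along(lines), n_cores, labels = FALSE))

parsed <- mclapply(chunks,
                   function(chunk) lapply(chunk, fromJSON),
                   mc.cores = n_cores)

## flatten back to one list with one element per tweet
parsed <- unlist(parsed, recursive = FALSE)

Chunking like this is probably also why the per-line mclapply calls in the question did not beat plain lapply: a single fromJSON call is too cheap to pay for the overhead of dispatching one line to a worker at a time. Note that mclapply relies on forking, so it works on Linux and OS X but not on Windows, where a cluster-based approach (e.g. parLapply or a doSNOW backend) is needed instead.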

Answered Sep 20 '22 by Paul Hiemstra