Is it possible to iterate over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250-400MB of JSON output.
EDIT:
Here are some code samples I have been playing around with. To my surprise, parallel processing did not win; plain lapply was fastest, though this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.
## test on first 100 rows of 1 twitter file
library(rjson)
library(parallel)
library(foreach)
library(plyr)
library(rbenchmark)

N <- 100
mc.cores <- detectCores()

benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON,
                   mc.cores=mc.cores),
          foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
          replications=100)
Here is a second code sample:
parseData <- function(x) {
  x <- tryCatch(fromJSON(x),
                error=function(e) return(list()))
  ## test whether the record is valid; if so, save it out
  if (!is.null(x$id_str)) {
    x$created_at <- strptime(x$created_at, "%a %b %e %H:%M:%S %z %Y")
    fname <- paste("rdata/",
                   format(x$created_at, "%m"),
                   format(x$created_at, "%d"),
                   format(x$created_at, "%Y"),
                   "_",
                   x$id_str,
                   sep="")
    saveRDS(x, fname)
    rm(x, fname)
    gc(verbose=FALSE)
  }
}
t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.
You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:
result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)
Registering a backend can probably be done using the parallel package, which is now integrated in base R. Or you can use the doSNOW package; see this post on my blog for details.
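As a hedged sketch of the registration step (this assumes the doParallel package, a common foreach/plyr backend that builds on the base parallel package; the cluster size here is illustrative):

```r
## Sketch: register a parallel backend so plyr's .parallel = TRUE works.
## Assumes the doParallel package is installed.
library(doParallel)

cl <- makeCluster(2)        # or detectCores()
registerDoParallel(cl)

## plyr calls with .parallel = TRUE now run on the cluster workers, e.g.:
## result <- ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)

stopCluster(cl)
```

Once a backend is registered, any plyr call with .parallel = TRUE (or a foreach %dopar% loop) is distributed over the workers; remember to stop the cluster when you are done.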
If processing the file is the bottleneck, your best bet is to read the entire dataset into a character vector, split the data, and then use a parallel backend combined with e.g. the plyr functions.
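A minimal, self-contained sketch of that approach (a tiny temporary file stands in for the 250-400MB Twitter dump; fromJSON and mclapply are as in the question):

```r
## Sketch: read the whole file once, then parse the lines in parallel.
## A tiny temporary file stands in for the real dump. mclapply forks,
## which is unavailable on Windows, so fall back to mc.cores = 1 there.
library(parallel)
library(rjson)

tmp <- tempfile(fileext = ".json")
writeLines(c('{"id_str":"1","text":"a"}',
             '{"id_str":"2","text":"b"}'), tmp)

lines  <- readLines(tmp, warn = FALSE)
cores  <- if (.Platform$OS.type == "windows") 1L else 2L
tweets <- mclapply(lines, fromJSON, mc.cores = cores)
```

Since the data is already in memory at that point, the per-line parsing is the only work being parallelized, which is why the split step matters more than the read step.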