Is it possible to iterate over a single text file on a single multi-core machine in parallel with R? For context, the text file is somewhere between 250-400MB of JSON output.
EDIT:
Here are some code samples I have been playing around with. To my surprise, parallel processing did not win; plain lapply was fastest, though this could be due to user error on my part. In addition, when trying to read a number of large files, my machine choked.
## test on first 100 rows of 1 twitter file
library(rjson)
library(parallel)
library(foreach)
library(plyr)
library(rbenchmark)

N <- 100
mc.cores <- detectCores()

benchmark(lapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          llply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON),
          mclapply(readLines(FILE, n=N, warn=FALSE), fromJSON,
                   mc.cores=mc.cores),
          foreach(x=readLines(FILE, n=N, warn=FALSE)) %do% fromJSON(x),
          replications=100)
Here is a second code sample:
parseData <- function(x) {
  x <- tryCatch(fromJSON(x),
                error=function(e) return(list()))
  ## test whether the record is valid; if so, save it out
  if (!is.null(x$id_str)) {
    x$created_at <- strptime(x$created_at, "%a %b %e %H:%M:%S %z %Y")
    fname <- paste("rdata/",
                   format(x$created_at, "%m"),
                   format(x$created_at, "%d"),
                   format(x$created_at, "%Y"),
                   "_",
                   x$id_str,
                   sep="")
    saveRDS(x, fname)
    rm(x, fname)
    gc(verbose=FALSE)
  }
}
t3 <- system.time(lapply(readLines(FILES[1], n=-1, warn=FALSE), parseData))
The answer depends on what the problem actually is: reading the file in parallel, or processing the file in parallel.
You could split the JSON file into multiple input files and read them in parallel, e.g. using the plyr functions combined with a parallel backend:
result = ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)
Registering a backend can probably be done using the parallel package, which is now integrated in base R. Or you can use the doSNOW package; see this post on my blog for details.
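As a hedged sketch of the registration step (this assumes the doParallel package, a common foreach/plyr backend that builds on the base parallel package; the cluster size here is illustrative):

```r
## Sketch: register a parallel backend so plyr's .parallel = TRUE works.
## Assumes the doParallel package is installed.
library(doParallel)

cl <- makeCluster(2)        # or detectCores()
registerDoParallel(cl)

## plyr calls with .parallel = TRUE now run on the cluster workers, e.g.:
## result <- ldply(list.files(pattern = ".json"), readJSON, .parallel = TRUE)

stopCluster(cl)
```

Once a backend is registered, any plyr call with .parallel = TRUE (or a foreach %dopar% loop) is distributed over the workers; remember to stop the cluster when you are done.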
If processing the file is the bottleneck, your best bet is to read the entire dataset into a character vector, split the data, and then use a parallel backend combined with e.g. the plyr functions.
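A minimal, self-contained sketch of that approach (a tiny temporary file stands in for the 250-400MB Twitter dump; fromJSON and mclapply are as in the question):

```r
## Sketch: read the whole file once, then parse the lines in parallel.
## A tiny temporary file stands in for the real dump. mclapply forks,
## which is unavailable on Windows, so fall back to mc.cores = 1 there.
library(parallel)
library(rjson)

tmp <- tempfile(fileext = ".json")
writeLines(c('{"id_str":"1","text":"a"}',
             '{"id_str":"2","text":"b"}'), tmp)

lines  <- readLines(tmp, warn = FALSE)
cores  <- if (.Platform$OS.type == "windows") 1L else 2L
tweets <- mclapply(lines, fromJSON, mc.cores = cores)
```

Since the data is already in memory at that point, the per-line parsing is the only work being parallelized, which is why the split step matters more than the read step.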