
What is the quickest way to write out a large data frame as json in R?

Tags: json, r

I need to write a large data frame to a file as JSON in R. I am using the rjson package. The approach below is quite slow...

for (i in 1:nrow(df)) {
    write.table(toJSON(df[i,]), "[FILENAME]",
                row.names=FALSE, col.names=FALSE, quote=FALSE, append=TRUE)
}

So I tried this:

write.table(toJSON(df),"FILENAME]",
            row.names=FALSE,col.names=FALSE,quote=FALSE,append=TRUE)

Which chokes because toJSON() cannot handle very long strings. So I would like to write out chunks of my data frame at a time. What is the recommended approach for this? If it involves split(), could you provide some pseudocode?

asked Sep 20 '13 by Joe Cartano

2 Answers

Here's a big(ger) data set

big = iris[rep(seq_len(nrow(iris)), 1000),]
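so big has 150,000 rows:

> dim(big)
[1] 150000      5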

The for loop with toJSON(df[i,]) creates a flat file of key-value pairs representing each row, whereas toJSON(df) produces column vectors; these are very different. We aim for the equivalent of toJSON(df[i,]), but formatted as a single JSON string.
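To see the difference concretely, here is a small sketch (a toy two-column data frame, not from the original answer; note that rjson unboxes length-one vectors):

> df2 <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
> toJSON(df2[1, ])    # row-wise: one object of key-value pairs
[1] "{\"x\":1,\"y\":\"a\"}"
> toJSON(df2)         # column-wise: one vector per column
[1] "{\"x\":[1,2],\"y\":[\"a\",\"b\"]}"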

Start by munging big into a list-of-lists lol, with each inner element named (and the factor column converted to character, so as not to confuse the JSON conversion further down); lol looks like list(big[1,], big[2,], ...) but with names on each element.

## put the column name on each element of every column
big1 <- Map(function(x, nm) setNames(x, rep(nm, length(x))), big, names(big))
## factor -> character, as noted above
big1$Species <- as.character(big1$Species)
## 'transpose' into one named list per row
lol <- unname(do.call(Map, c(list, big1)))
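Converting a single element shows the per-row shape we are after (a sketch of the expected output for the first iris row):

> toJSON(lol[[1]])
[1] "{\"Sepal.Length\":5.1,\"Sepal.Width\":3.5,\"Petal.Length\":1.4,\"Petal.Width\":0.2,\"Species\":\"setosa\"}"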

Then we turn it into a vector-of-json, using the rjson library and splitIndices() from the parallel library (there are probably other ways of generating the split):

chunks <- 10
json <- sapply(splitIndices(length(lol), chunks), function(idx) toJSON(lol[idx]))
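splitIndices() simply partitions the indices into roughly equal contiguous blocks; a toy call (exact boundaries may vary) looks like:

> splitIndices(7, 3)
[[1]]
[1] 1 2

[[2]]
[1] 3 4 5

[[3]]
[1] 6 7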

We could almost write the json chunks to file, but they aren't quite legitimate: each chunk is a complete JSON array, so all but the last should end with "," rather than "]", and all but the first should have their leading "[" removed.

substring(json[-length(json)], nchar(json)[-length(json)]) <- ","
json[-1] <- substring(json[-1], 2)
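On a toy pair of chunks (a sketch, not part of the original answer) the stitching works like this:

> json <- c("[{\"x\":1},{\"x\":2}]", "[{\"x\":3}]")
> substring(json[-length(json)], nchar(json)[-length(json)]) <- ","
> json[-1] <- substring(json[-1], 2)
> json
[1] "[{\"x\":1},{\"x\":2}," "{\"x\":3}]"

writeLines() then joins these with newlines, which yields one valid JSON array.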

These are then ready to be written to file

fl <- tempfile()
writeLines(json, fl)
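As a sanity check (a sketch; rjson's fromJSON() can read straight from a file), the round trip should recover one element per row:

> length(fromJSON(file = fl))
[1] 150000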

Putting it all together (certainly many special cases of column-type coercion are still unhandled):

library(parallel)  ## just for splitIndices; no parallel processing here...
library(rjson)

fastJson <- function(df, fl, chunks = 10) {
    ## put the column name on each element of every column
    df1 <- Map(function(x, nm) setNames(x, rep(nm, length(x))), df, names(df))
    ## factors would confuse toJSON(); convert them to character
    df1 <- lapply(df1, function(x) {
        if (is(x, "factor")) as.character(x) else x
    })
    ## 'transpose' into one named list per row
    lol <- unname(do.call(Map, c(list, df1)))

    ## convert in chunks, then mend the chunk boundaries
    idx <- splitIndices(length(lol), chunks)
    json <- sapply(idx, function(i) toJSON(lol[i]))
    substring(json[-length(json)], nchar(json)[-length(json)]) <- ","
    json[-1] <- substring(json[-1], 2)
    writeLines(json, fl)
}

With

> fastJson(big, tempfile())
> system.time(fastJson(big, fl <- tempfile()))
   user  system elapsed 
  2.340   0.008   2.352 
> system(sprintf("wc %s", fl))
     10      10 14458011 /tmp/RtmpjLEh5h/file3fa75d00a57c

In contrast, just subsetting big row by row (without any conversion to JSON or writing to a file) takes a long time:

> system.time(for (i in seq_len(nrow(big))) big[i,])
   user  system elapsed 
 57.632   0.088  57.835 

Opening this file for appending, once per row, takes little time compared to the subsetting:

> system.time(for (i in seq_len(nrow(big))) { con <- file(fl, "a"); close(con) })
   user  system elapsed 
  2.320   0.580   2.919 
answered Oct 13 '22 by Martin Morgan


What makes your first approach extremely slow is that every time you call write.table the file is opened, the handle is moved to the end of the file, the data is written, and then the file is closed. It will be much faster if you open the file only once and use a file handle, like this:

fh <- file("[FILENAME]", "w")
for (i in 1:nrow(df)) {
    write.table(toJSON(df[i,]), fh,
                row.names = FALSE, col.names = FALSE, quote = FALSE)
}
close(fh)

I also removed the append = TRUE as it is implied (hence not necessary) when using a file connection.
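As a quick toy check (not part of the original answer) that successive writes to an open connection do accumulate:

> con <- file(tf <- tempfile(), "w")
> write.table("a", con, row.names = FALSE, col.names = FALSE, quote = FALSE)
> write.table("b", con, row.names = FALSE, col.names = FALSE, quote = FALSE)
> close(con)
> readLines(tf)
[1] "a" "b"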

answered Oct 13 '22 by flodel