I need to write a large data frame to a file as JSON in R. I am using the rjson package. The approach below is quite slow...
for (i in 1:nrow(df)) {
  write.table(toJSON(df[i,]), "[FILENAME]",
              row.names=FALSE, col.names=FALSE, quote=FALSE, append=TRUE)
}
So I tried this:
write.table(toJSON(df),"FILENAME]",
row.names=FALSE,col.names=FALSE,quote=FALSE,append=TRUE)
This chokes because toJSON() cannot handle such a very long string. So I would like to write out my data table a chunk at a time. What is the recommended approach for this? If it involves split(), could you provide some pseudocode?
Here's a big(ger) data set
big = iris[rep(seq_len(nrow(iris)), 1000),]
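Just to fix the scale in mind, the replication gives 150 x 1000 = 150,000 rows of the five iris columns:
dim(big)  ## 150000 rows, 5 columns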
The for loop with toJSON(df[i,]) creates a flat file of key-value pairs representing each row, whereas toJSON(df) produces column vectors; these are very different representations. We aim for the equivalent of toJSON(df[i,]), but formatted as a single JSON string.
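To see the difference on a small slice (the output in the comments is approximate, and the factor column is left out to keep the illustration simple):
library(rjson)
toJSON(iris[1, 1:4])    ## one object per row:  {"Sepal.Length":5.1,"Sepal.Width":3.5,...}
toJSON(iris[1:2, 1:4])  ## column vectors:      {"Sepal.Length":[5.1,4.9],"Sepal.Width":[3.5,3],...}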
Start by munging big into a list-of-lists lol, with each inner element named (and the factor converted to character, so as not to confuse the JSON conversion further down), so that lol looks like list(big[1,], big[2,], ...) but with names on each element.
big1 <- Map(function(x, nm) setNames(x, rep(nm, length(x))), big, names(big))
big1$Species <- as.character(big1$Species)
lol <- unname(do.call(Map, c(list, big1)))
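A quick sanity check of the intermediate structure (output not shown): lol should have one element per row of big, each carrying the column names, as described above.
length(lol)    ## one inner element per row of big
str(lol[[1]])  ## the first row as a named element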
Then we turn it into a vector of JSON strings, using the rjson library and splitIndices() provided by the parallel package (there are probably other ways of generating the split).
chunks <- 10
json <- sapply(splitIndices(length(lol), chunks), function(idx) toJSON(lol[idx]))
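Since the question asks specifically about split(), here is a base-R sketch that generates equivalent chunk indices (splitIndices() above does the same job; chunks and lol are as defined earlier):
grp <- rep(seq_len(chunks), each = ceiling(length(lol) / chunks), length.out = length(lol))
idx_list <- split(seq_along(lol), grp)
json <- sapply(idx_list, function(idx) toJSON(lol[idx]))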
We could almost write the json chunks to file, but they aren't quite legitimate: all but the last string should end with "," but instead end with "]", and all but the first should start with nothing but instead start with "[". Since substring<-() cannot delete characters, we overwrite the closing "]" with "," and the leading "[" with a space (insignificant whitespace in JSON):
substring(json[-length(json)], nchar(json)[-length(json)]) <- ","
substring(json[-1], 1, 1) <- " "
These are then ready to be written to file
fl <- tempfile()
writeLines(json, fl)
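As an extra check (not part of the original answer, and assuming the whitespace patching above), the concatenated lines should parse back with rjson's fromJSON() as one JSON array with a single element per row:
recs <- fromJSON(paste(readLines(fl), collapse = "\n"))
length(recs) == length(lol)  ## should be TRUE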
Combining these steps into a function (certainly many special cases of column type coercion remain unhandled):
library(parallel)  ## just for splitIndices; no parallel processing here...
library(rjson)

fastJson <- function(df, fl, chunks = 10) {
    ## give every element of each column its column's name
    df1 <- Map(function(x, nm) setNames(x, rep(nm, length(x))), df, names(df))
    ## convert factors to character, so as not to confuse the JSON conversion
    df1 <- lapply(df1, function(x) {
        if (is(x, "factor")) as.character(x) else x
    })
    ## 'transpose' to a list with one named element per row
    lol <- unname(do.call(Map, c(list, df1)))
    ## convert in chunks, then patch chunk boundaries so the
    ## concatenated lines form a single JSON array
    idx <- splitIndices(length(lol), chunks)
    json <- sapply(idx, function(i) toJSON(lol[i]))
    substring(json[-length(json)], nchar(json)[-length(json)]) <- ","
    substring(json[-1], 1, 1) <- " "
    writeLines(json, fl)
}
With
> fastJson(big, tempfile())
> system.time(fastJson(big, fl <- tempfile()))
user system elapsed
2.340 0.008 2.352
> system(sprintf("wc %s", fl))
10 10 14458011 /tmp/RtmpjLEh5h/file3fa75d00a57c
In contrast, just sub-setting big (without any parsing to JSON or writing to a file) takes a long time:
> system.time(for (i in seq_len(nrow(big))) big[i,])
user system elapsed
57.632 0.088 57.835
Opening this file for appending, once for each row, does not take much time compared to the sub-setting:
> system.time(for (i in seq_len(nrow(big))) { con <- file(fl, "a"); close(con) })
user system elapsed
2.320 0.580 2.919
What is extremely slow in your first approach is that every time you call write.table the file is opened, the handle moved to the end of the file, the data written, and then the file closed. It will be much faster if you open the file only once and use a file connection, like this:
fh <- file("[FILENAME]", "w")
for (i in 1:nrow(df)) {
write.table(toJSON(df[i,]), fh,
row.names = FALSE, col.names = FALSE, quote = FALSE)
}
close(fh)
I also removed append = TRUE, as it is implied (and hence not necessary) when writing to a file connection.
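A further variation (my own sketch, not part of the answer above): since toJSON() already returns a complete line of text for each row, writeLines() on the open connection does the same job without write.table()'s overhead:
fh <- file("[FILENAME]", "w")
for (i in 1:nrow(df)) {
  writeLines(toJSON(df[i,]), fh)
}
close(fh)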