I receive JSON files with data to be analyzed in R, for which I use the RJSONIO package:
library(RJSONIO)
filename <- "Indata.json"
jFile <- fromJSON(filename)
When the JSON files are larger than about 300 MB (uncompressed), my computer starts to use swap memory and keeps parsing (fromJSON) for hours. A 200 MB file takes only about a minute to parse.
I use R 2.14 (64-bit) on 64-bit Ubuntu with 16 GB RAM, so I'm surprised that swapping is needed already at about 300 MB of JSON.
What can I do to read big JSON files? Is there something in the memory settings that messes things up? I have restarted R and run only the three lines above. The JSON file contains 2-3 columns with short strings, and 10-20 columns with numbers from 0 to 1,000,000. That is, it is the number of rows that makes the file large (more than a million rows in the parsed data).
Update: From the comments I learned that rjson does more of its work in C, so I tried it. With RJSONIO, a 300 MB file drove memory use from a 6% baseline to 100% (according to the Ubuntu System Monitor) and went on to swapping; with the rjson package it needed only about 60% memory and the parsing finished in a reasonable time (minutes).
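For reference, a minimal sketch of the rjson variant; it assumes your version of rjson's fromJSON() accepts a file argument (if it does not, read the file into a single string first, as in the commented fallback):

library(rjson)
filename <- "Indata.json"
# Parse directly from the file path
jFile <- fromJSON(file = filename)
# Fallback if your rjson version lacks the file argument:
# jFile <- fromJSON(paste(readLines(filename, warn = FALSE), collapse = ""))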
Although your question doesn't specify this detail, you may want to make sure that loading the entire JSON in memory is actually what you want. It looks like RJSONIO is a DOM-based API.
What computation do you need to do? Can you use a streaming parser? An example of a SAX-like streaming parser for JSON is yajl.
Even though the question is very old, this might be of use for someone with a similar problem.
The function jsonlite::stream_in() lets you set pagesize to control the number of lines read at a time, and accepts a custom handler function that is applied to each such subset. This makes it possible to work with very large JSON files without reading everything into memory at once.
library(jsonlite)
con <- file("Indata.json", open = "r")  # stream_in() expects NDJSON, i.e. one JSON record per line
stream_in(con, pagesize = 5000, handler = function(df) {
  # Do something with the chunk of rows in 'df' here
})
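For example, if only per-chunk aggregates are needed, the handler can keep a running summary instead of the raw rows, so the full data set never sits in memory at once. A sketch under that assumption; the column name value is a placeholder for one of the numeric columns in your file:

library(jsonlite)
totals <- numeric(0)  # running per-chunk sums instead of the raw rows
con <- file("Indata.json", open = "r")
stream_in(con, pagesize = 5000, handler = function(df) {
  # 'value' is a hypothetical numeric column; replace it with one of yours
  totals <<- c(totals, sum(df$value, na.rm = TRUE))
})
grand_total <- sum(totals)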