 

How to read big json?

Tags: c++, json, c, r

I receive JSON files with data to be analyzed in R, for which I use the RJSONIO package:

library(RJSONIO)
filename <- "Indata.json"
jFile <- fromJSON(filename)

When the JSON files are larger than about 300 MB (uncompressed), my computer starts to use swap memory and the parsing (fromJSON) continues for hours. A 200 MB file takes only about one minute to parse.

I use R 2.14 (64-bit) on 64-bit Ubuntu with 16 GB RAM, so I'm surprised that swapping is needed already at about 300 MB of JSON.

What can I do to read big JSON files? Is there something in the memory settings that messes things up? I have restarted R and run only the three lines above. The JSON file contains 2-3 columns with short strings and 10-20 columns with numbers from 0 to 1,000,000; i.e. it is the number of rows that makes the file large (more than a million rows in the parsed data).


Update: From the comments I learned that rjson does more of its work in C, so I tried it. A 300 MB file that with RJSONIO reached 100% memory use (from a 6% baseline, according to the Ubuntu System Monitor) and went on to swap needed only 60% memory with the rjson package, and the parsing finished in a reasonable time (minutes).
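For reference, a minimal sketch of that switch, assuming the same Indata.json file; rjson::fromJSON() can take the path directly via its file argument:

library(rjson)
# rjson does most of its parsing in C; pass the path via the file argument
jFile <- fromJSON(file = "Indata.json")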

asked Nov 21 '11 by Chris



2 Answers

Although your question doesn't specify this detail, you may want to make sure that loading the entire JSON into memory is actually what you want. It looks like RJSONIO is a DOM-based API, so it builds the whole object in memory.

What computation do you need to do? Can you use a streaming parser? An example of a SAX-like streaming parser for JSON is yajl.
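For illustration, a hedged R sketch of the streaming idea without any yajl binding, assuming the data can be exported as line-delimited JSON (one record per line; the Indata.ndjson name is hypothetical):

con <- file("Indata.ndjson", open = "r")
repeat {
    batch <- readLines(con, n = 10000)          # read a manageable batch of records
    if (length(batch) == 0) break
    records <- lapply(batch, rjson::fromJSON)   # parse only this batch
    # summarise or aggregate `records` here, then let them be garbage collected
}
close(con)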

answered Sep 30 '22 by Will Bradley


Even though the question is very old, this might be of use for someone with a similar problem.

The function jsonlite::stream_in() lets you set pagesize, the number of lines read per iteration, and pass a handler function that is applied to each chunk. This makes it possible to work with very large JSON files without reading everything into memory at once. Note that stream_in() expects line-delimited JSON (NDJSON), i.e. one record per line, read through a connection.

library(jsonlite)
con <- file("Indata.ndjson")   # connection to the line-delimited JSON file (illustrative name)
stream_in(con, pagesize = 5000, handler = function(x) {
    # Do something with the current chunk of parsed rows here
})
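As a possible follow-up (a sketch only; the column-sum aggregation and the Indata.ndjson name are just placeholders), the handler can accumulate a small summary per chunk instead of keeping the full data:

library(jsonlite)
results <- list()
con <- file("Indata.ndjson")    # line-delimited JSON, one record per line
stream_in(con, pagesize = 5000, handler = function(df) {
    # keep only the column sums of the numeric columns of this chunk
    results[[length(results) + 1]] <<- colSums(Filter(is.numeric, df))
})
totals <- Reduce(`+`, results)  # combine the per-chunk summaries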
answered Sep 30 '22 by tobiasegli_te