I have a single JSON file of about 36 GB (coming from Wikidata) and I want to access it more efficiently. Currently I'm using RapidJSON's SAX-style API in C++, but parsing the whole file takes about 7415200 ms (≈120 minutes) on my machine. I want to access the JSON objects inside this file by one of two primary keys ('name' or 'entity-key', e.g. 'Stack Overflow' or 'Q549037') which are inside each JSON object. That means, in the worst case, I currently have to parse the whole file.
So I thought about two approaches:

- Building an index that maps each key to its ftell() position in the file. Building the index should take around 120 minutes (like parsing now), but accessing should be faster than that afterwards.
- Loading everything into a std::unordered_map (could run into memory problems again).

What is the best practice for a problem like this? Which approach should I follow? Any other ideas?
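A rough sketch of what the first approach might look like, assuming one map per key type and byte offsets recorded with ftell() during a single SAX pass (names and structure here are only illustrative):

```
#include <string>
#include <unordered_map>

// Sketch only: two indexes, one per primary key, each mapping to the byte
// offset where the corresponding JSON object starts in the 36 GB file.
// The offsets would be recorded with ftell() during one full SAX pass.
struct EntityIndex {
    std::unordered_map<std::string, long> by_name;        // "Stack Overflow" -> offset
    std::unordered_map<std::string, long> by_entity_key;  // "Q549037"        -> offset

    void add(const std::string& name, const std::string& key, long offset) {
        by_name[name] = offset;
        by_entity_key[key] = offset;
    }
};
```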
I think the performance problem is not due to parsing itself. Using RapidJSON's SAX API should already give good performance and be memory-friendly. If you need to access every value in the JSON, this may already be the best solution.
However, from the question description, it seems that reading all values at once is not your requirement. You want to read a (probably small) number of values matching particular criteria (e.g., by primary key). Reading/parsing everything is not suitable for that case.
You will need some indexing mechanism. Doing that with file positions may be possible. If the data at each position is also valid JSON, you can seek to it and stream it into RapidJSON to parse just that value (RapidJSON can stop parsing once a complete JSON value has been parsed, via kParseStopWhenDoneFlag).
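A minimal sketch of that lookup step, assuming the offset comes from an index built beforehand (index construction is omitted here):

```
#include <cstdio>
#include "rapidjson/document.h"
#include "rapidjson/filereadstream.h"

// Sketch: parse exactly one JSON value that starts at byte `offset` of `path`.
// kParseStopWhenDoneFlag tells RapidJSON to stop after one complete value
// instead of expecting the stream to end there.
// Note: for a 36 GB file you would need 64-bit offsets (fseeko/_fseeki64).
bool ParseValueAt(const char* path, long offset, rapidjson::Document& out) {
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return false;
    std::fseek(fp, offset, SEEK_SET);

    char buffer[65536];
    rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));
    out.ParseStream<rapidjson::kParseStopWhenDoneFlag>(is);

    std::fclose(fp);
    return !out.HasParseError();
}
```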
Another option is converting the JSON into some kind of database: an SQL database, a key-value store, or a custom format. With the indexing facilities they provide, you should be able to query the data quickly. The conversion may take a long time, but retrieval afterwards will be fast.
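As an illustration (not part of the original answer), a hypothetical SQLite layout could store each object once and make it queryable by either key; the schema and a lookup might look roughly like this:

```
#include <sqlite3.h>
#include <string>

// Hypothetical layout: one row per Wikidata object, queryable by either key.
// The raw JSON text (or just its file offset) is stored as the payload.
void CreateSchema(sqlite3* db) {
    const char* ddl =
        "CREATE TABLE IF NOT EXISTS entities ("
        "  entity_key TEXT PRIMARY KEY,"   // e.g. 'Q549037'
        "  name       TEXT,"               // e.g. 'Stack Overflow'
        "  json       TEXT);"
        "CREATE INDEX IF NOT EXISTS idx_name ON entities(name);";
    sqlite3_exec(db, ddl, nullptr, nullptr, nullptr);
}

// Look up the stored JSON by entity key; returns an empty string if not found.
std::string LookupByKey(sqlite3* db, const std::string& key) {
    sqlite3_stmt* stmt = nullptr;
    std::string json;
    sqlite3_prepare_v2(db, "SELECT json FROM entities WHERE entity_key = ?1;",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, key.c_str(), -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        json = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
    }
    sqlite3_finalize(stmt);
    return json;
}
```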
Note that JSON is an exchange format. It was not designed for fast individual queries on big data.
Update: Recently I found the project semi-index, which may suit your needs.
Write your own JSON parser, minimizing allocations and data movement. Also, ditch multi-character encodings for straight ANSI. I once wrote an XML parser to parse 4 GB XML files. I tried MSXML and Xerces; both had minor memory leaks that, when used on that much data, would actually run out of memory. My parser would stop memory allocations once it reached the maximum nesting level.
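To illustrate that last point (this is not the original parser, just a sketch under the assumption of a chosen maximum depth), a hand-rolled parser can cap allocations by tracking nesting in a fixed-size stack:

```
#include <cstddef>

// Sketch: a fixed-capacity nesting stack. Opening and closing containers
// never allocates; exceeding kMaxDepth is reported as an error instead.
constexpr std::size_t kMaxDepth = 64;  // assumed limit, tune for your data

struct NestingStack {
    char frames[kMaxDepth];  // '{' or '[' for each open container
    std::size_t depth = 0;

    bool push(char open) {   // false = depth limit reached, no allocation made
        if (depth == kMaxDepth) return false;
        frames[depth++] = open;
        return true;
    }
    bool pop(char open) {    // false = mismatched container or underflow
        return depth > 0 && frames[--depth] == open;
    }
};
```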