 

How to parse bigdata json file (wikidata) in C++ efficiently?

I have a single JSON file of about 36 GB (coming from Wikidata) and I want to access it more efficiently. Currently I'm using RapidJSON's SAX-style API in C++, but parsing the whole file takes about 7415200 ms (≈ 120 minutes) on my machine. I want to access the JSON objects inside this file by one of two primary keys ('name' or 'entity-key', i.e. 'Stack Overflow' or 'Q549037'), which are inside each JSON object. That means in the worst case I currently have to parse the whole file.

So I thought about two approaches:

  • splitting the big file into billions of small files, with a filename that encodes the name/entity-key (i.e. Q549037.json / Stack_Overflow.json or Q549037#Stack_Overflow.json) -> not sure about the storage overhead
  • building some kind of index from the primary keys to the ftell() position in the file. Building the index should take around 120 minutes (like parsing now), but accessing should then be faster (see the sketch after this list)
    • i.e. use something like two std::unordered_map (could run into memory problems again)
    • index files - create two files: one with entries sorted by name and one sorted by entity-key (creating these files will probably take much longer, because of the sorting)
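For the index approach, a minimal sketch of what I have in mind, assuming (as in the Wikidata dumps) that each entity sits on its own line; ExtractEntityKey() is just a stand-in helper, not real code I have:

    // Sketch: record the byte offset of every line and map the entity-key to it.
    #include <fstream>
    #include <string>
    #include <unordered_map>

    // Hypothetical helper: find the first "id":"Q..." occurrence in the line.
    static std::string ExtractEntityKey(const std::string& line) {
        const std::string tag = "\"id\":\"";
        std::size_t pos = line.find(tag);
        if (pos == std::string::npos) return {};
        std::size_t start = pos + tag.size();
        std::size_t end = line.find('"', start);
        return line.substr(start, end - start);
    }

    int main() {
        std::ifstream in("wikidata.json", std::ios::binary);
        std::unordered_map<std::string, std::streampos> index;  // "Q549037" -> byte offset

        std::string line;
        std::streampos offset = in.tellg();          // start of the current line
        while (std::getline(in, line)) {
            std::string key = ExtractEntityKey(line);
            if (!key.empty()) index[key] = offset;   // remember where this entity starts
            offset = in.tellg();                     // start of the next line
        }
        // 'index' (plus a second map keyed by name) could be written to disk so the
        // expensive full pass only has to run once.
    }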

What is the best-practice for a problem like this? Which approach should I follow? Any other ideas?

asked Feb 08 '15 by Constantin


2 Answers

I think the performance problem is not due to parsing. Using RapidJSON's SAX API should already give good performance and be memory-friendly. If you need to access every value in the JSON, this may already be the best solution.

However, from the question description, it seems that reading all values at once is not your requirement. You want to read some (probably a small number of) values matching particular criteria (e.g., by primary key). Reading/parsing everything is not suitable for that case.

You will need some indexing mechanism. Doing that with file positions may be possible. If the data at those positions is also valid JSON, you can seek there and stream it to RapidJSON to parse just that JSON value (RapidJSON can stop parsing when a complete JSON value has been parsed, via kParseStopWhenDoneFlag).
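For example, a minimal sketch of that seek-and-parse step (the file name and the offset value are assumptions; the offset would come from your previously built index):

    #include <cstdio>
    #include "rapidjson/document.h"
    #include "rapidjson/filereadstream.h"

    // Parse only the JSON value starting at 'offset'. kParseStopWhenDoneFlag makes
    // RapidJSON stop after one complete value instead of expecting the stream to end.
    bool LoadValueAt(const char* path, long offset, rapidjson::Document& out) {
        std::FILE* fp = std::fopen(path, "rb");
        if (!fp) return false;
        // std::fseek takes a long; for a 36 GB file use fseeko/_fseeki64 in real code.
        std::fseek(fp, offset, SEEK_SET);

        char buffer[64 * 1024];
        rapidjson::FileReadStream is(fp, buffer, sizeof(buffer));
        out.ParseStream<rapidjson::kParseStopWhenDoneFlag>(is);

        std::fclose(fp);
        return !out.HasParseError();
    }

    int main() {
        rapidjson::Document d;
        // 1234567L stands in for an offset looked up in the index by "Q549037".
        if (LoadValueAt("wikidata.json", 1234567L, d) && d.HasMember("id"))
            std::printf("loaded %s\n", d["id"].GetString());
    }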

Another option is converting the JSON into some kind of database: a SQL database, a key-value database, or a custom one. With the indexing facilities they provide, you can query the data quickly. The conversion may take a long time, but it gives good performance for later retrieval.
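As a sketch of that, assuming SQLite: store each entity as one row with both keys indexed, so later lookups become a single indexed query (table and column names here are made up for illustration):

    #include <sqlite3.h>
    #include <string>

    // Insert one entity; 'json' is the raw JSON text of that entity.
    void StoreEntity(sqlite3* db, const std::string& key,
                     const std::string& name, const std::string& json) {
        static const char* sql =
            "INSERT INTO entities (entity_key, name, body) VALUES (?1, ?2, ?3);";
        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);
        sqlite3_bind_text(stmt, 1, key.c_str(),  -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, name.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 3, json.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }

    int main() {
        sqlite3* db = nullptr;
        sqlite3_open("wikidata.db", &db);
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS entities ("
            "  entity_key TEXT, name TEXT, body TEXT);"
            "CREATE INDEX IF NOT EXISTS idx_key  ON entities(entity_key);"
            "CREATE INDEX IF NOT EXISTS idx_name ON entities(name);",
            nullptr, nullptr, nullptr);
        // ... stream the 36 GB file once with the SAX parser and call
        //     StoreEntity() for each top-level object ...
        sqlite3_close(db);
    }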

Note that JSON is an exchange format. It was not designed for fast individual queries on big data.


Update: I recently found the semi-index project, which may suit your needs.

answered Nov 12 '22 by Milo Yip


Write your own JSON parser, minimizing allocations and data movement. Also ditch multi-byte/wide characters for straight ANSI. I once wrote an XML parser to parse 4 GB XML files. I tried MSXML and Xerces; both had minor memory leaks that, when used on that much data, would actually run out of memory. My parser would stop memory allocations once it reached the maximum nesting level.

answered Nov 12 '22 by user2433030