
High performance XML parsing in C++

Tags: c++, parsing, xml

Plenty of questions have been asked about parsing XML in C++, but mine is specific rather than generic.

I am looking for a very efficient XML parser for C++. I have a very large XML file to parse: my application must open it, retrieve data, insert new nodes, and save the result back to the same file.

I started with rapidxml, but it requires me to load and parse the entire file (the library has no way to access the file without first building the whole tree), edit the tree in memory, and then write the whole tree back, overwriting the file. This consumes too many resources.

Is there an XML parser that does not require loading the entire file, but still lets me quickly insert new nodes and retrieve data? What solutions would you suggest?

Andry asked Jan 12 '11 20:01




4 Answers

If you really need a high-performance XML stream parser, then libhpxml is likely the right fit for you.

Rahra answered Sep 28 '22 06:09


You want a streaming XML parser rather than what is called a DOM parser.

There are two types of streaming parsers: pull and push. A pull parser is good for quickly writing XML parsers that load data into program memory. A push parser is good for writing a program to translate one document to another (which is what you are trying to accomplish). I think, therefore, that a push parser would be best for your problem.
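To make the distinction concrete, here is a minimal sketch of the pull style, where the caller drives the loop and asks for the next token. `PullTokenizer`, `Token`, and `countStartTags` are hypothetical names invented for illustration, not the API of any real library, and the tokenizer is a toy: it ignores attributes, comments, CDATA, entities, and self-closing tags.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

enum class TokenKind { StartTag, EndTag, Text };

struct Token {
    TokenKind kind = TokenKind::Text;
    std::string value;  // tag name (with any attributes) or text content
};

// Toy pull tokenizer (hypothetical): the caller asks for one token at a time.
class PullTokenizer {
public:
    explicit PullTokenizer(std::istream& in) : in_(in) {}

    // Fills `tok` with the next token; returns false at end of input.
    bool next(Token& tok) {
        const int eof = std::char_traits<char>::eof();
        int c = in_.peek();
        if (c == eof) return false;
        tok.value.clear();
        if (c == '<') {
            in_.get();  // consume '<'
            char ch;
            while (in_.get(ch) && ch != '>') tok.value += ch;
            if (!tok.value.empty() && tok.value[0] == '/') {
                tok.kind = TokenKind::EndTag;
                tok.value.erase(0, 1);  // drop the leading '/'
            } else {
                tok.kind = TokenKind::StartTag;
            }
        } else {
            tok.kind = TokenKind::Text;
            while (true) {
                int p = in_.peek();
                if (p == eof || p == '<') break;
                tok.value += static_cast<char>(in_.get());
            }
        }
        return true;
    }

private:
    std::istream& in_;
};

// The consumer owns the loop: only one token is held in memory at a time.
std::size_t countStartTags(std::istream& in, const std::string& name) {
    PullTokenizer tokenizer(in);
    Token tok;
    std::size_t count = 0;
    while (tokenizer.next(tok)) {
        if (tok.kind == TokenKind::StartTag && tok.value == name) ++count;
    }
    return count;
}
```

With a push parser the control flow is inverted: the parser owns the loop and calls your code for each event.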

In order to use a push parser, you need to write what is essentially an event handler for parsing events. By "parsing event", I mean events like "start tag reached", "end tag reached", "text found", "attribute parsed", etc.
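As a rough illustration of what such an event handler can look like, here is a self-contained sketch. The `XmlHandler` interface, `pushParse`, `EventLog`, and `collectEvents` are hypothetical names made up for this example; real libraries expose their own callback types. The tokenizer is deliberately a toy (no attributes, comments, CDATA, or entities) and only exists to show the parser pushing events into your code.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical handler interface: the parser "pushes" events into it.
struct XmlHandler {
    virtual ~XmlHandler() = default;
    virtual void onStartTag(const std::string& name) = 0;
    virtual void onEndTag(const std::string& name) = 0;
    virtual void onText(const std::string& text) = 0;
};

// Toy push tokenizer: reads the stream once, front to back, and never
// builds a tree. Only the current tag or text run is held in memory.
void pushParse(std::istream& in, XmlHandler& handler) {
    std::string text;
    char c;
    while (in.get(c)) {
        if (c != '<') {          // ordinary character data
            text += c;
            continue;
        }
        if (!text.empty()) {     // flush pending text before the tag event
            handler.onText(text);
            text.clear();
        }
        std::string tag;
        while (in.get(c) && c != '>') tag += c;
        if (tag.empty()) continue;
        if (tag[0] == '/') handler.onEndTag(tag.substr(1));
        else handler.onStartTag(tag);
    }
    if (!text.empty()) handler.onText(text);
}

// Example handler: records each event as a string.
struct EventLog : XmlHandler {
    std::vector<std::string> events;
    void onStartTag(const std::string& name) override { events.push_back("start:" + name); }
    void onEndTag(const std::string& name) override { events.push_back("end:" + name); }
    void onText(const std::string& text) override { events.push_back("text:" + text); }
};

// Convenience wrapper for trying the handler on an in-memory string.
std::vector<std::string> collectEvents(const std::string& xml) {
    std::istringstream in(xml);
    EventLog log;
    pushParse(in, log);
    return log.events;
}
```

In a real program the handler would hold whatever state it needs (for example, the current element path) instead of logging strings.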

I suggest that as you read in the document, you write out the transformed document to a separate, temporary file. Thus, your XML parsing event handlers will need to be written so that they are stateful and write out the XML of the translated document incrementally.
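The following sketch shows one way this could look, under stated assumptions: the input is copied to the output piece by piece, and a new node is written just before the first closing tag of a chosen parent element. `copyWithInsert` and `insertIntoFile` are hypothetical helpers, the `.tmp` suffix is an arbitrary choice, and the tag scanning is a toy that assumes no `<` or `>` inside attribute values, comments, or CDATA. A real implementation would put the same logic inside the handlers of a proper parser.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <istream>
#include <ostream>
#include <sstream>  // only needed for in-memory streams, e.g. when testing
#include <string>

// Copies XML from `in` to `out` incrementally, writing `newNodeXml` just
// before the first closing tag named `parentName`.
// Returns true if the node was inserted.
bool copyWithInsert(std::istream& in, std::ostream& out,
                    const std::string& parentName,
                    const std::string& newNodeXml) {
    bool inserted = false;  // state carried across parsing events
    const std::string closing = "/" + parentName;
    char c;
    while (in.get(c)) {
        if (c != '<') {      // text: pass straight through
            out.put(c);
            continue;
        }
        std::string tag;
        bool closed = false;
        while (in.get(c)) {
            if (c == '>') { closed = true; break; }
            tag += c;
        }
        if (closed && !inserted && tag == closing) {
            out << newNodeXml;  // emit the new node before </parentName>
            inserted = true;
        }
        out << '<' << tag;
        if (closed) out << '>';
    }
    return inserted;
}

// Streams `path` into a temporary file, then replaces the original.
// Assumption: std::rename overwrites an existing target, which holds on
// POSIX systems but not on every platform.
bool insertIntoFile(const std::string& path,
                    const std::string& parentName,
                    const std::string& newNodeXml) {
    const std::string tmp = path + ".tmp";  // hypothetical temp-file name
    bool ok = false;
    {
        std::ifstream in(path, std::ios::binary);
        if (!in) return false;
        std::ofstream out(tmp, std::ios::binary);
        if (!out) return false;
        ok = copyWithInsert(in, out, parentName, newNodeXml);
        out.flush();
        ok = ok && out.good();
    }  // both files are closed here, before the rename
    if (!ok) {
        std::remove(tmp.c_str());
        return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

Memory use stays bounded by the largest single tag rather than by the size of the file, at the cost of rewriting the whole file on every update.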

Three excellent push parser libraries you can use from C++ are Expat, Xerces-C++, and libxml2 (Expat and libxml2 are C libraries, so they can be called directly from C++).

Daniel Trebbien answered Sep 28 '22 04:09


Search for "SAX parser". They are mostly tokenizers, i.e. they emit tag by tag without building a tree.

Eugene Mayevski 'Callback answered Sep 28 '22 05:09


SAX parsers are generally faster and lighter on memory than DOM parsers because a DOM parser must read the whole file and build an in-memory representation of the entire XML document before you can use it, whereas a SAX parser behaves like an event source: it reports each element to your handlers as it reads the file and never builds the document tree. Go here for an explanation.

As you mentioned, Xerces is a good C++ SAX parser.

I would recommend looking into ways of breaking the XML document into smaller XML documents as that seems to be part of your problem.

David Weiser answered Sep 28 '22 04:09