 

Parse an XML in standard C/C++ without additional libraries

Tags: c++, c, parsing, xml

I have an XML (assuming it is valid) and I must parse it and store it in a tree.

What is the best approach to parse it, without using other libraries, just basic manipulation of strings?

Keep in mind that I don't have to validate it, just parse and memorize it into a tree.

Andrei asked Apr 01 '26 12:04


2 Answers

The basic structure of XML is quite simple:

<tagname [attribute[="value"] ...]>content</tagname>

where the content may contain both normal text and more XML structures, or the special form

<tagname [attribute[="value"] ...]/>

which is equivalent to

<tagname [attribute[="value"] ...]></tagname>

that is, empty content.

So if you don't need to interpret a DTD or do other fancy things, you can do the following:

  1. Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.

  2. Now follows the tag name, until the first whitespace, or the / or the > character. Store that.

  3. If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML, and can exit with an error.

  4. If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.

  5. Otherwise what follows is an attribute. Parse that, store the result, and continue at step 3.

  6. Read the content until you find a < character.

  7. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.

  8. If you get here, you've found the beginning of a nested XML. Parse that with this algorithm, and then continue at 6.
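The eight steps above can be sketched in C++ roughly as follows. This is only an illustration, not the answerer's code: `Node`, `Parser`, and `parseXml` are names invented here. It assumes the input is a single well-formed element with no XML declaration, comments, CDATA sections, or entity references, and it collects an element's text into one string, so the position of text relative to child elements is lost.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node; the layout is an assumption made for this sketch.
struct Node {
    std::string name;
    std::vector<std::pair<std::string, std::string>> attributes;
    std::string text;                              // all character data, concatenated
    std::vector<std::unique_ptr<Node>> children;
};

class Parser {
public:
    explicit Parser(std::string input) : s_(std::move(input)) {}
    std::unique_ptr<Node> parseDocument() { skipSpace(); return parseElement(); }

private:
    std::string s_;
    std::size_t i_ = 0;

    [[noreturn]] void fail(const char* msg) const {
        throw std::runtime_error(std::string(msg) + " at offset " + std::to_string(i_));
    }
    bool eof() const { return i_ >= s_.size(); }
    char peek() const { return eof() ? '\0' : s_[i_]; }
    void skipSpace() {
        while (!eof() && std::isspace(static_cast<unsigned char>(s_[i_]))) ++i_;
    }
    void expect(char c) { if (peek() != c) fail("unexpected character"); ++i_; }

    // Reads up to whitespace, '/', '>' or '=' (step 2; also used for attribute names).
    std::string readName() {
        std::size_t start = i_;
        while (!eof() && !std::isspace(static_cast<unsigned char>(s_[i_])) &&
               s_[i_] != '/' && s_[i_] != '>' && s_[i_] != '=') ++i_;
        if (i_ == start) fail("expected a name");
        return s_.substr(start, i_ - start);
    }

    std::unique_ptr<Node> parseElement() {
        expect('<');                                   // step 1
        auto node = std::make_unique<Node>();
        node->name = readName();                       // step 2
        for (;;) {
            skipSpace();
            if (peek() == '/') {                       // step 3: <tag ... />
                ++i_;
                expect('>');
                return node;
            }
            if (peek() == '>') { ++i_; break; }        // step 4: end of start tag
            if (eof()) fail("unterminated start tag");
            std::string key = readName();              // step 5: attribute
            std::string value;
            skipSpace();
            if (peek() == '=') {
                ++i_;
                skipSpace();
                char quote = peek();
                if (quote != '"' && quote != '\'') fail("expected quoted value");
                ++i_;
                std::size_t end = s_.find(quote, i_);
                if (end == std::string::npos) fail("unterminated attribute value");
                value = s_.substr(i_, end - i_);
                i_ = end + 1;
            }
            node->attributes.emplace_back(std::move(key), std::move(value));
        }
        for (;;) {
            std::size_t lt = s_.find('<', i_);         // step 6: content up to '<'
            if (lt == std::string::npos) fail("missing end tag");
            node->text.append(s_, i_, lt - i_);
            i_ = lt;
            if (i_ + 1 < s_.size() && s_[i_ + 1] == '/') {   // step 7: end tag
                i_ += 2;
                if (readName() != node->name) fail("mismatched end tag");
                skipSpace();
                expect('>');
                return node;
            }
            node->children.push_back(parseElement());  // step 8: nested element
        }
    }
};

// Convenience wrapper (hypothetical name).
std::unique_ptr<Node> parseXml(const std::string& text) {
    return Parser(text).parseDocument();
}
```

Calling `parseXml("<a x=\"1\"><b/>hi</a>")` would give a root node named `a` with one attribute, one child `b`, and the text `hi`. Errors are reported by throwing `std::runtime_error`, which is one choice among several; returning an error code would work just as well in plain C.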

celtschk answered Apr 04 '26 03:04


Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:

  • whether you validate or not, you need to deal with entity references like &lt; and the variety of character entity references like &#65; and &#xa;
  • the plain body of an XML document is relatively simple, but the header is a major pain to deal with, in particular the DTD: there are two versions thereof which are slightly different and you probably need to process the inline DTD
  • even the body isn't entirely trivial because of these annoying character data (CDATA) sections
  • even without validation you may need to support external entity references
  • the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
  • note that XML is defined in terms of Unicode and proper handling of this isn't entirely trivial either: simply using char or wchar_t doesn't cut it.
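To give a feel for the first bullet, here is a minimal sketch of decoding the five predefined entities and numeric character references into UTF-8. The function names `appendUtf8` and `decodeReferences` are made up for this example, and it assumes the input and output are UTF-8 byte strings. It deliberately does not handle entities declared in a DTD or external entities, and it checks only the code point range and surrogates rather than the full set of characters XML allows.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Append a code point as UTF-8. Assumes cp <= 0x10FFFF and not a surrogate.
void appendUtf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Parse the part after '#' in a character reference, e.g. "65" or "xa".
unsigned long parseCodePoint(const std::string& ref) {
    const bool hex = ref.size() > 1 && ref[1] == 'x';
    std::size_t k = hex ? 2 : 1;
    if (k >= ref.size()) throw std::runtime_error("empty character reference");
    unsigned long cp = 0;
    for (; k < ref.size(); ++k) {
        const char c = ref[k];
        unsigned long digit;
        if (c >= '0' && c <= '9') digit = static_cast<unsigned long>(c - '0');
        else if (hex && c >= 'a' && c <= 'f') digit = static_cast<unsigned long>(c - 'a' + 10);
        else if (hex && c >= 'A' && c <= 'F') digit = static_cast<unsigned long>(c - 'A' + 10);
        else throw std::runtime_error("bad digit in character reference");
        cp = cp * (hex ? 16 : 10) + digit;
        if (cp > 0x10FFFF) throw std::runtime_error("character reference out of range");
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) throw std::runtime_error("surrogate code point");
    return cp;
}

// Replace &lt; &gt; &amp; &quot; &apos; and &#N; / &#xN; with their characters.
std::string decodeReferences(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '&') { out += in[i++]; continue; }
        const std::size_t semi = in.find(';', i);
        if (semi == std::string::npos) throw std::runtime_error("unterminated reference");
        const std::string ref = in.substr(i + 1, semi - i - 1);
        if (ref == "lt") out += '<';
        else if (ref == "gt") out += '>';
        else if (ref == "amp") out += '&';
        else if (ref == "quot") out += '"';
        else if (ref == "apos") out += '\'';
        else if (!ref.empty() && ref[0] == '#') appendUtf8(out, parseCodePoint(ref));
        else throw std::runtime_error("unknown entity: " + ref);
        i = semi + 1;
    }
    return out;
}
```

With this sketch, `decodeReferences("&#65;&#xa;")` yields `A` followed by a line feed. Even this small piece has to decide on an output encoding, which is where the Unicode point in the last bullet starts to bite.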

The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator user. Unfortunately, I didn't get it to fly when trying to cope with the various entity references. It would parse simple XML files nice and fast, but there were some quirks in the specification I just didn't get right.

What worked best for me was creating a simple recursive descent parser combined with a suitable stack of buffers to somewhat transparently deal with entity references. However, to finish this completely I still need to deal with some encoding issues and in the end I just had higher priority projects to work on (in my spare time, that is).

In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.

Dietmar Kühl answered Apr 04 '26 03:04


