
How to create/write a simple XML parser from scratch?

Tags: xml, xml-parsing, d


Rather than code samples, I want to know the simplified, basic steps in plain English.

How is a good parser designed? I understand that regex should not be used in a parser, but how large a role does regex play in parsing XML?

What is the recommended data structure to use? Should I use linked lists to store and retrieve nodes, attributes, and values?

I want to learn how to create an XML parser so that I can write one in the D programming language.

XP1 asked Jun 04 '11 22:06


2 Answers

If you don't know how to write a parser, then you need to do some reading. Get hold of any book on compiler-writing (many of the best ones were written 30 or 40 years ago, e.g. Aho and Ullman) and study the chapters on lexical analysis and syntax analysis. XML is essentially no different, except that the lexical and grammar phases are not as clearly isolated from each other as in some languages.
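As a rough illustration of the lexical-analysis phase, here is a minimal sketch in D. The names `Token`, `TokenKind` and `tokenize` are made up for this example, and it assumes input with no comments, CDATA sections, processing instructions or entities. It only splits the input into tag and text tokens by scanning for `<` and `>`, with no regex involved; a syntax phase would then check how those tokens nest.

```d
import std.string : indexOf;

enum TokenKind { openTag, closeTag, text }

struct Token
{
    TokenKind kind;
    string value;
}

/// Lexical phase only: splits the input into tag and text tokens.
/// A self-closing tag such as <br/> comes out as an openTag whose value ends in '/'.
Token[] tokenize(string input)
{
    Token[] tokens;
    while (input.length > 0)
    {
        if (input[0] == '<')
        {
            // Everything up to the next '>' belongs to this tag.
            auto end = input.indexOf('>');
            if (end < 0)
                throw new Exception("unterminated tag");
            string inner = input[1 .. end];
            if (inner.length > 0 && inner[0] == '/')
                tokens ~= Token(TokenKind.closeTag, inner[1 .. $]);
            else
                tokens ~= Token(TokenKind.openTag, inner);
            input = input[end + 1 .. $];
        }
        else
        {
            // Everything up to the next '<' (or the end of input) is text.
            auto next = input.indexOf('<');
            if (next < 0)
                next = input.length;
            tokens ~= Token(TokenKind.text, input[0 .. next]);
            input = input[next .. $];
        }
    }
    return tokens;
}
```

For the input `<a>hi</a>` this produces an open-tag token `a`, a text token `hi` and a close-tag token `a`.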

One word of warning: if you want to write a fully-conformant XML parser, then 90% of your effort will be spent getting edge cases right in obscure corners of the spec, dealing with things such as parameter entities that most XML users aren't even aware of.

Michael Kay answered Sep 18 '22 17:09


For an event-based parser, the user needs to pass it some functions (startNode(name, attrs), endNode(name) and someText(txt), most likely through an interface), which the parser calls as needed while it passes over the file.

The parser has a while loop that alternates between reading until < and reading until >, converting what it reads to the parameter types the callbacks expect.
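The callback interface might look something like the sketch below. The method names come from the answer; the interface layout and the `RecordingParser` class are assumptions added for illustration, not part of the original code.

```d
import std.format : format;

/// Callbacks the parser invokes while it walks the input.
interface EventParser
{
    void startNode(string name, string[string] attrs);
    void endNode(string name);
    void someText(string txt);
}

/// Hypothetical handler that records each event as a line of text.
class RecordingParser : EventParser
{
    string[] events;

    override void startNode(string name, string[string] attrs)
    {
        events ~= format("start %s (%s attrs)", name, attrs.length);
    }

    override void endNode(string name)
    {
        events ~= "end " ~ name;
    }

    override void someText(string txt)
    {
        events ~= "text " ~ txt;
    }
}
```

A handler like this is handy for checking the order of the events before building anything more elaborate on top of them.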

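    import std.array : split;
    import std.stdio : File;
    import std.string : chomp;

    void parse(EventParser p, File file){
        string str;
        // read up to and including the next '<'; whatever precedes it is text
        while((str = file.readln('<')).length != 0){
            // not using a rewritable buffer, to take advantage of slicing,
            // but converting to a rewritable buffer is a quick change
            string text = str.chomp("<");
            if(text.length > 0) p.someText(text);

            // read the tag itself, up to and including the closing '>'
            str = file.readln('>').chomp(">");
            if(str.length == 0) break; // end of input, or an empty tag

            // strip the '/' markers so they don't end up in the name
            bool closing = str[0] == '/';
            bool selfClosing = !closing && str[$-1] == '/';
            if(closing) str = str[1..$];
            else if(selfClosing) str = str[0..$-1];

            // split str into name and attrs
            auto parts = str.split();
            if(parts.length == 0) continue;
            string name = parts[0];
            string[string] attrs;
            foreach(attribute; parts[1..$]){
                auto splitAttr = attribute.split("=");
                if(splitAttr.length == 2) attrs[splitAttr[0]] = splitAttr[1];
            }

            if(closing) p.endNode(name);
            else {
                p.startNode(name, attrs);
                if(selfClosing) p.endNode(name); // self-closing tag
            }
        }
    }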

You can build a DOM parser on top of an event-based parser. The basic functionality each node needs is getChildren, getParent, getName and getAttributes (plus setters for use while building the tree ;) ).
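A minimal sketch of such node classes, assuming plain public fields instead of getters and setters. The class names match the ones used by the DOM parser in this answer, but their layout here is a guess. A dynamic array of children is usually enough, so no linked list is needed.

```d
/// Base class: every node knows its parent, its children and its name.
class DOMNode
{
    DOMNode parent;
    DOMNode[] children;
    string name;

    /// Adds a child and points it back at this node.
    void appendChild(DOMNode child)
    {
        child.parent = this;
        children ~= child;
    }
}

/// Placeholder node that sits above the document's root element.
class RootNode : DOMNode
{
    this() { name = "#root"; }
}

/// An element with a name and an attribute table.
class ElementNode : DOMNode
{
    string[string] attrs;

    this(DOMNode parent, string name, string[string] attrs)
    {
        this.parent = parent;
        this.name = name;
        this.attrs = attrs;
    }
}

/// A run of character data between tags.
class TextNode : DOMNode
{
    string text;

    this(string text)
    {
        this.text = text;
        name = "#text";
    }
}
```

The `#root` and `#text` names are placeholders chosen for this sketch.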

The object for the DOM parser, using the methods described above:

    class DOMEventParser : EventParser{
        DOMNode current;

        this(){
            // create the root in the constructor so each parser gets its own tree
            current = new RootNode();
        }

        override void startNode(string name, string[string] attrs){
            // descend: the new element becomes the current node
            DOMNode tmp = new ElementNode(current, name, attrs);
            current.appendChild(tmp);
            current = tmp;
        }

        override void endNode(string name){
            // the closing tag must match the element being built
            assert(name == current.name);
            current = current.parent;
        }

        override void someText(string txt){
            current.appendChild(new TextNode(txt));
        }
    }

When parsing ends, the root node will hold the root element of the DOM tree.

Note: I didn't put any verification code in there to ensure the XML is correct.

Edit: the attribute parsing has a bug: it splits on whitespace, which breaks on quoted values that contain spaces. A regex is better for that.
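One way the regex approach could look, as a sketch only: `parseAttributes` is a hypothetical helper, and the pattern assumes every attribute value is quoted and contains no entities.

```d
import std.regex : matchAll, regex;

/// Hypothetical helper: pulls name="value" or name='value' pairs out of a tag body.
string[string] parseAttributes(string tagBody)
{
    // name, optional spaces, '=', optional spaces, then a quoted value
    auto attrPattern = regex(`([^\s=/]+)\s*=\s*("([^"]*)"|'([^']*)')`);
    string[string] attrs;
    foreach (m; tagBody.matchAll(attrPattern))
    {
        // group 3 holds a double-quoted value, group 4 a single-quoted one
        attrs[m[1]] = m[3].length ? m[3] : m[4];
    }
    return attrs;
}
```

Called on the tag body `a href="x y" id='z'`, this skips the tag name and keeps the space inside the `href` value, which the whitespace split would have cut in two.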

ratchet freak answered Sep 22 '22 17:09