How to create/write a simple XML parser from scratch?
Rather than code samples, I want to know the simplified, basic steps in plain English.
How is a good parser designed? I understand that regex should not be used as a parser, but how large a role does regex play in parsing XML?
What is the recommended data structure to use? Should I use linked lists to store and retrieve nodes, attributes, and values?
I want to learn how to create an XML parser so that I can write one in the D programming language.
XML parsing is the process of reading an XML document and exposing its contents to an application through an interface. An XML parser is the software library or package that does this: it reads the XML and gives client programs a way to work with the document.
If you don't know how to write a parser, then you need to do some reading. Get hold of any book on compiler writing (many of the best ones were written 30 or 40 years ago, e.g. Aho and Ullman) and study the chapters on lexical analysis and syntax analysis. XML is essentially no different, except that the lexical and grammar phases are not as clearly isolated from each other as in some languages.
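To make the two phases concrete, here is a minimal sketch in D, not a conformant parser. The names `Token`, `TokenKind`, `tokenize`, `tagName` and `isWellNested` are made up for this illustration, and the sketch assumes input with no comments, CDATA sections, processing instructions, or `>` characters inside attribute values.

```d
import std.array : split;
import std.stdio : writeln;
import std.string : indexOf, strip;

enum TokenKind { text, openTag, closeTag, selfClosingTag }

struct Token {
    TokenKind kind;
    string value; // text run, or tag content (name plus raw attributes)
}

/// Lexical phase: cut the input into text runs and tag tokens.
Token[] tokenize(string src) {
    Token[] tokens;
    size_t pos = 0;
    while (pos < src.length) {
        if (src[pos] != '<') {
            // Text runs up to the next '<' or the end of input.
            auto next = src[pos .. $].indexOf('<');
            size_t end = next < 0 ? src.length : pos + cast(size_t) next;
            tokens ~= Token(TokenKind.text, src[pos .. end]);
            pos = end;
        } else {
            auto close = src[pos .. $].indexOf('>');
            if (close < 0) throw new Exception("unterminated tag");
            string inner = src[pos + 1 .. pos + cast(size_t) close];
            if (inner.length > 0 && inner[0] == '/')
                tokens ~= Token(TokenKind.closeTag, inner[1 .. $].strip());
            else if (inner.length > 0 && inner[$ - 1] == '/')
                tokens ~= Token(TokenKind.selfClosingTag, inner[0 .. $ - 1].strip());
            else
                tokens ~= Token(TokenKind.openTag, inner.strip());
            pos += cast(size_t) close + 1;
        }
    }
    return tokens;
}

/// First whitespace-separated word of a tag's content.
string tagName(string tagContent) {
    auto parts = tagContent.split();
    return parts.length > 0 ? parts[0] : "";
}

/// Syntax phase: check that open and close tags nest properly,
/// using a stack of the currently open element names.
bool isWellNested(Token[] tokens) {
    string[] stack;
    foreach (t; tokens) {
        if (t.kind == TokenKind.openTag) {
            stack ~= tagName(t.value);
        } else if (t.kind == TokenKind.closeTag) {
            if (stack.length == 0 || stack[$ - 1] != t.value) return false;
            stack = stack[0 .. $ - 1];
        }
    }
    return stack.length == 0;
}

void main() {
    auto tokens = tokenize("<a>hi<b/></a>");
    writeln(tokens);
    writeln(isWellNested(tokens));
}
```

The tokenizer has to know about tag syntax to decide where a token ends, which is one small instance of the lexical and grammar phases not being cleanly separated.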
One word of warning: if you want to write a fully conformant XML parser, then 90% of your effort will be spent getting edge cases right in obscure corners of the spec, dealing with things such as parameter entities that most XML users aren't even aware of.
For an event-based parser, the user needs to pass it some functions (startNode(name, attrs), endNode(name) and someText(txt), likely through an interface), and the parser calls them when needed as it passes over the file.
The parser will have a while loop that alternates between reading until < and reading until >, and does the proper conversions to the parameter types.
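    import std.algorithm.searching : endsWith, startsWith;
    import std.array : split;
    import std.stdio : File;
    import std.string : chomp, strip;

    // callback interface with the three functions described above
    interface EventParser {
        void startNode(string name, string[string] attrs);
        void endNode(string name);
        void someText(string txt);
    }

    void parse(EventParser p, File file) {
        string str;
        while ((str = file.readln('<')).length != 0) {
            // not using a rewritable buffer, to take advantage of slicing,
            // but it's a quick conversion to an implementation with one
            str = str.chomp("<");
            if (str.length > 0) p.someText(str);

            str = file.readln('>').chomp(">").strip();
            if (str.length == 0) break; // end of file, no further tag

            bool closing = str.startsWith("/");
            bool selfClosing = !closing && str.endsWith("/");
            if (closing) str = str[1 .. $];
            if (selfClosing) str = str[0 .. $ - 1];

            // split str into name and attrs (naive whitespace split;
            // values keep their quotes and must not contain spaces)
            auto parts = str.split();
            if (parts.length == 0) continue;
            string name = parts[0];
            string[string] attrs;
            foreach (attribute; parts[1 .. $]) {
                auto splitAttr = attribute.split("=");
                if (splitAttr.length == 2) attrs[splitAttr[0]] = splitAttr[1];
            }

            if (closing) p.endNode(name);
            else {
                p.startNode(name, attrs);
                if (selfClosing) p.endNode(name); // self-closing tag
            }
        }
    }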
You can build a DOM parser on top of an event-based parser. The basic functionality you'll need for each node is getChildren, getParent, getName and getAttributes (with setters when building ;) ).
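As a rough sketch of what such nodes might look like in D: the class names `DOMNode`, `RootNode`, `ElementNode` and `TextNode`, and the members `parent`, `children`, `name`, `attributes` and `appendChild`, are assumptions chosen for this illustration, not a fixed API. Children are kept in a dynamic array rather than a linked list, which is one reasonable choice since appending and indexing are both cheap.

```d
import std.stdio : write, writeln;

/// Base class: every node knows its parent and its children.
class DOMNode {
    DOMNode parent;
    DOMNode[] children; // dynamic array instead of a linked list

    this(DOMNode parent) { this.parent = parent; }

    void appendChild(DOMNode child) {
        child.parent = this;
        children ~= child;
    }

    string name() { return "#node"; }
}

/// The invisible top of the tree; its only element child is the document root.
class RootNode : DOMNode {
    this() { super(null); }
    override string name() { return "#document"; }
}

class ElementNode : DOMNode {
    private string name_;
    string[string] attributes;

    this(DOMNode parent, string name, string[string] attrs) {
        super(parent);
        name_ = name;
        attributes = attrs;
    }

    override string name() { return name_; }
}

class TextNode : DOMNode {
    string text;

    this(string text) {
        super(null); // parent is set by appendChild
        this.text = text;
    }

    override string name() { return "#text"; }
}

/// Print the tree, one node name per line, indented by depth.
void dump(DOMNode node, int depth = 0) {
    foreach (i; 0 .. depth) write("  ");
    writeln(node.name());
    foreach (child; node.children) dump(child, depth + 1);
}

void main() {
    auto root = new RootNode();
    auto book = new ElementNode(root, "book", ["id": "42"]);
    root.appendChild(book);
    book.appendChild(new TextNode("Hello"));
    dump(root);
}
```

Storing the parent on each node is what lets a builder move back up the tree when an element closes.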
The handler object for the DOM parser, using the node methods described above:
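    // DOMNode, RootNode, ElementNode and TextNode are assumed to provide
    // appendChild, parent and name as described above
    class DOMEventParser : EventParser {
        DOMNode root;
        DOMNode current;

        this() {
            // created per parser instance (a field initializer would be
            // evaluated at compile time)
            root = new RootNode();
            current = root;
        }

        override void startNode(string name, string[string] attrs) {
            DOMNode tmp = new ElementNode(current, name, attrs);
            current.appendChild(tmp);
            current = tmp;
        }

        override void endNode(string name) {
            assert(name == current.name);
            current = current.parent;
        }

        override void someText(string txt) {
            current.appendChild(new TextNode(txt));
        }
    }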
When parsing ends, the root node holds the root of the DOM tree.
Note: I didn't put any verification code in there to ensure the correctness of the XML.
Edit: the parsing of the attributes has a bug in it. Instead of splitting on whitespace, a regex is better for that.
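A sketch of that regex-based approach, using D's `std.regex`. The function name `parseAttributes` and the exact pattern are assumptions for illustration; the pattern accepts single- or double-quoted values and does not decode entities or validate attribute names against the XML spec.

```d
import std.regex : matchAll, regex;
import std.stdio : writeln;

/// Extract name="value" / name='value' pairs from the inside of a tag.
/// Values may contain whitespace, which a whitespace split cannot handle.
string[string] parseAttributes(string tagBody) {
    // group 1: attribute name, group 2: double-quoted value,
    // group 3: single-quoted value
    auto re = regex(`([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')`);
    string[string] attrs;
    foreach (m; tagBody.matchAll(re)) {
        attrs[m[1]] = m[2].length > 0 ? m[2] : m[3];
    }
    return attrs;
}

void main() {
    auto attrs = parseAttributes(`a href="x y" id='7'`);
    writeln(attrs);
}
```

This is the kind of job regex suits in an XML parser: recognizing flat, token-level pieces such as a single attribute. The nesting of elements still needs a loop with a stack or a recursive structure, because regular expressions cannot track arbitrarily deep nesting.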