Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Steps and involvement of implementing a parser (in .Net - and in this case XPath 2.0)

In the lack of any good free XPath 2.0 implementations for .Net build upon Linq to XML I have thought about implementing my own (also for the experience). But just to be clear (and not building something that exists) these are the XPath 2.0 implementations I have found:

  • Saxon .Net
  • Query Machine - I had problems with this - exceptions with the examples
  • XQSharp - may be good, but is commercial (single developer ~300 $)

Now, I want some thoughts on how difficult it is to implementing some language such as XPath 2.0 expressions. I have found this link which have a EBNF for XPath 2.0 expression: http://www.w3.org/TR/2007/REC-xpath20-20070123/#id-grammar and I'm thinking of making it in F# with the fslex/fsyacc combination.

My background (subjective): I have played with these tools before, but only for some simple expressions and a very simple programming language. Furthermore, I have read most of the Dragon book and Appel´s Modern compiler implementation in ML - but unfortunately, I have not put the theory in practice while reading. I've studied computer science in a year now where I have completed courses with theory about ex finite automaton, CFL and algorithms but I have been a developer for years before university (a few years with professional jobs - back-end of websites mainly).

Now, the steps of parsing and what I tend to cover:

  1. Lex - Parsing - Reductions: FsLex/FsYacc. I will properly not cover ALL of Xpath 2.0 at first but at least all of what XPath 1.0 can do + a little more.
  2. Sematic analysis - I'm not sure about how much there is to this
  3. Optimization - I do not tend to cover this (at least not at first)
  4. Actual traversing etc.
  5. ...?

Now, the concrete questions in addition to the above:

  1. How difficult is it to make a parser of this size? based on my background, would I could to it?
  2. Is there any crucial steps I have missed in regards to XPath 2.0 in particular?
  3. Is there any technology I have missed; Do I have to cover more than just XPath 2.0 and XDocument etc. to be able to make the parser?

To be clear: I want to make a XPath 2.0 expression parser and traverse XDocument etc. with this parsed expression. Which I guess combined is a query engine.

Update: I found this: http://www.w3.org/2007/01/applets/xpathApplet.html which contains code to parsing and traversing. I think it would be a nice start or reference :-)

Your answers will be appreciated.

like image 985
Lasse Espeholt Avatar asked Aug 24 '10 09:08

Lasse Espeholt


3 Answers

I am one of the developers of XQSharp, so I have experience in this area. XQSharp actually began its life as an XPath implementation before we expanded it to support XQuery.

Our initial implementation took us about 6 months, although this was not the only thing we were working on at the time.

After this time we had an implementation that was feature complete. There were many areas in which this was not fully conformant, where the standard .NET methods did not behave quite the same as the specification required. Some examples of this are with converting values to and from strings, regular expressions, a lot of unicode stuff, problems with the .NET representations of XML (eg handling of xml:base) and so on.

There were several areas that needed to be done to implement this:

Parsing: The parser itself was straightforward, and mostly generated from the EBNF in the spec. I would estimate that this initially represented a few weeks work.

Data Model: How the data is represented. In order to have a full XPath implementation there are a lot of new data types (like xs:gDay) that need to be implemented. In our case we have all our items derive from a base type and all our expressions would return enumerators of these. You also need to be able to identify whether the type of an item matches a particular XPath type. We supported static typing and schema-awareness from the outset, without these features this section probably becomes trivial, but you are still looking at several weeks worth of work.

Expressions/Abstract Syntax Tree This is the model of the expression itself. We used the XQuery Formal Semantics document to produce a mapping from the various XPath constructs (for example axes and predicates) to a simpler core grammer (which ends up with huge amounts of let, for if and typeswitch expressions!). In our initial implementation all these expressions had evaluate methods and so were the final representation of the expression. In our case the expressions all had type check methods too, but this can be skipped initially (The main purpose of these is for optimization). Creating all these expressions again took several weeks.

Functions As a previous commenter pointed out the function library for XPath is rather large. The entire XPath library took us several months to implement.

Static Analysis A small amount of static analysis is required. Variable references and function calls must be bound to the correct variables and functions. Most XPath implementations are stack based, and so a stack allocation phase is required to assign pointers (or indexes) to all the variables. This static analysis took a week or two. The Dragon Book should set you up very nicely to solve most of these problems.

You are probably looking at another month's worth of work for all the extra bits of work that do not fall directly into these categories.

After all this work we were left with a mostly functional implementation of XPath; but it was far to slow for real world use (maybe 100x slower than XPath 1 in .NET). So after this comes the fun work - Optimization.

Bringing the engine up to 100% conformance and adding optimizations probably took another 12-18 man months (although we probably went a little overboard with optimization!), but by that point we had already made the transition to being an XQuery implementation.

My advice would be to start by tackling a subset of XPath (maybe forward axes only and a very limited function library) and you might be able to knock together an implementation in a month or two, but a serious XPath2 implementation will be a big investment in time.

Make sure that you use XPathNavigator for your node representation, as it has methods like SelectChildren which can take advantages of indexes in the underlying representations (for example XPathDocument).

like image 65
Oliver Hallam Avatar answered Nov 15 '22 01:11

Oliver Hallam


I implemented an XPath 2.0 parser entirely in XSLT 2.0 three years ago.

I used my LR Parsing Framework in FXSL and this was not so difficult. The grammar is quite big -- 209 rules, if I remember well. I used a modificationn of YACC (done by me) which I call Yaccx to generate the parsing tables as XML. These are the input for the general LR Parser, written in XSLT.

For such kind of project you need to allocate at least 6 months full time, maybe 1 year. The difficulty is in implementing the enormous function library (F & O).

Also, XPath is not a standalone language -- it must be hosted by another language. Due to this reason I didn't use this parser for anything meaningful, as I didn't have the access, influence and possibility to alter an existing hosting language.

So, be prepared for all these difficulties.

like image 30
Dimitre Novatchev Avatar answered Nov 15 '22 02:11

Dimitre Novatchev


To address your third concrete question, the Dragon Book makes no mention of Parsing Expression Grammars (PEGs)/Packrat parsers/parser combinator libraries, which are quite the rage now, especially when it comes to functional languages. See FParsec, for example.

like image 3
ig2r Avatar answered Nov 15 '22 01:11

ig2r