Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does parsing an HTML/XML document work?

Tags:

dom

regex

I have been told and watched others be told very often: do not use regular expressions to parse (or "parse") a document written in a language like HTML, XML etc. The reasons named vary and are not really of importance here.

When asked what to do instead, usually you will be referred to a library for parsing such a document - a PHP extension, a JS framework etc. Most of the time they seem to rely on the document object model.

My question is not how to do this in a program or script. In a real situation I would not attempt to invent the wheel another time but just use one of the available frameworks.

What I want to know is - how do these frameworks do it? Or how would I do it without a framework (hypothetically)? I am not talking about any language in specific, I am interested in the theory behind extracting information from a document.

like image 770
Armatus Avatar asked Apr 17 '12 16:04

Armatus


1 Answers

Parsing XML requires a tool that's capable of recognizing something called a "context-free language." Regular expressions recognize regular languages, which are a subset of context-free langauges.

Recognizing Regular Languages

Regular languages are recognized by deterministic finite automata (DFAs). A DFA is a set of states with transition edges between states, and an input buffer (the string you're parsing). The DFA begins in its start state. The DFA reads off the character at the beginning of the input buffer, which tells it which transition to take. This moves the DFA to the next state, where it repeats the process. If the DFA ever encounters an input character it doesn't have a transition for, it ends (the input was not recognized). If the DFA reaches a designated end state, the input has been recognized

The most important thing to remember is that DFAs can't remember what states they've been to---just where they are right now, and where to go next. This makes it impossible for a DFA to recognize certain types of languages, like matched XML tags for example.

Regular expression implementations (like PCRE) have some extensions for convenience ('+', '?', and character classes, for example), and others that change the power of regular expressions (like lookahead and back-references). These regular expressions are more powerful than DFAs, but it would be hard or impossible to build an XML parser with just these extended regular expressions.

Recognizing Context-Free Languages

Context-free languages are recognized by pushdown automata. These work just like DFAs, but with the addition of a stack. Pushdown automata select a transition using the first character of input and the value on the top of the stack. In each step, the machine consumes one input character and can push a value on the stack, pop one off, or do nothing with the stack.

Pushdown automata can use the stack to remember where they've been, which makes them suitable for parsing languages like XML (or most programming languages, with a few special exceptions).

Parsing XML

Parsers aren't built by designing a pushdown automaton, the same way you don't recognize regular languages by designing a DFA. Context-free grammars are a nicer way to describe a context-free language. They're typically written down in Backus-Naur Form (BNF). Here's a simple BNF grammar for a subset of XML:

Tags ::= Tag Tags | <nothing>

Tag ::= "<" /[a-zA-Z]+/ Attributes ">" Document "</" /[a-zA-Z]+/ ">"

Attributes ::= Attribute Attributes | <nothing>

Attribute ::= /[a-zA-Z]+/ "=" "\"" /[a-zA-Z0-9 ]+/ "\""

This grammar is made up of non-terminals ("Tags", "Tag", "Attributes", and "Attribute"). Anywhere a non-terminal shows up on the right side of a rule it can be replaced by any of the possible definitions (separated by |). The text in quotes and regular expressions are terminals, which must match the input exactly.

The Tag non-terminal recognizes the start and end tags, with a Tags non-terminal between them. Whenever the parser recognizes a start tag, it expects to find the closing tag on the other side. Tags will recognize one tag, followed by Tags again. This recursive definition allows the parser to recognize an unbounded number of tags.

Parser generators are tools that turn context-free grammars into pushdown automata to recognize the input language. This takes a lot of the complexity out of building a parser, although there are plenty of challenges in accurately specifying a grammar.

Other Methods for Parsing

You can write a parser without building the state machine by hand, or by writing a context-free grammar. Typically this is done either with a recursive-descent parser or a hand-crafted parser that uses regular expressions with some special knowledge about the language being parsed. Recursive descent parsers look a lot like context-free grammars, but have some severe performance problems and functional limitations. There are also parsing expression grammars (PEGs) which work like a hybrid of regular expressions and BNF grammars. There are great articles on all of these techniques on Wikipedia, and many tools available for building parsers of all sorts.

like image 87
ccurtsinger Avatar answered Nov 15 '22 08:11

ccurtsinger