Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Use CSS selectors to collect HTML elements from a streaming parser (e.g. SAX stream)

How to parse CSS (CSS3) selector and use it (in jQuery-like way) to collect HTML elements not from DOM (from tree structure), but from stream (e.g. SAX), i.e. using sequential access event based parser?

By the way, are there any CSS selectors (or their combination) that need access to DOM (Wikipedia SAX page says that XPath selectors "need to be able to access any node at any time in the parsed XML tree")?

I am most interested in implementing selector combinators, e.g. 'A B' descendant selector.

I prefer solutions describing algorithm, or in Perl (for HTML::Zoom).

like image 881
Jakub Narębski Avatar asked Jan 11 '11 11:01

Jakub Narębski


1 Answers

I would do it with regular expressions.

First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To explain, here are some simple selectors and their corresponding regexen:

  • A becomes /<A[^>]*>$/
  • A#someid becomes /<A[^>]*id="someid"[^>]*>$/
  • A.someclass becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
  • A > B becomes /<A[^>]*><B[^>]*>$/
  • A B becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/

And so on. Note that the regular expressions all end with $, but do not start with ^; this corresponds with the way CSS selectors do not have to match from the root of the document. Also note that there is some lookbehind and lookahead stuff in the class matching code, which is necessary so that you don't accidentally match against "someclass-super-duper" when you want the quite distinct class "someclass".

If you need more examples, please let me know.

Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of tags which currently apply; update this stack whenever you descend or ascend. To check for selector matching, convert that stack to a list of tags which can match the regular expression. For example, consider this document:

<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>

Your stack state string would go through the following values in order as you enter each element:

  1. <x>
  2. <x><a>
  3. <x><y id="boo">
  4. <x><y id="boo"><z class="bar">

The matching process is simple: whenever the parser descends into a new element, update the state string and check if it matches the selector regex. If the regex matches, then the selector matches that element!

Issues to watch out for:

  • Double quotes inside attributes. To get around this, apply html entity encoding to attribute values when creating the regex, and to attribute values when creating the stack state string.

  • Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass which expects <a id="someid" class="someclass"> unfortunately fails when your parser goes into <a class="someclass" id="someid">.

  • Case sensitivity. According to the HTML spec, the class and id attributes match case sensitively (notice the 'CS' marker on the corresponding sections). So, you must use case-sensitive regex matching. However, in HTML, element names are not case sensitive, although they are in XML. If you want HTML-like case-insensitive element name matching, then canonicalize element names to either upper case or lower case in both the selector regex and the state stack string.

  • Additional magic is necessary to deal with the selector patterns that involve presence or absence of element siblings, namely A:first-child and A + B. You might accomplish these by adding a special attribute to the tag containing the name of the tag immediately prior, or "" if this tag is the first child. There's also the general sibling selector, A ~ B; I'm not quite sure how to deal with that one.

EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:

  1. There might be more than one possible transition from any given state
  2. The machine tries one of them, and if that doesn't work out, then it backtracks and tries the other
  3. The easiest way to implement this is to keep a stack for the machine, which you push onto whenever you follow a path and pop from whenever you need to backtrack. It comes down to the same sort of thing you'd use to do a depth-first search.

The secret behind nearly all of the awesomeness of regular expressions is in its use of this style of state machine.

like image 166
DSimon Avatar answered Sep 18 '22 12:09

DSimon