
Screen scraping: regular expressions or XQuery expressions?

I was answering some quiz questions for an interview, and one question asked how I would do screen scraping. That is, picking content out of a web page, assuming you don't have a better structured way to query the information directly (e.g. a web service).

My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy. I had to search up through the ancestors a fair way before I found an element with an id attribute. For example, scraping an Amazon.com page for Product Dimensions looks like this:

//a[@id="productDetails"]
/following-sibling::table
//h2[contains(child::text(), "Product Details")]
/following-sibling::div
//li
/b[contains(child::text(), "Product Dimensions:")]
/following-sibling::text()

That's a pretty nasty expression, but that's why Amazon provides a web service API. Anyway, it's just one example. The question was not about Amazon; it was about screen scraping.

The interviewer didn't like my solution. He thought it was fragile, because a change to the page design by Amazon could require rewriting the XQuery expression. Debugging an XQuery expression that doesn't match anything in the page it's applied against is hard.

I did not disagree with his statements, but I didn't think his solution was any improvement: he thought it's better to use a regular expression, and search for content and markup near the value you want (here, the product dimensions). For example, using Perl:

$html =~ m{<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>}s;

My counter-argument was that this is also susceptible to Amazon changing their HTML code. They could spell HTML tags in capitals (<LI>), add CSS attributes, change <b> to <span>, change the label "Product Dimensions:" to "Dimensions:", or make many other kinds of changes. My point was that regular expressions don't solve the weaknesses he called out in my XQuery solution.
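To make that counter-argument concrete, here is a minimal sketch: a Python port of the Perl pattern above, run against a few page fragments. The fragments and the `extract` helper are invented for illustration; they are not real Amazon markup.

```python
import re

# Python port of the Perl pattern from the question.
# re.DOTALL plays the role of Perl's /s flag.
PATTERN = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>",
    re.DOTALL,
)


def extract(html):
    """Return the captured value, or None if the pattern does not match."""
    match = PATTERN.search(html)
    return match.group(1).strip() if match else None


# Hypothetical fragments, one per kind of change listed above.
original = "<ul><li><b>Product Dimensions:</b> 9 x 6 x 1 inches</li></ul>"
upper = "<UL><LI><B>Product Dimensions:</B> 9 x 6 x 1 inches</LI></UL>"
styled = '<ul><li class="d"><b>Product Dimensions:</b> 9 x 6 x 1 inches</li></ul>'
spanned = "<ul><li><span>Product Dimensions:</span> 9 x 6 x 1 inches</li></ul>"
relabeled = "<ul><li><b>Dimensions:</b> 9 x 6 x 1 inches</li></ul>"

for name, html in [
    ("original", original),
    ("upper", upper),
    ("styled", styled),
    ("spanned", spanned),
    ("relabeled", relabeled),
]:
    print(name, "->", extract(html))
```

Only the first fragment yields a value; each of the other four changes makes the pattern return nothing.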

But in addition, regular expressions can find false positives, unless you add enough context to the expression. They can also unintentionally match content that happens to be inside a comment, or an attribute string, or a CDATA section.
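The comment case can be sketched as follows. The page fragment is hypothetical, and `DimensionFinder` is an illustrative helper built on Python's standard `html.parser` module, shown as one possible parser-based alternative rather than a recommended implementation.

```python
import re
from html.parser import HTMLParser

PATTERN = re.compile(
    r"<li>\s*<b>\s*Product Dimensions:\s*</b>\s*(.*?)</li>",
    re.DOTALL,
)

# Hypothetical fragment: a stale value left in a comment before the live one.
page = (
    "<!-- <li><b>Product Dimensions:</b> OLD VALUE</li> -->"
    "<ul><li><b>Product Dimensions:</b> 9 x 6 x 1 inches</li></ul>"
)


class DimensionFinder(HTMLParser):
    """Capture the text that follows a matching <b> label inside an <li>."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_b = False        # currently inside a <b> element
        self.capturing = False   # label matched, collecting the value
        self.value = None
        self._label_parts = []
        self._value_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_b = True
            self._label_parts = []

    def handle_endtag(self, tag):
        if tag == "b" and self.in_b:
            self.in_b = False
            label = "".join(self._label_parts).strip()
            if self.value is None and label == self.label:
                self.capturing = True
                self._value_parts = []
        elif tag == "li" and self.capturing:
            self.capturing = False
            self.value = "".join(self._value_parts).strip()

    def handle_data(self, data):
        if self.in_b:
            self._label_parts.append(data)
        elif self.capturing:
            self._value_parts.append(data)


def find_with_parser(html, label="Product Dimensions:"):
    finder = DimensionFinder(label)
    finder.feed(html)
    finder.close()
    return finder.value


regex_result = PATTERN.search(page).group(1).strip()
parser_result = find_with_parser(page)
print("regex: ", regex_result)
print("parser:", parser_result)
```

The regular expression picks up the commented-out value because it sees the page as plain text. The parser routes comment content to a separate handler (a no-op by default), so only the live markup is considered.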

My question is, what technology do you use to do screen scraping? Why did you choose that solution? Is there some compelling reason to use one? Or never use the other? Is there a third choice besides those I showed above?

PS: Assume for the sake of argument that there is no web service API or other more direct way to acquire the desired content.

Bill Karwin asked Mar 14 '09 18:03


2 Answers

I'd use a regular expression, but only because most HTML pages are not valid XML, so you'd never get the XQuery to work.

I don't know XQuery, but that looks like an XPath expression to me. If so, it looks a bit expensive with so many "//" operators in it.
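The point about HTML rarely being valid XML can be illustrated with a small sketch. The tag-soup fragment is made up, and the two helpers are hypothetical; both use only Python's standard library. A strict XML parser rejects the fragment, while a tolerant HTML parser still yields its text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: unclosed <br> and <li> elements, common in real pages.
tag_soup = "<ul><li>Width: 6 inches<br><li>Height: 9 inches</ul>"
well_formed = "<ul><li>Width: 6 inches</li><li>Height: 9 inches</li></ul>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collect non-blank text chunks, ignoring how the tags nest."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def text_chunks(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


print(parses_as_xml(tag_soup))     # strict XML parsing fails
print(parses_as_xml(well_formed))  # well-formed markup is accepted
print(text_chunks(tag_soup))       # the HTML parser still recovers the text
```

So the choice is not only between regular expressions and an XML toolchain; a parser that tolerates malformed HTML is a third option.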

John Saunders answered Sep 23 '22 05:09


I'd use a regular expression, for the reasons the manager gave, plus a few (more portable, easier for outside programmers to follow, etc.).

Your counterargument misses the point that his solution was fragile with regard to local changes while yours is fragile with regard to global changes. Anything that breaks his will probably break yours, but not vice versa.

Finally, it's a lot easier to build slop / flex into his solution (if, for example, you have to deal with multiple minor variations in the input).
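As a rough sketch of what that slop might look like, here is a loosened Python version of the pattern. The specific allowances (any tag case, optional attributes, `<b>`, `<span>`, or `<strong>` around the label, an optional "Product" prefix) and the sample fragments are assumptions chosen for illustration.

```python
import re

FLEXIBLE = re.compile(
    r"""
    <li\b[^>]*>\s*                  # opening li tag, attributes allowed
    <(b|span|strong)\b[^>]*>\s*     # label wrapper: b, span, or strong
    (?:Product\s+)?Dimensions:\s*   # label, with or without "Product"
    </\1\s*>\s*                     # closing tag matching the wrapper
    (.*?)                           # the value we want
    </li\s*>
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def extract_flexible(html):
    """Return the captured value, or None if nothing matches."""
    match = FLEXIBLE.search(html)
    return match.group(2).strip() if match else None


# Hypothetical variations of the same list item.
variants = [
    "<li><b>Product Dimensions:</b> 9 x 6 x 1 inches</li>",
    "<LI><B>Product Dimensions:</B> 9 x 6 x 1 inches</LI>",
    '<li class="d"><span>Dimensions:</span> 9 x 6 x 1 inches</li>',
    '<li><strong id="x">Product Dimensions:</strong>\n 9 x 6 x 1 inches\n</li>',
]

results = [extract_flexible(v) for v in variants]
for value in results:
    print(value)
```

Each allowance widens what the pattern accepts, which also raises the odds of a false positive, so how much slop to add is a judgment call per page.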

MarkusQ answered Sep 24 '22 05:09