Win32: How to scrape HTML without regular expressions?

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions - yet doesn't give an alternative.

I want to scrape search results, extracting values:

<div class="used_result_container"> 
   ...
      ...
         <div class="vehicleInfo"> 
            ...
               ...
                  <div class="makemodeltrim">
                     ...
                     <a class="carlink" href="[Url]">[MakeAndModel]</a>
                     ...
                  </div> 
                  <div class="kilometers">[Kilometers]</div> 
                  <div class="price">[Price]</div> 
                  <div class="location">
                     <span class='locationText'>Location:</span>[Location]
                  </div> 
               ...          
            ...
         </div> 
      ...
   ...
</div> 

...and it repeats

You can see the values I want to extract, [enclosed in brackets]:

Url
MakeAndModel
Kilometers
Price
Location

Assuming we accept the premise that parsing HTML with regular expressions:

is generally a bad idea
rapidly devolves into madness

What's the way to do it?

Assumptions:

native Win32
loose html

Assumption clarifications:

Native Win32

.NET/CLR is not native Win32
Java is not native Win32
perl, python, ruby are not native Win32
assume C++, in Visual Studio 2000, compiled into a native Win32 application

Native Win32 applications can call library code:

copied source code
DLLs containing function entry points
DLLs containing COM objects
DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects

Loose HTML

xml is not loose HTML
xhtml is not loose HTML
strict HTML is not loose HTML

Loose HTML implies that the HTML is not well-formed xml (strict HTML is not well-formed xml anyway), and so an XML parser cannot be used. In reality I was presenting the assumption that any HTML parser must be generous in the HTML it accepts.

Clarification#2

Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node with a class of used_result_container, which has a child DIV with a class of vehicleInfo. But the nodes don't necessarily have to be direct children of one another.

It sounds like I'm trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match - as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?

And would I not be writing a regular expression parser for DOM nodes? I'd be writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.

I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn't want to imply that the solution, necessarily, had anything to do with:

walking a DOM tree
xpath queries

Clarification#3

The sample HTML I provided I trimmed down to the important elements and attributes. The mechanism I used to trim the HTML down was based on my internal bias that uses regular expressions. I naturally think that I need various "sign-posts" in the HTML that I look for.

So don't confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.

Update 4

The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?

Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular DOM expression parser, capable of forward matching and capture.

In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes as children. Any used_result_container DIV nodes that do not contain vehicleInfo as a child are not relevant.

Is there a DOM regular expression parser with capture and forward matching? I don't think XPath can select higher level nodes based on criteria of lower level nodes:

\\div[@class="used_result_container" && .\div[@class="vehicleInfo"]]\*

Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.

asked Nov 24 '09 14:11

Ian Boyd

2 Answers

Native Win32

You can always use IHTMLDocument2. This is built in to Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).

answered Oct 24 '22 01:10

Frank Krueger

Python:

lxml - faster, perhaps better at parsing bad HTML

BeautifulSoup - if lxml fails on your input try this.

Ruby: (heard of the following libraries, but never tried them)

Nokogiri

hpricot

Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.

If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)

Edit: Your post has really grown since it first came out... I'll try to answer what I can.

I don't think XPath can select higher level nodes based on criteria of lower level nodes:

It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
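As a rough illustration of selecting a higher-level node by what it contains: in full XPath 1.0, a nested predicate such as //div[@class='used_result_container'][.//div[@class='vehicleInfo']] expresses this, and it also covers the case where vehicleInfo is not a direct child. The sketch below uses only Python's standard-library ElementTree, which supports just a subset of XPath, so the descendant test is done with find() instead. The PAGE markup, its id attributes, and the function name are made up for the example, and the input is assumed to be well-formed already (real pages would first need a lenient HTML parser such as lxml).

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup. The id attributes exist only
# so the result is easy to read; they are not in the original page.
PAGE = """
<html><body>
  <div class="used_result_container" id="a">
    <div class="wrapper"><div class="vehicleInfo">Civic</div></div>
  </div>
  <div class="used_result_container" id="b">
    <div class="advert">Not a vehicle</div>
  </div>
</body></html>
"""


def containers_with_vehicle_info(root):
    """Return container divs that have a vehicleInfo div anywhere beneath them."""
    return [
        div
        for div in root.iter("div")
        if div.get("class") == "used_result_container"
        # ".//" searches all descendants, not just direct children.
        and div.find(".//div[@class='vehicleInfo']") is not None
    ]


if __name__ == "__main__":
    root = ET.fromstring(PAGE.strip())
    print([div.get("id") for div in containers_with_vehicle_info(root)])
```

Only the first container is returned, because the second has no vehicleInfo descendant.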

how then do you access repeating structures of data?

It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
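lxml is a third-party package, so here is a dependency-free sketch of the same idea (let a tolerant parser find the sign-posts, then collect one record per repeating block) using Python's standard-library html.parser. It is not a DOM and not lxml's API; the class name VehicleScraper, the helper scrape(), and the SAMPLE page are assumptions for illustration. The class names being matched come from the question's sample HTML. It assumes the div tags themselves are balanced, even if other markup is sloppy.

```python
from html.parser import HTMLParser

# Maps the div class in the page to the field name wanted in the output.
FIELD_CLASSES = {"kilometers": "Kilometers", "price": "Price", "location": "Location"}


class VehicleScraper(HTMLParser):
    """Collects one dict per used_result_container that contains vehicleInfo."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._divs = []      # class attribute of every currently open <div>
        self._record = None  # record being built, or None outside a container
        self._field = None   # field currently receiving text
        self._skip = False   # True while inside the "Location:" label span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        if tag == "div":
            self._divs.append(cls)
            if cls == "used_result_container":
                self._record = {"has_vehicle_info": False}
            elif self._record is not None:
                if cls == "vehicleInfo":
                    self._record["has_vehicle_info"] = True
                elif cls in FIELD_CLASSES:
                    self._field = FIELD_CLASSES[cls]
                    self._record[self._field] = ""
        elif tag == "a" and cls == "carlink" and self._record is not None:
            self._record["Url"] = attrs.get("href")
            self._field = "MakeAndModel"
            self._record[self._field] = ""
        elif tag == "span" and cls == "locationText":
            self._skip = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._skip = False
        elif tag == "a" and self._field == "MakeAndModel":
            self._close_field()
        elif tag == "div" and self._divs:
            cls = self._divs.pop()
            if cls in FIELD_CLASSES and self._field == FIELD_CLASSES[cls]:
                self._close_field()
            elif cls == "used_result_container" and self._record is not None:
                # Keep only containers that actually held a vehicleInfo div.
                if self._record.pop("has_vehicle_info"):
                    self.records.append(self._record)
                self._record = None

    def handle_data(self, data):
        if self._record is not None and self._field and not self._skip:
            self._record[self._field] += data

    def _close_field(self):
        self._record[self._field] = self._record[self._field].strip()
        self._field = None


def scrape(html):
    parser = VehicleScraper()
    parser.feed(html)
    parser.close()
    return parser.records


# Made-up sample; the second container has no vehicleInfo and an unclosed <p>.
SAMPLE = """
<div class="used_result_container">
  <div class="vehicleInfo">
    <div class="makemodeltrim">
      <a class="carlink" href="/cars/1">Honda Civic</a>
    </div>
    <div class="kilometers">120,000 km</div>
    <div class="price">$4,500</div>
    <div class="location">
      <span class='locationText'>Location:</span>Toronto
    </div>
  </div>
</div>
<div class="used_result_container"><p>No vehicle here</div>
"""

if __name__ == "__main__":
    for record in scrape(SAMPLE):
        print(record)
```

The matching logic lives in one place and refers to elements and classes, not to character patterns, which is the "higher level of abstraction" trade-off: a site redesign still means editing the class names, but not rewriting pattern-matching code.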

Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad, you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...

I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.

answered Oct 24 '22 01:10

int3

Tags:

html

regex

windows

winapi

screen-scraping