In .Net, I found this great library, HtmlAgilityPack that allows you to easily parse non-well-formed HTML using XPath. I've used this for a couple years in my .Net sites, but I've had to settle for more painful libraries for my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?
I'm surprised there isn't a single mention of lxml. It's blazingly fast and will work in any environment that allows CPython libraries.
Here's how you can parse HTML via XPATH using lxml.
>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)
>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'
>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'
In python, ElementTidy parses tag soup and produces an element tree, which allows querying using XPath:
>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e= tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
The most stable results I've had have been using lxml.html's soupparser. You'll need to install python-lxml and python-beautifulsoup, then you can do the following:
from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath("./mal[@form=ed]")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With