
Parse HTML via XPath [closed]

In .NET, I found a great library, HtmlAgilityPack, that lets you easily parse non-well-formed HTML using XPath. I've used it for a couple of years in my .NET sites, but I've had to settle for more painful libraries in my Python, Ruby, and other projects. Is anyone aware of similar libraries for other languages?

Tristan Havelick asked Nov 13 '08 01:11



3 Answers

I'm surprised there isn't a single mention of lxml. It's blazingly fast and will work in any environment that allows CPython libraries.

Here's how you can parse HTML via XPath using lxml.

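>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)  # the HTML parser wraps the fragment in <html><body>

>>> r = tree.xpath('//foo/bar')  # match anywhere; '/foo/bar' would miss because the root is <html>
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('.//bar')  # relative search below the root element
>>> r[0].tag
'bar'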
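
For messier real-world markup, here is a minimal sketch along the same lines, assuming lxml is installed. `SOUP` and `extract_links` are made-up names for illustration, and the markup leaves its `<li>` and `<p>` tags unclosed on purpose.

```python
from lxml import html

# Deliberately sloppy markup: unclosed <li> and <p>, no <html>/<body> wrapper.
SOUP = (
    "<ul>"
    '<li><a href="/one">One</a>'
    '<li><a href="/two">Two</a>'
    "</ul>"
    "<p>Unclosed paragraph"
)


def extract_links(markup):
    """Return the href of every <a> element found in the markup."""
    tree = html.fromstring(markup)   # lenient HTML parser, repairs the tag soup
    return tree.xpath(".//a/@href")  # attribute values, in document order


links = extract_links(SOUP)
print(links)
```

The relative `.//` form is used so the query works whether the parser hands back a wrapper element or a full document.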
Jagtesh Chadha answered Nov 09 '22 20:11



In Python, ElementTidy parses tag soup and produces an element tree, which you can query with the subset of XPath that ElementTree's find methods support:

>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
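
The namespaced tag in that query is easy to get wrong, so here is a rough sketch of the same kind of lookup using only the standard library's `xml.etree.ElementTree`. It assumes the input is already well-formed XHTML (the sort of tree a tidying step produces); the sample document and the `x` prefix are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A well-formed XHTML document standing in for tidied output.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>Hello world</p><p class="note">Second</p></body>'
    "</html>"
)

root = ET.fromstring(doc)
ns = {"x": XHTML_NS}  # map a short prefix to the XHTML namespace

# Equivalent to root.find(".//{http://www.w3.org/1999/xhtml}p")
first_p = root.find(".//x:p", ns)

# ElementTree supports simple attribute predicates, but not full XPath.
notes = root.findall(".//x:p[@class='note']", ns)

print(first_p.tag, first_p.text)
print([p.text for p in notes])
```

Passing a prefix map keeps the query readable, at the cost of one extra argument on each call.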
Aaron Maenpaa answered Nov 09 '22 18:11



The most stable results I've had came from lxml.html's soupparser. You'll need to install python-lxml and python-beautifulsoup; then you can do the following:

from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath("./mal[@form='ed']")  # quote the literal; a bare ed means a child element named ed
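
One XPath detail worth spelling out: a string literal inside a predicate needs its own quotes. A bare `ed` is read as a child element named `ed`, so the comparison quietly matches nothing. Here is a small sketch of the difference, assuming lxml is installed and using its plain XML parser on a made-up well-formed document rather than soupparser:

```python
from lxml import etree

# Hypothetical well-formed input, just to isolate the predicate behaviour.
root = etree.fromstring('<root><mal form="ed">here!</mal></root>')

# Quoted: compares the form attribute against the string 'ed'.
quoted = root.xpath("./mal[@form='ed']")

# Unquoted: compares the form attribute against child elements named <ed>.
# <mal> has no such child, so the predicate is false and nothing matches.
unquoted = root.xpath("./mal[@form=ed]")

print(len(quoted), len(unquoted))
```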
Gareth Davidson answered Nov 09 '22 20:11
