Parse HTML via XPath [closed]

Question

In .NET, I found a great library, HtmlAgilityPack, that makes it easy to parse non-well-formed HTML using XPath. I've used it for a couple of years in my .NET sites, but I've had to settle for more painful libraries in my Python, Ruby, and other projects. Is anyone aware of similar libraries for other languages?

Jagtesh Chadha · Accepted Answer

I'm surprised there isn't a single mention of lxml. It's blazingly fast and will work in any environment that allows CPython libraries.

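Here's how you can parse HTML via XPath using lxml. Note that etree.HTML() returns the <html> root and wraps the fragment in <html> and <body> elements, so the paths below use // to match at any depth.

>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)  # root element is <html>, not <foo>

>>> r = tree.xpath('//foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('//bar')
>>> r[0].tag
'bar'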
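
As a rough sketch of the same approach on markup that is actually broken, the snippet below pulls link targets out of a fragment with unclosed tags. The sample markup and the helper name extract_hrefs are made up for illustration; only etree.HTML() and xpath() come from lxml.

```python
from lxml import etree


def extract_hrefs(markup):
    """Return the href of every <a> inside the element with id="nav"."""
    # etree.HTML() uses a forgiving parser and returns the <html> root.
    tree = etree.HTML(markup)
    # Selecting @href yields the attribute values as strings.
    return list(tree.xpath('//ul[@id="nav"]//a/@href'))


# Hypothetical tag soup: the first <li> and the final <p> are never closed.
broken = '<ul id="nav"><li><a href="/a">A</a><li><a href="/b">B</a></ul><p>unclosed paragraph'

hrefs = extract_hrefs(broken)
print(hrefs)
```

Selecting attributes directly in the XPath expression avoids a second loop over the matched elements.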

Aaron Maenpaa · Answer

In Python, ElementTidy parses tag soup and produces an element tree, which can be queried with the subset of XPath that ElementTree supports:

>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e = tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
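
The {http://www.w3.org/1999/xhtml} prefix in that path is how ElementTree spells a namespaced tag. The sketch below shows the two usual ways to write such a lookup, using only the standard library's xml.etree.ElementTree on already well-formed XHTML. The sample document and variable names are assumptions standing in for tidied output; this does not use ElementTidy itself.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Well-formed XHTML, standing in for what a tidying parser would produce.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Hello world</p><p>Second</p></body></html>"
)
root = ET.fromstring(xhtml)

# Option 1: spell out the namespace in the path ({uri}tag notation).
first = root.find(".//{%s}p" % XHTML_NS)

# Option 2: pass a prefix-to-URI mapping and use a short prefix in the path.
paragraphs = root.findall(".//x:p", {"x": XHTML_NS})
texts = [p.text for p in paragraphs]

print(first.tag)
print(texts)
```

ElementTree only implements a limited subset of XPath, so predicates beyond simple attribute and position tests generally need lxml instead.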

Gareth Davidson · Answer

The most stable results I've had have come from lxml.html's soupparser. You'll need to install python-lxml and python-beautifulsoup, then you can do the following:

from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
# The attribute value must be quoted; a bare ed would be read as a child element name.
matches = tree.xpath("//mal[@form='ed']")

Tags: python, html, parsing, ruby, xpath

Tristan Havelick