Processing badly formed HTML files with XPATH

I inherited someone else's (dreadful) codebase and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.

I'm currently using ElementTree in Python, trying to parse the site with XPath. Unfortunately, the HTML is malformed, and ElementTree keeps throwing errors.

Are there more forgiving XPath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other approaches, such as preprocessing, that could help here?

asked Mar 18 '26 07:03 by MrGlass

2 Answers

lxml can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:

>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'
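Since the original goal is collecting dead links, here is a minimal sketch of pulling every href out of a malformed page with lxml. The sample markup and variable names are illustrative, not from the question:

```python
from lxml import html

# Deliberately broken markup: unclosed <p>, unclosed <a>, no closing tags
broken = "<html><body><a href='http://foo.com'>foo</a> <p><a href='/bar'>bar"

tree = html.fromstring(broken)  # lxml repairs the tree instead of raising
hrefs = [a.get("href") for a in tree.xpath("//a[@href]")]
print(hrefs)  # ['http://foo.com', '/bar']
```

From there you could feed each URL to an HTTP client and flag non-200 responses as dead.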
answered Mar 19 '26 20:03 by Fred Foo

My commiserations!

You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

and more importantly:

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
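As a hedged sketch of that "find all the links" usage with the modern bs4 package (the input string and the externalLink class are made up for illustration):

```python
from bs4 import BeautifulSoup

# Intentionally malformed: unclosed <a> tags, no closing </html>
page = ("<html><body><a href='http://foo.com'>foo"
        "<a class='externalLink' href='/x'>x</body>")

soup = BeautifulSoup(page, "html.parser")

# "Find all the links" -- any <a> carrying an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]

# "Find all the links of class externalLink"
external = [a["href"] for a in soup.find_all("a", class_="externalLink")]

print(links)     # ['http://foo.com', '/x']
print(external)  # ['/x']
```

Beautiful Soup silently repairs the unclosed tags, so the traversal works the same as it would on well-formed markup.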

answered Mar 19 '26 19:03 by Martijn Pieters


