Processing badly formed HTML files with XPATH

I inherited someone else's (dreadful) codebase and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.

I'm currently using ElementTree in Python, trying to parse the site with XPath. Unfortunately, the HTML is malformed, and ElementTree keeps throwing errors.

Are there more forgiving XPath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other approaches, such as preprocessing, that could help here?

asked Mar 18 '26 07:03 by MrGlass

2 Answers

lxml can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:

>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'
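Since the original goal is collecting dead links, here is a minimal sketch of pulling every href out of a malformed page with lxml. The sample markup and variable names are illustrative, not from the question:

```python
from lxml import html

# Deliberately broken markup: unclosed <p>, unclosed <a>, no closing tags
broken = "<html><body><a href='http://foo.com'>foo</a> <p><a href='/bar'>bar"

tree = html.fromstring(broken)  # lxml repairs the tree instead of raising
hrefs = [a.get("href") for a in tree.xpath("//a[@href]")]
print(hrefs)  # ['http://foo.com', '/bar']
```

From there you could feed each URL to an HTTP client and flag non-200 responses as dead.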
answered Mar 19 '26 20:03 by Fred Foo

My commiserations!

You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

and more importantly:

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
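As a hedged sketch of that "find all the links" usage with the modern bs4 package (the input string and the externalLink class are made up for illustration):

```python
from bs4 import BeautifulSoup

# Intentionally malformed: unclosed <a> tags, no closing </html>
page = ("<html><body><a href='http://foo.com'>foo"
        "<a class='externalLink' href='/x'>x</body>")

soup = BeautifulSoup(page, "html.parser")

# "Find all the links" -- any <a> carrying an href attribute
links = [a["href"] for a in soup.find_all("a", href=True)]

# "Find all the links of class externalLink"
external = [a["href"] for a in soup.find_all("a", class_="externalLink")]

print(links)     # ['http://foo.com', '/x']
print(external)  # ['/x']
```

Beautiful Soup silently repairs the unclosed tags, so the traversal works the same as it would on well-formed markup.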

answered Mar 19 '26 19:03 by Martijn Pieters


