I'm trying to parse some html in Python. There were some methods that actually worked before... but nowadays there's nothing I can actually use without workarounds.
What other options are there these days? (if they support xpath, that would be great)
Beautiful Soup (bs4) is a Python library for pulling information out of HTML or XML files. It parses its input into an object on which you can run a variety of searches. To start parsing an HTML file, import the library and create a BeautifulSoup object from the markup.
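A minimal sketch of that import-and-construct step, assuming the bs4 package is installed. The sample markup and the variable names are made up for illustration, and the parser choice ("html.parser", the one bundled with Python) is just one option.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page.
markup = """
<html>
  <head><title>Example page</title></head>
  <body>
    <p class="intro">Hello, <a href="https://example.com/a">first link</a></p>
    <p>Another <a href="https://example.com/b">second link</a></p>
  </body>
</html>
"""

# "html.parser" ships with Python; "lxml" or "html5lib" can be passed
# instead if those packages are installed.
soup = BeautifulSoup(markup, "html.parser")

# A few of the searches you can run on the resulting object.
title = soup.title.get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]
intro = soup.find("p", class_="intro").get_text()

print(title)
print(links)
print(intro)
```

Note that Beautiful Soup itself does not evaluate XPath expressions; it offers find_all() and CSS selectors via select() instead.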
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Make sure that you use the html module when you parse HTML with lxml:
>>> from lxml import html
>>> doc = """<html>
... <head>
... <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>
All the errors and exceptions will melt away, and you'll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
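Since the question asks about XPath: elements returned by lxml.html have an xpath() method. Here is a small sketch on sloppy markup, assuming lxml is installed. The sample document and variable names are invented for illustration.

```python
from lxml import html

# Hypothetical sloppy markup: unclosed <p> tags and no closing </html>.
doc = """<html>
<head>
<title>Meh</title>
</head>
<body>
<p>First paragraph
<p>Second paragraph with a <a href="/next">link</a>
</body>"""

tree = html.document_fromstring(doc)

# XPath queries run directly on the parsed tree.
paragraphs = tree.xpath("//p")          # list of <p> elements
hrefs = tree.xpath("//a/@href")         # list of attribute values
link_text = tree.xpath("string(//a)")   # text of the first <a>

for p in paragraphs:
    print(p.text_content().strip())
print(hrefs, link_text)
```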
I've used pyparsing for a number of HTML page scraping projects. It is a sort of middle ground between BeautifulSoup and the full HTML parsers on one end and the too-low-level approach of regular expressions on the other (that way lies madness).
With pyparsing, you can often get good HTML scraping results by identifying the specific subset of the page or data that you are trying to extract. This approach avoids the issues of trying to parse everything on the page, since some problematic HTML outside of your region of interest could throw off a comprehensive HTML parser.
While this sounds like just a glorified regex approach, pyparsing offers builtins for working with HTML- or XML-tagged text. Pyparsing avoids many of the pitfalls that frustrate the regex-based solutions; for example, it handles empty tags (those of the form <blah />).
Here's a simple example from the pyparsing wiki that gets <a href=xxx> tags from a web page:
import urllib.request

from pyparsing import makeHTMLTags, SkipTo

# read HTML from a web page (decoded to text, since read() returns bytes in Python 3)
with urllib.request.urlopen("http://www.yahoo.com") as page:
    htmlText = page.read().decode("utf-8", errors="replace")

# define pyparsing expression to search for within HTML
anchorStart, anchorEnd = makeHTMLTags("a")
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd

# scan the page and print each link's text and target
for tokens, start, end in anchor.scanString(htmlText):
    print(tokens.body, '->', tokens.href)
This will pull out the <a> tags, even if there are other portions of the page containing problematic HTML. There are other HTML examples at the pyparsing wiki.
Pyparsing is not a totally foolproof solution to this problem, but because it exposes the parsing process to you, you can better control which pieces of the HTML you are specifically interested in, process them, and skip the rest.
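To try the "process what you want, skip the rest" idea without hitting the network, here is a self-contained sketch that runs the same anchor expression over an inline string. It assumes pyparsing is installed; the sample HTML and variable names are made up for illustration. Recent pyparsing releases also provide snake_case aliases such as make_html_tags and scan_string.

```python
from pyparsing import makeHTMLTags, SkipTo

# Hypothetical page fragment with deliberately messy markup around the links:
# an unclosed <b>, an upper-case tag, an unquoted attribute, single quotes.
html_text = """
<div><b>unclosed bold
<A HREF="https://example.com/one" class=nav>One</A>
<p>stray text <a  href='https://example.com/two'>Two</a>
"""

# Same expression as in the example above: opening tag, body text, closing tag.
anchor_start, anchor_end = makeHTMLTags("a")
anchor = anchor_start + SkipTo(anchor_end).setResultsName("body") + anchor_end

# scanString only reports the regions that match; everything else is skipped.
links = [
    (tokens.body, tokens.href)
    for tokens, start, end in anchor.scanString(html_text)
]

for body, href in links:
    print(body, "->", href)
```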