I'm trying to parse some HTML in Python. Some methods used to work, but nowadays there's nothing I can use without workarounds.

BeautifulSoup has problems now that SGMLParser has gone away
html5lib cannot parse half of what's "out there"
lxml tries to be "too correct" for typical HTML (attributes and tags cannot contain unknown namespaces, or an exception is thrown, which means almost no page with Facebook Connect can be parsed)

What other options are there these days? (If they support XPath, that would be great.)


Python html parsing that actually works

2 Answers

Make sure that you use the html module when you parse HTML with lxml:

>>> from lxml import html
>>> doc = """<html>
... <head>
...   <title> Meh
... </head>
... <body>
... Look at this interesting use of <p>
... rather than using <br /> tags as line breaks <p>
... </body>"""
>>> html.document_fromstring(doc)
<Element html at ...>

All the errors and exceptions will melt away, and you'll be left with an amazingly fast parser that often deals with HTML soup better than BeautifulSoup.
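Since the question asks about XPath, here is a minimal sketch, assuming lxml is installed, of running XPath queries over a tree built by `lxml.html`. The markup and file names below are made up for illustration; the list items and paragraph are deliberately left unclosed, one attribute is unquoted, and one tag is upper case.

```python
from lxml import html

# Hypothetical tag soup: unclosed <li> and <p>, unquoted and
# upper-case attributes, no closing </html>.
broken = """<html>
<body>
<ul>
  <li><a href=one.html>first link</a>
  <li><A HREF='two.html'>second link</A>
</ul>
<p>An unclosed paragraph
</body>"""

doc = html.document_fromstring(broken)

# XPath works on the repaired tree; tag and attribute names are lower-cased.
hrefs = doc.xpath("//a/@href")
texts = [a.text_content() for a in doc.xpath("//a")]

print(hrefs)
print(texts)
```

The same `doc` object also accepts more selective expressions, such as `//ul/li/a`, so the parsing step and the query step stay separate.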


answered Sep 20 '22 13:09

Tim McNamara

I've used pyparsing for a number of HTML page scraping projects. It is a sort of middle ground between BeautifulSoup and the full HTML parsers on one end, and the too-low-level approach of regular expressions on the other (that way lies madness).

With pyparsing, you can often get good HTML scraping results by identifying the specific subset of the page or data that you are trying to extract. This approach avoids the issues of trying to parse everything on the page, since some problematic HTML outside of your region of interest could throw off a comprehensive HTML parser.

While this sounds like just a glorified regex approach, pyparsing offers builtins for working with HTML- or XML-tagged text. Pyparsing avoids many of the pitfalls that frustrate the regex-based solutions:

accepts whitespace without littering '\s*' all over your expression
handles unexpected attributes within tags
handles attributes in any order
handles upper/lower case in tags
handles attribute names with namespaces
handles attribute values in double quotes, single quotes, or no quotes
handles empty tags (those of the form <blah />)
returns parsed tag data with object-attribute access to tag attributes
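Several of the points in the list above can be checked on a small inline string. This is only a sketch, assuming pyparsing is installed; the sample markup and the names `sample`, `hrefs`, and `srcs` are invented for illustration.

```python
from pyparsing import makeHTMLTags

# Hypothetical sample covering quoting styles, letter case,
# attribute order, extra attributes, and an empty tag.
sample = """
<a href="one.html">double quotes</a>
<A HREF='two.html' class=nav>upper case, single quotes</A>
<a  target=_blank   href=three.html >no quotes, extra attribute first</a>
<img src="logo.png" />
"""

anchor_start, anchor_end = makeHTMLTags("a")
img_start, img_end = makeHTMLTags("img")

# Attributes are available as object attributes on the parsed tokens.
hrefs = [tokens.href for tokens, start, end in anchor_start.scanString(sample)]
srcs = [tokens.src for tokens, start, end in img_start.scanString(sample)]

print(hrefs)
print(srcs)
```

No `\s*` or case-insensitive flags appear anywhere in the expression; the tag expressions built by `makeHTMLTags` take care of that.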

Here's a simple example from the pyparsing wiki that gets <a href=xxx> tags from a web page:

from urllib.request import urlopen

from pyparsing import makeHTMLTags, SkipTo

# read HTML from a web page (Python 3: urlopen returns bytes, so decode them)
with urlopen("http://www.yahoo.com") as page:
    htmlText = page.read().decode("utf-8", errors="replace")

# define pyparsing expression to search for within HTML
anchorStart, anchorEnd = makeHTMLTags("a")
anchor = anchorStart + SkipTo(anchorEnd).setResultsName("body") + anchorEnd

# scanString yields the matched tokens plus the start and end offsets
for tokens, start, end in anchor.scanString(htmlText):
    print(tokens.body, '->', tokens.href)

This will pull out the <a> tags, even if there are other portions of the page containing problematic HTML. There are other HTML examples at the pyparsing wiki:

http://pyparsing.wikispaces.com/file/view/makeHTMLTagExample.py
http://pyparsing.wikispaces.com/file/view/getNTPserversNew.py
http://pyparsing.wikispaces.com/file/view/htmlStripper.py
http://pyparsing.wikispaces.com/file/view/withAttribute.py

Pyparsing is not a totally foolproof solution to this problem, but because it exposes the parsing process to you, you can better control which pieces of the HTML you are specifically interested in, process them, and skip the rest.
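As a rough illustration of picking out only the pieces of interest, the sketch below narrows the match to table cells with a particular class and ignores everything else, including a row with badly nested tags. It assumes pyparsing is installed; the table, the class name `price`, and the variable names are hypothetical.

```python
from pyparsing import makeHTMLTags, SkipTo, withAttribute

# Hypothetical page fragment; the last row has badly nested tags.
page = """
<table>
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class=price>19.50</td></tr>
<tr><td colspan=2><b>broken <i>markup</b> we do not care about</td>
</table>
"""

td_start, td_end = makeHTMLTags("td")

# Only accept <td> tags whose class attribute equals "price";
# other <td> tags fail the parse action and are skipped by scanString.
price_start = td_start.copy().setParseAction(withAttribute(**{"class": "price"}))
price_cell = price_start + SkipTo(td_end).setResultsName("body") + td_end

prices = [tokens.body for tokens, start, end in price_cell.scanString(page)]
print(prices)
```

The broken row never has to be parsed correctly, because the expression only describes the cells that matter.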

answered Sep 18 '22 13:09

PaulMcG


Tags:

python

html

parsing

viraptor
