Need help parsing html in python3, not well formed enough for xml.etree.ElementTree

Question

I keep getting mismatched tag errors all over the place. I'm not sure exactly why: the input is the HTML of the craigslist homepage, which looks fine to me, though I haven't checked it thoroughly. Is there something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?

Ira Baxter · Accepted Answer

The mismatched tag errors are most likely caused by tags that really are mismatched. Browsers are famous for accepting sloppy HTML, which has made it easy for web page authors to write badly formed HTML, so there's a lot of it. There's no reason to believe that craigslist is immune to bad web page designers.
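To see the failure in isolation, here is a minimal sketch that feeds a made-up snippet (not taken from craigslist) to xml.etree.ElementTree. The snippet leaves a `<b>` tag unclosed, which a browser would tolerate but a strict XML parser rejects. The helper name `try_parse` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up example markup: <b> is opened but never closed.
sloppy_html = "<div><p><b>apartments</p></div>"


def try_parse(markup):
    """Return None if the markup parses as XML, else the parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


error = try_parse(sloppy_html)
print(error)  # a "mismatched tag" message with the line and column of the problem
```

The line and column in the message point at the first closing tag that does not match the most recently opened element, which is usually a good place to start looking in the real page.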

You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know of one.)
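For what it's worth, the standard library does ship a more forgiving option: html.parser.HTMLParser reports start tags, end tags and text as events and does not complain when they fail to pair up. The sketch below is only an illustration under that assumption; the class name `LinkCollector` and the sample markup are made up, and it collects link targets rather than building a tree.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag, ignoring whether tags are balanced."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Made-up markup with an unclosed <b> and an unclosed second <a>.
sloppy_html = (
    '<div><p><b>housing</p>'
    '<a href="/apa">apartments</a><a href="/roo">rooms</div>'
)

collector = LinkCollector()
collector.feed(sloppy_html)
collector.close()
print(collector.links)  # ['/apa', '/roo']
```

The trade-off is that you get a stream of events rather than a tree, so any nesting you care about has to be tracked in your own handler methods.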

One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on the result.
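A rough sketch of that pipeline, assuming the HTML Tidy command-line program is installed and available as `tidy` on the PATH. The helper name `tidy_to_xhtml`, the choice of flags and the sample markup are my own assumptions, not something from the original setup; adjust them to your Tidy version.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

# Assumed flags: emit XHTML, use numeric entities (ElementTree does not
# know named ones such as &nbsp;), read and write UTF-8, keep output quiet.
TIDY_COMMAND = ["tidy", "-quiet", "-asxhtml", "-numeric", "-utf8",
                "--show-warnings", "no"]


def tidy_to_xhtml(html_text):
    """Pipe HTML through Tidy and return the cleaned XHTML as bytes."""
    result = subprocess.run(
        TIDY_COMMAND,
        input=html_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors it could not repair).
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return result.stdout


# Made-up markup with an unclosed <b>.
sloppy_html = ("<html><head><title>t</title></head>"
               "<body><p><b>apartments</p></body></html>")

if shutil.which("tidy") is not None:
    root = ET.fromstring(tidy_to_xhtml(sloppy_html))
    print(root.tag)  # the XHTML output is namespaced, so the tag carries a {namespace} prefix
else:
    print("tidy is not installed; skipping")
```

Because the cleaned output is XHTML, element names are namespace-qualified, so ElementTree lookups need the namespace as well as the tag name.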

Matt Joiner · Answer

The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.

However, at the time of writing it isn't well supported on Python 3; there's more information about this at the end of the linked page.

Tags:

python

python-3.x

parsing

xml.etree

kryptobs2000
