
Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?

I want to do some screen-scraping with Python 2.7, and I don't understand the differences between HTMLParser, SGMLParser, and Beautiful Soup.

Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice?

Also, please let me know if I have overlooked a significant option.

Edit: I should mention that I'm not very experienced in HTML parsing, and I'm mainly interested in whichever option will get me moving the quickest, with the goal of parsing HTML on one particular site.

Eric Wilson asked Jun 27 '11 14:06



1 Answer

I use and would recommend lxml and pyquery for parsing HTML. I had to write a web-scraping bot a few months ago, and of all the popular alternatives I tried, including HTMLParser and BeautifulSoup, I went with lxml and the syntactic sugar of pyquery. I haven't tried SGMLParser, though.

From what I've seen, lxml is more or less the most feature-rich of these libraries, and its underlying C core performs well compared to the alternatives. As for pyquery, I really liked its jQuery-inspired syntax, which makes navigating the DOM more enjoyable.
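To give a rough idea of what scraping one particular page with lxml can look like, here is a minimal sketch. The markup, the `result` class name, and the `extract_links` helper are all made up for illustration, so adjust the XPath expression to whatever the real site uses. It assumes lxml is already installed.

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import lxml.html

# Hypothetical page fragment standing in for the site being scraped.
SAMPLE_HTML = """
<html><body>
  <ul id="results">
    <li><a class="result" href="/item/1">First item</a></li>
    <li><a class="result" href="/item/2">Second item</a></li>
    <li><a class="other" href="/about">About</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a class="result"> in the markup."""
    doc = lxml.html.fromstring(html)
    links = []
    # XPath ships with lxml itself, so no extra selector package is needed.
    for anchor in doc.xpath('//a[@class="result"]'):
        links.append((anchor.text_content().strip(), anchor.get("href")))
    return links


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE_HTML):
        print(text, href)
```

With pyquery, the same selection would read roughly `PyQuery(SAMPLE_HTML)('a.result')`, which is the jQuery-style sugar mentioned above; treat that line as an approximation and check the pyquery docs for the details.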

Here are some resources you might find useful in case you decide to give it a try:

  • lxml home page
  • pyquery home page
  • BeautifulSoup vs lxml benchmark
  • Windows installer for pyquery built against Python 2.7 - I had a hard time setting up pyquery :)

Well, that's my 2c :) I hope this helps.

tishon answered Sep 25 '22 16:09
