Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

difference between lxml and html5lib in the context of beautifulsoup

Is there a difference between the capabiities of lxml and html5lib parsers in the context of beautifulsoup? I am trying to learn to use BS4 and using the following code construct --

ret = requests.get('http://www.olivegarden.com')
soup = BeautifulSoup(ret.text, 'html5lib')
for item in soup.find_all('a'): 
    print item['href']

I started out with using lxml as the parser but noticed that for some websites the for loop just is never entered even though there are valid links in the page. The same page works with html5ib parser. Are there any specific type of pages that might not work with lxml?

I am on Ubuntu using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3

EDIT: I updated to lxml 3.1.2 and still see the same issue. On a mac though running 3.0.x the same page is being parsed properly. The website in question is www.olivegarden.com

like image 563
R11 Avatar asked Sep 03 '13 00:09

R11


People also ask

What does lxml do in BeautifulSoup?

To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml. html. soupparser module. It provides three main functions: fromstring() and parse() to parse a string or file using BeautifulSoup into an lxml.

What is difference between HTML parser and html5lib?

html. parser - built-in - no extra dependencies needed. html5lib - the most lenient - better use it if HTML is broken. lxml - the fastest.

What is difference between HTML parser and lxml?

lxml is also a similar parser but driven by XML features than HTML. It has dependency on external C libraries. It is faster as compared to html5lib. Lets observe the difference in behavior of these two parsers by taking a sample tag example and see the output.

How do I use lxml with BeautifulSoup in Python?

To use beautiful soup, you need to install it: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser, the default is lxml . You may already have it, but you should check (open IDLE and attempt to import lxml). If not, do: $ pip install lxml or $ apt-get install python-lxml .


1 Answers

html5lib uses the HTML parsing algorithm as defined in the HTML spec, and as implemented in all major browsers. lxml uses libxml2's HTML parser — this is based on their XML parser, ultimately, and does not follow any error handling for invalid HTML used anywhere else.

Most web developers only test with web browsers — standards be damned — so if you want to get what the page's author intended, you'll likely need to use something like html5lib that matches current browsers,

like image 146
gsnedders Avatar answered Oct 15 '22 08:10

gsnedders