difference between lxml and html5lib in the context of beautifulsoup

Is there a difference between the capabilities of the lxml and html5lib parsers in the context of BeautifulSoup? I am trying to learn BS4 and am using the following code:

import requests
from bs4 import BeautifulSoup

ret = requests.get('http://www.olivegarden.com')
soup = BeautifulSoup(ret.text, 'html5lib')
for item in soup.find_all('a', href=True):  # skip anchors that have no href
    print(item['href'])

I started out using lxml as the parser, but noticed that for some websites the for loop is never entered, even though the page contains valid links. The same page works with the html5lib parser. Are there specific types of pages that might not work with lxml?

I am on Ubuntu using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3

EDIT: I updated to lxml 3.1.2 and still see the same issue. On a Mac running 3.0.x, though, the same page is parsed properly. The website in question is www.olivegarden.com

563

asked Sep 03 '13 00:09

R11

1 Answer

html5lib uses the HTML parsing algorithm as defined in the HTML spec and as implemented in all major browsers. lxml uses libxml2's HTML parser, which is ultimately based on their XML parser and handles invalid HTML with its own error recovery, one that does not match the error handling used anywhere else.

Most web developers only test with web browsers (standards be damned), so if you want to get what the page's author intended, you'll likely need to use something like html5lib that matches current browsers.
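To see whether a given page is affected, it can help to run the same markup through each parser and compare what comes out. The following is only a minimal sketch: the helper names `extract_links` and `compare_parsers` and the sample markup are made up for illustration, and a parser that is not installed is simply reported as `None`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def extract_links(html, parser):
    """Return the href values `parser` finds, or None if it is not installed."""
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is unavailable.
        return None
    return [a['href'] for a in soup.find_all('a', href=True)]


def compare_parsers(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each parser name to the list of links it extracted from `html`."""
    return {parser: extract_links(html, parser) for parser in parsers}


if __name__ == '__main__':
    # Deliberately sloppy markup (hypothetical, not taken from any real site).
    sample = ('<p>Unclosed paragraph<a href="/menu">Menu</a>'
              '<table><a href="/locations">Locations</a></table>')
    for parser, links in compare_parsers(sample).items():
        print(parser, links)
```

If the lists differ for your page, the markup is probably invalid in a way the parsers recover from differently, and html5lib is the one most likely to agree with what a browser shows.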

146

answered Oct 15 '22 08:10

gsnedders

Tags:

python

beautifulsoup

lxml

html5lib
