I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup.
<table class="facts_label" id="facts_table">...</table>
I would like to know why the code below works with "html.parser" but prints None if I change "html.parser" to "lxml".
#! /usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; on Python 2 this was `from urllib import urlopen`

webpage = urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find('table', {'class': 'facts_label'})
print(table)
lxml is a similar parser, but it is driven more by XML features than HTML. It depends on external C libraries, and it is faster than html5lib. Let's observe the difference in behavior of these two parsers on a sample tag and see the output.
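A minimal sketch of that difference, matching the behavior documented for recent bs4 versions:

from bs4 import BeautifulSoup

# An invalid fragment: an opening <a> followed by a stray closing </p>.
broken = "<a></p>"
print(BeautifulSoup(broken, "html.parser"))  # <a></a> (the stray </p> is simply dropped)
print(BeautifulSoup(broken, "lxml"))         # <html><body><a></a></body></html> (lxml also adds the html/body skeleton)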
lxml on its own is much faster than BeautifulSoup; this may not matter if all you're waiting for is the network, but if you're parsing something on disk, the difference can be significant.
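A quick way to check this on your own data is a timeit sketch; here 'page.html' is a hypothetical locally saved HTML file:

import timeit

setup = "html = open('page.html').read()"
# Parse the same file with plain lxml and with BeautifulSoup, 100 times each.
print(timeit.timeit('fromstring(html)', setup='from lxml.html import fromstring; ' + setup, number=100))
print(timeit.timeit('BeautifulSoup(html, "lxml")', setup='from bs4 import BeautifulSoup; ' + setup, number=100))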
lxml can make use of BeautifulSoup as a parser backend (through its soupparser module), just like BeautifulSoup can employ lxml as a parser. When using BeautifulSoup from lxml, however, the default is to use Python's integrated HTML parser from the html.parser module.
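That direction looks roughly like this; a sketch that needs both lxml and beautifulsoup4 installed:

from lxml.html.soupparser import fromstring  # lxml parsing through BeautifulSoup

# The result is a regular lxml element tree, so the usual lxml API applies.
root = fromstring("<table class='facts_label' id='facts_table'></table>")
print(root.tag)  # usually 'html': soupparser wraps fragments in a full document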
Speed: Scrapy, which is a full scraping framework rather than just a parser, is incredibly fast. Its ability to send asynchronous requests makes it hands-down faster than BeautifulSoup for crawling, which means you can scrape and extract data from many pages at once.
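For context, a minimal hypothetical Scrapy spider for the table from the question could look like this; it uses only the standard scrapy.Spider API and a recent Scrapy for response.css(...).get():

import scrapy

class FactsSpider(scrapy.Spider):
    name = "facts"
    start_urls = ["http://www.thewebpage.com"]  # the placeholder URL from the question

    def parse(self, response):
        # Requests for all start_urls are scheduled asynchronously,
        # so many pages can be fetched and parsed concurrently.
        yield {"table": response.css("table.facts_label").get()}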
There is a special section in the BeautifulSoup documentation called Differences between parsers; it states that:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.
The differences become clear on HTML documents that are not well-formed.
The moral is just that you should use the parser that works in your particular case.
Also note that you should always explicitly specify which parser you are using. This helps you avoid surprises when running the code on different machines or in different virtual environments.
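For example, a small sketch (the markup string is just an illustration):

from bs4 import BeautifulSoup

markup = "<table class='facts_label'></table>"
soup = BeautifulSoup(markup, "lxml")  # explicit: fails loudly if lxml is missing, instead of silently switching parsers
# soup = BeautifulSoup(markup)        # implicit: bs4 guesses the best installed parser and, in recent versions, warns about it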