BeautifulSoup: how to ignore spurious end tags

Tags: python, html, python-3.x, beautifulsoup

I've read many good things about BeautifulSoup, which is why I'm currently trying to use it to scrape a set of websites with badly formed HTML.

Unfortunately, one behavior of BeautifulSoup is pretty much a showstopper for me right now:

It seems that when BeautifulSoup encounters a closing tag that was never opened, it decides to end the document instead. Also, the find method does not seem to search the content after the (self-induced) </html> tag in this case. This means that when the block I'm interested in happens to come after a spurious closing tag, I can't access its contents.

Is there a way I can configure BeautifulSoup to ignore unmatched closing tags rather than closing the document when they are encountered?


asked Dec 19 '15 12:12

carsten

1 Answer

BeautifulSoup doesn't do any parsing itself; it uses the output of a dedicated parser (lxml, html.parser, or html5lib).

Pick a different parser if the one you are using right now doesn't handle broken HTML the way you want it to. lxml is the fastest of the three and handles broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML, but it is a lot slower.

Also see "Installing a parser" in the BeautifulSoup documentation, as well as the "Differences between parsers" section.
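A quick way to see which parser copes with your markup is to feed the same broken snippet to each one and compare the results. The following is only a minimal sketch: the sample markup, the id "target", and the helper name find_target_with_each_parser are made up for illustration, and lxml and html5lib are assumed to be optional extras that may not be installed (missing parsers are skipped).

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: a stray </div> that was never opened
# appears before the block we actually want to reach.
BROKEN_HTML = (
    "<html><body><p>before</p></div>"
    '<p id="target">after</p></body></html>'
)


def find_target_with_each_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse `markup` with each parser and look up the element with id="target".

    Returns a dict mapping parser name to the element's text, or to None
    if that parser's tree does not contain the element. Parsers that are
    not installed are left out of the result.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
        tag = soup.find(id="target")
        results[name] = tag.get_text() if tag is not None else None
    return results


if __name__ == "__main__":
    for parser, text in find_target_with_each_parser(BROKEN_HTML).items():
        print(f"{parser}: {text!r}")
```

If one parser reports None for your real markup while another finds the element, pass the name of the one that works as the second argument to BeautifulSoup. Results can differ between parser versions, so it is worth running this against a sample of your actual pages.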

answered Oct 13 '22 01:10

Martijn Pieters
