
BeautifulSoup: how to ignore spurious end tags

I've read many good things about BeautifulSoup, which is why I'm currently trying to use it to scrape a set of websites with badly formed HTML.

Unfortunately, one behaviour of BeautifulSoup is pretty much a showstopper for me right now:

It seems that when BeautifulSoup encounters a closing tag (in my case </p>) that was never opened, it ends the document at that point instead of ignoring the tag. The find method also does not seem to search the content that follows the (self-induced) </html> tag in this case. This means that when the block I'm interested in comes after a spurious closing tag, I can't access its contents.

Is there a way I can configure BeautifulSoup to ignore unmatched closing tags rather than closing the document when they are encountered?

Asked by carsten, Dec 19 '15 12:12


1 Answer

BeautifulSoup doesn't do any parsing itself; it uses the output of a dedicated parser (lxml, html.parser, or html5lib).

Pick a different parser if the one you are using right now doesn't handle broken HTML quite the way you want it to. lxml is the faster parser and can handle broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML but is a lot slower.

Also see "Installing a parser" in the BeautifulSoup documentation, as well as the "Differences between parsers" section.
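
To see how each parser copes with a stray closing tag, a quick comparison along these lines may help. This is only a minimal sketch: the sample markup, the helper name `find_with_parsers` and the `target` id are made up for illustration, and the exact tree each parser builds can vary with the installed versions of bs4, lxml and html5lib.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: a stray </p> precedes the block we want to reach.
SAMPLE_HTML = (
    "<html><body>"
    "<div id='intro'>first</div>"
    "</p>"
    "<div id='target'>wanted</div>"
    "</body></html>"
)


def find_with_parsers(html, target_id, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: text of the element with target_id, or None if not found}.

    Parsers that are not installed are left out of the result.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        element = soup.find(id=target_id)
        results[name] = element.get_text() if element is not None else None
    return results


if __name__ == "__main__":
    for parser, text in find_with_parsers(SAMPLE_HTML, "target").items():
        print(f"{parser}: {text!r}")
```

If one parser reports None for the block you need, passing the name of another one as the second argument to BeautifulSoup is the only change required; the rest of the scraping code can stay the same.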

Answered by Martijn Pieters, Oct 13 '22 01:10