I've read many good things about BeautifulSoup, which is why I'm currently trying to use it to scrape a set of websites with badly formed HTML.
Unfortunately, there's one feature of BeautifulSoup that is pretty much a showstopper for me right now:
It seems that when BeautifulSoup encounters a closing tag (in my case </p>) that was never opened, it decides to end the document there instead. Also, the find method does not seem to search the contents behind the (self-induced) </html> tag in this case. This means that when the block I'm interested in happens to be behind a spurious closing tag, I can't access its contents.
Is there a way I can configure BeautifulSoup to ignore unmatched closing tags rather than closing the document when they are encountered?
Answer #1: You can use extract() to remove an unwanted tag before you get the text. But the result keeps all the '\n' characters and spaces, so you will need some extra work to remove them. Alternatively, you can skip every Tag object inside the outer span and keep only the NavigableString objects (the plain text in HTML).
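A minimal sketch of both approaches, assuming a made-up fragment in which an outer <span> wraps a nested tag we don't want (the markup and names here are illustrative, not from the original question):

    from bs4 import BeautifulSoup, NavigableString

    # Hypothetical fragment: an outer <span> with a nested tag we want to drop.
    html = '<span>keep this <b>drop this</b> and this too</span>'
    soup = BeautifulSoup(html, 'html.parser')
    span = soup.find('span')

    # Option 1: remove the unwanted tag entirely before reading the text.
    span.b.extract()

    # Option 2: keep only NavigableString children (the plain text) and skip
    # Tag objects, stripping the leftover whitespace/newlines mentioned above.
    pieces = [
        child.strip() for child in span.children
        if isinstance(child, NavigableString) and child.strip()
    ]
    print(' '.join(pieces))  # "keep this and this too"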
BeautifulSoup itself is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less capable in others. It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection.
The NavigableString object is used to represent the text contents of a tag. To access the contents, use ".string" on the tag. You can replace the string with another string, but you can't edit the existing string in place.
Going down: the most important elements in any HTML document are tags, which may contain other tags and strings (the tag's children). Beautiful Soup provides different ways to navigate and iterate over a tag's children.
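A small sketch of both ideas, using a hypothetical fragment (the tag names and strings are just for illustration):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<div><p>first</p><p>second</p></div>', 'html.parser')

    # .string gives the contents of a tag that holds a single string.
    p = soup.find('p')
    print(p.string)  # "first"

    # The existing string can't be edited in place, but it can be swapped out.
    p.string.replace_with('changed')

    # Going down: iterate over a tag's direct children (or use .descendants
    # to walk the whole subtree recursively).
    for child in soup.find('div').children:
        print(child.name, child.get_text())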
BeautifulSoup doesn't do any parsing itself; it uses the output of a dedicated parser (lxml, html.parser, or html5lib).
Pick a different parser if the one you are using right now doesn't handle broken HTML quite the way you want it to. lxml is the fastest of those and handles broken HTML quite well; html5lib comes closest to how your browser would parse broken HTML, but is a lot slower.
Also see Installing a parser in the BeautifulSoup documentation, as well as the Differences between parsers section.
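For example, a rough way to see how the parsers differ on the kind of input described in the question is to feed the same fragment, containing a stray </p>, to each of them. lxml and html5lib have to be installed separately, the exact repaired tree can vary with parser versions, and the fragment below is made up for illustration:

    from bs4 import BeautifulSoup

    # A made-up fragment with an unmatched closing </p> in front of the block
    # we actually care about.
    broken = '<html><body><p>before</p></p><div id="target">after</div></body></html>'

    for parser in ('html.parser', 'lxml', 'html5lib'):
        soup = BeautifulSoup(broken, parser)
        # Check whether the content behind the stray tag is still reachable.
        print(parser, '->', soup.find('div', id='target'))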