I have this code:

from lxml.html.clean import Cleaner

evil = "<script>malicious script</script><b>bold text</b><i>italic text</i>"
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
print(cleaner.clean_html(evil))
I expected to get this:
<b>bold text</b>italic text
But instead I'm getting this:
<div><b>bold text</b>italic text</div>
Is there an attribute to remove the div tag wrapper?
lxml/libxml2 often parses and repairs broken HTML better, while BeautifulSoup has superior support for encoding detection; which parser works better depends very much on the input. As the lxml documentation puts it about the BeautifulSoup-based parser: "The downside of using this parser is that it is much slower than the HTML parser of lxml."
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
lxml expects your HTML to have a tree structure, i.e. a single root node. If the input does not have one, lxml adds it.
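A minimal sketch of that wrapping behaviour (the input fragment here is just an illustration):

```python
from lxml.html import document_fromstring, tostring

# Two top-level elements: there is no single root node.
fragment = "<b>bold</b><i>italic</i>"

# document_fromstring builds a full document, adding the missing
# <html>/<body> wrapper around the fragment.
doc = document_fromstring(fragment)
print(tostring(doc))  # the fragment ends up inside <html><body>...</body></html>
```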
Cleaner always wraps its result in an element. A good solution is to parse the HTML yourself and pass the resulting document object to the cleaner; the result is then also a document object, and you can call text_content() to extract the text from the root.
from lxml.html import document_fromstring
from lxml.html.clean import Cleaner

evil = "<script>malicious script</script><b>bold text</b><i>italic text</i>"
doc = document_fromstring(evil)
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
print(cleaner.clean_html(doc).text_content())
This can also be done as a one liner
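For instance (a sketch using the same imports and input as above):

```python
from lxml.html import document_fromstring
from lxml.html.clean import Cleaner

evil = "<script>malicious script</script><b>bold text</b><i>italic text</i>"

# Parse, clean, and extract the text in a single expression.
print(Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
              page_structure=True).clean_html(document_fromstring(evil)).text_content())
```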