
How to use lxml.html's Cleaner without getting a wrapping div tag?

I have this code:

from lxml.html.clean import Cleaner

evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
print(cleaner.clean_html(evil))

I expected to get this:

<b>bold text</b>italic text

But instead I'm getting this:

<div><b>bold text</b>italic text</div>

Is there an attribute to remove the div tag wrapper?

Allan Veloso asked Jan 29 '14 02:01





2 Answers

lxml expects your HTML to have a tree structure, i.e. a single root node. If the input does not have one, lxml adds one.
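The single-root requirement can be illustrated without lxml. The sketch below uses the standard library's strict `xml.etree.ElementTree` parser as a stand-in (an assumption for illustration, not lxml's actual code path): it rejects a fragment with two top-level elements, and the hypothetical helper `parse_with_root` then supplies a wrapper, much like the `div` seen in the question's output.

```python
import xml.etree.ElementTree as ET

# Two top-level elements: there is no single root node.
fragment = "<b>bold text</b><i>italic text</i>"


def parse_with_root(markup, wrapper="div"):
    """Parse markup, adding a wrapper element only when a single root is missing.

    Hypothetical helper for illustration; lxml does its wrapping internally.
    """
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        # The strict parser refuses multiple roots, so wrap and parse again.
        return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, markup))


root = parse_with_root(fragment)
print(root.tag)                        # div
print([child.tag for child in root])   # ['b', 'i']
```

Input that already has a single root, such as `<p>text</p>`, is returned without an extra wrapper.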

Hugh Bothwell answered Sep 23 '22 05:09



Cleaner always wraps the result in an element. One solution is to parse the HTML yourself and pass the resulting document object to the cleaner. The result is then also a document object, and you can call text_content() on it to extract the text from the root. Note that text_content() returns the text only, so the markup of allowed tags such as <b> is dropped as well.

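from lxml.html import document_fromstring
from lxml.html.clean import Cleaner  # lxml 5.2+: from lxml_html_clean import Cleaner

evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
# Parse first, so the cleaner receives (and returns) a document object.
doc = document_fromstring(evil)
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
# text_content() returns the text of the cleaned tree, without any tags.
print(cleaner.clean_html(doc).text_content())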

This can also be done as a one liner
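If the allowed tags should be kept in the output, as in the question's expected result, a possible workaround is to post-process the string that `clean_html` returns for string input. The sketch below is not part of lxml: `strip_wrapper` is a hypothetical helper, and its regex assumes the cleaned string consists of exactly one outer wrapper element, which is what the question's output shows.

```python
import re


def strip_wrapper(html, tag="div"):
    """Remove one outer <tag>...</tag> pair if it encloses the whole string.

    Hypothetical helper; assumes a single outer wrapper element, as produced
    in the question. Strings without such a wrapper are returned unchanged.
    """
    pattern = r"^\s*<{0}\b[^>]*>(.*)</{0}>\s*$".format(re.escape(tag))
    match = re.match(pattern, html, flags=re.DOTALL)
    return match.group(1) if match else html


# The cleaned string reported in the question:
cleaned = "<div><b>bold text</b>italic text</div>"
print(strip_wrapper(cleaned))  # <b>bold text</b>italic text
```

With lxml available, this would be used as `strip_wrapper(cleaner.clean_html(evil))`. A regex is a blunt tool for HTML in general; it is adequate here only because the wrapper is known to be the single outermost element.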

cmc answered Sep 23 '22 05:09
