I have this code:

from lxml.html.clean import Cleaner

evil = "<script>malicious script</script><b>bold text</b><i>italic text</i>"
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
print(cleaner.clean_html(evil))
I expected to get this:
<b>bold text</b>italic text
But instead I'm getting this:
<div><b>bold text</b>italic text</div>
Is there an attribute to remove the div tag wrapper?
lxml/libxml2 often parses and repairs broken HTML better, while BeautifulSoup has superior support for encoding detection; which parser works better depends very much on the input. As the lxml documentation puts it about the BeautifulSoup-based parser: "The downside of using this parser is that it is much slower than the HTML parser of lxml."
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
lxml expects your HTML to have a tree structure, i.e. a single root node. If the input does not have one, lxml adds it.
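A minimal sketch of that wrapping behaviour (the input fragment here is just an illustration):

```python
from lxml.html import document_fromstring, tostring

# Two top-level elements: there is no single root node.
fragment = "<b>bold</b><i>italic</i>"

# document_fromstring builds a full document, adding the missing
# <html>/<body> wrapper around the fragment.
doc = document_fromstring(fragment)
print(tostring(doc))  # the fragment ends up inside <html><body>...</body></html>
```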
Cleaner always wraps its result in an element. A good solution is to parse the HTML yourself and pass the resulting document object to the cleaner; the result is then also a document object, and you can call text_content() to extract the text from the root.
from lxml.html import document_fromstring
from lxml.html.clean import Cleaner

evil = "<script>malicious script</script><b>bold text</b><i>italic text</i>"
doc = document_fromstring(evil)
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
print(cleaner.clean_html(doc).text_content())
This can also be done as a one liner
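For instance (a sketch using the same imports and input as above):

```python
from lxml.html import document_fromstring
from lxml.html.clean import Cleaner

evil = "<script>malicious script</script><b>bold text</b><i>italic text</i>"

# Parse, clean, and extract the text in a single expression.
print(Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
              page_structure=True).clean_html(document_fromstring(evil)).text_content())
```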