python [lxml] - cleaning out html tags

    import sys
    from lxml.html.clean import Cleaner

    def clean(text):
        try:
            # Strip scripts, embedded content, <meta> tags, page-structure tags,
            # <link> tags and styles, and additionally drop <a>, <li> and <td> tags.
            cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True,
                              links=True, style=True, remove_tags=['a', 'li', 'td'])
            print len(cleaner.clean_html(text)) - len(text)
            return cleaner.clean_html(text)
        except Exception:
            print 'Error in clean_html'
            print sys.exc_info()
            return text

I put together the above (admittedly ugly) code as one of my first forays into Python. I'm trying to use lxml's Cleaner to clean a couple of HTML pages so that in the end I'm left with just the text and nothing else. But try as I might, the code above doesn't seem to do that: I'm still left with a substantial amount of markup (and it doesn't appear to be broken HTML), and in particular links, which aren't getting removed despite the arguments I pass via remove_tags and links=True.

Any idea what's going on? Perhaps I'm barking up the wrong tree with lxml; I thought it was the way to go for HTML parsing in Python?
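
For reference, here's a minimal sketch of the difference between remove_tags and kill_tags as I understand it (this reflects lxml.html.clean's documented behaviour, so treat it as an assumption; the sample snippet is made up). remove_tags only drops the tags themselves and keeps their text, kill_tags drops the tags together with their content, and links=True appears to target <link> elements rather than <a> anchors:

    # Hypothetical sample input, for illustration only.
    from lxml.html.clean import Cleaner

    snippet = '<div><a href="#">link text</a> plain text</div>'

    # remove_tags drops the listed tags but keeps their text/children.
    keeps_text = Cleaner(remove_tags=['a'])
    print keeps_text.clean_html(snippet)   # the <a> tag goes, 'link text' stays

    # kill_tags drops the tags together with everything inside them.
    drops_all = Cleaner(kill_tags=['a'])
    print drops_all.clean_html(snippet)    # both the <a> tag and 'link text' go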

asked Jun 01 '10 by sadhu_


1 Answer

The solution from David concatenates the text with no separator:

   import lxml.html
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print document.text_content()

but this one helped me, concatenating the text the way I needed:

   from lxml import etree
   print "\n".join(etree.XPath("//text()")(document))
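
To make the difference concrete, here is a small illustration (the input string is hypothetical, not from the question):

   import lxml.html
   from lxml import etree

   # Hypothetical input, just to show the two kinds of output.
   html_string = '<ul><li>one</li><li>two</li></ul>'
   document = lxml.html.document_fromstring(html_string)

   print document.text_content()                       # prints: onetwo
   print "\n".join(etree.XPath("//text()")(document))  # prints 'one' and 'two' on separate lines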
answered Oct 20 '22 by Robert Lujo