My question is slightly related to: Strip HTML from strings in Python
I am looking for a simple way to strip HTML code from text. For example:
string = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(string)
This would then yield foo bar.
Is there any simple tool to achieve this in Python? The HTML code could be nested.
import lxml.html
import re

def stripIt(s):
    doc = lxml.html.fromstring(s)    # parse the HTML string
    txt = doc.xpath('text()')        # top-level text nodes only: ['foo ', ' bar']
    txt = ' '.join(txt)              # 'foo   bar' (three spaces between the words)
    return re.sub(r'\s+', ' ', txt)  # collapse whitespace runs: 'foo bar'

s = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(s)
returns
foo bar
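
If lxml is not available, roughly the same idea can be sketched with only the standard library: keep text that sits outside every tag and drop everything inside tags, however deeply nested. This is a minimal sketch, not a robust HTML cleaner. The names TopLevelText, VOID_TAGS and strip_it_stdlib are made up for illustration, and it assumes the markup is reasonably well formed (every non-void tag is closed).

```python
import re
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TopLevelText(HTMLParser):
    """Collects only the text that is not inside any element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of currently open elements
        self.parts = []   # text chunks seen at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_it_stdlib(s):
    parser = TopLevelText()
    parser.feed(s)
    parser.close()  # flush any buffered trailing text
    # Collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


print(strip_it_stdlib("foo <b> something </b> bar"))
```

Unlike the lxml version, this one also trims leading and trailing whitespace. Unclosed or mismatched tags will throw the depth counter off, so a real parser such as lxml remains the safer choice for messy input.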