Python: strip HTML from text data

Question

My question is slightly related to: Strip HTML from strings in Python

I am looking for a simple way to strip HTML code from text. For example:

string = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(string)

Would then yield foo bar.

Is there any simple tool to achieve this in Python? The HTML code could be nested.

Hugh Bothwell · Accepted Answer

import lxml.html
import re

def stripIt(s):
    doc = lxml.html.fromstring(s)    # parse html string
    txt = doc.xpath('text()')        # top-level text nodes only: ['foo ', ' bar']
    txt = ' '.join(txt)              # 'foo   bar'
    return re.sub(r'\s+', ' ', txt)  # collapse whitespace runs: 'foo bar'

s = 'foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar'
stripIt(s)

returns

foo bar
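
If lxml is not available, the same idea (keep only text that sits outside every tag) can be sketched with the standard library's html.parser. This is a minimal sketch, not part of the answer above: the names TopLevelText, VOID_TAGS and strip_it are made up for illustration, and it assumes the tags in the input are properly closed. An unclosed non-void tag would cause all following text to be dropped.

```python
import re
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelText(HTMLParser):
    """Collect only the text that is not inside any element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of currently open elements
        self.parts = []   # text chunks seen at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def strip_it(s):
    parser = TopLevelText()
    parser.feed(s)
    parser.close()  # flush any text still buffered by the parser
    # Join the chunks, collapse whitespace runs, trim the ends.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


print(strip_it("foo <SOME_VALID_HTML_TAG> something </SOME_VALID_HTML_TAG> bar"))
```

Because the depth counter goes up on every opening tag, nested markup is dropped along with its contents, which matches the behaviour asked for in the question. If the text inside the tags should be kept instead, collecting data at every depth would be the change to make.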

Tags: python, html

Jernej

1 Answer

Hugh Bothwell
