Regular expression to remove HTML tags from a string in Python

I am fetching my results from an RSS feed using the following code:

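try:
    # Text of the first <description> element of the feed item
    desc = item.xpath('description')[0].text
    if date is not None:
        # Put the date above the description, separated by a blank line
        desc = date + "\n\n" + desc
except Exception:
    # Missing or empty description: fall back to no description
    desc = None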

But sometimes the description in the RSS feed contains HTML tags, as below:

This is sample text

<img src="http://imageURL" alt="" />

When displaying the content, I do not want any HTML tags to appear on the page. Is there a regular expression that removes the HTML tags?

Simsons asked Dec 04 '25 01:12


1 Answer

Try:

import re
# Raw string keeps the backslashes intact; matches opening, closing and self-closing tags
pattern = re.compile(r'<\/?\w+\s*[^>]*?\/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
text = pattern.sub(" ", text)
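
As a minimal sketch, the same pattern can be wrapped in a small helper and run against the sample description from the question. The name strip_tags and the whitespace clean-up step are assumptions added for illustration, not part of the snippet above:

```python
import re

# Same pattern as in the answer, written as a raw string.
TAG_RE = re.compile(r'<\/?\w+\s*[^>]*?\/?>', re.DOTALL | re.IGNORECASE)


def strip_tags(text):
    """Replace each HTML tag with a space, then tidy up spaces (hypothetical helper)."""
    if text is None:
        return None
    without_tags = TAG_RE.sub(" ", text)
    # Collapse runs of spaces/tabs left behind by removed tags; newlines are kept.
    return re.sub(r"[ \t]+", " ", without_tags).strip()


# Sample description from the question.
desc = 'This is sample text\n\n<img src="http://imageURL" alt="" />'
print(strip_tags(desc))  # prints: This is sample text
```

A regex like this can misfire when an attribute value contains ">" or when the markup includes comments or script blocks, so for arbitrary HTML a real parser is usually the safer choice.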
pricco answered Dec 06 '25 16:12