Removing HTML tags from a unicode string in Python

I have a string that I scraped from an XML file, and it contains some HTML formatting tags (<b>, <i>, etc.).

Is there a quick and easy way to remove all of these tags from the text?

I tried

str = str.replace("<b>","")

and applied it repeatedly for the other tags, but that doesn't work.

Alex B asked Feb 12 '26 19:02


2 Answers

Using lxml.html:

import lxml.html
lxml.html.fromstring(s).text_content()

This strips all tags and converts all entities to their corresponding characters.
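If lxml is not available, roughly the same effect can be had with the standard library. This is only a sketch of a swapped-in technique (html.parser instead of lxml), and the names TextExtractor and strip_tags are made up for illustration:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags; the tags themselves are dropped."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found outside of a tag
        self.parts.append(data)


def strip_tags(markup):
    """Hypothetical helper: return markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Calling strip_tags(s) on the scraped string should then give back plain text, with closing tags and tags that carry attributes removed as well.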

The answer depends on your exact needs. You could look at regular expressions, but I would advise using BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) if you want to clean up bad XML or HTML.
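For simple, well-behaved markup, the regular-expression route might look like the sketch below. The pattern and the name strip_tags_regex are assumptions for illustration, not a robust HTML parser: it will misbehave on a ">" inside an attribute value, on comments, and on script blocks, which is the reason to prefer a real parser for messy input.

```python
import html
import re

# Matches anything that looks like a tag: "<", one or more non-">" characters, ">"
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(text):
    """Hypothetical helper: drop tag-like spans, then decode entities."""
    without_tags = TAG_RE.sub("", text)
    # html.unescape turns entities such as &lt; back into their characters
    return html.unescape(without_tags)
```

Unlike chaining str.replace("<b>", "") calls, a single pattern also catches closing tags such as </b> and tags that carry attributes.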

Achim answered Feb 15 '26 11:02