Removing HTML tags from a unicode string in Python

Question

I have a string that I scraped from an XML file, and it contains some HTML formatting tags (<b>, <i>, etc.).

Is there a quick and easy way to remove all of these tags from the text?

I tried

str = str.replace("<b>","")

and repeated it for the other tags, but that doesn't work.

Admin · Accepted Answer

Using lxml.html:

import lxml.html
text = lxml.html.fromstring(s).text_content()

This strips all tags and converts all entities to their corresponding characters.
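If installing lxml is not an option, a similar result can probably be had from the standard library. This is not the lxml approach above, just a minimal sketch built on html.parser; the names TagStripper and strip_tags are made up for illustration, and it assumes Python 3.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        # convert_charrefs=True asks the parser to decode entities such as
        # &amp; or &eacute; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(s):
    """Return s with markup removed and entities decoded."""
    parser = TagStripper()
    parser.feed(s)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Calling strip_tags("<b>Caf&eacute;</b> &amp; <i>bar</i>") should give back the plain text with both the tags and the entities resolved.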

Achim · Answer

The answer depends on your exact needs. You could look at regular expressions, but I would advise using BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) if you want to clean up bad XML or HTML.
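For the regular-expression route, something like the following might be enough when the input only contains simple, well-formed tags such as <b> and <i>. It is a rough sketch, not a general HTML parser: strip_tags_regex and TAG_RE are hypothetical names, and the pattern will misfire on text that contains literal < and > characters (for example "1 < 2 and 3 > 2").

```python
import html
import re

# Matches anything that looks like a tag: "<", one or more non-">" chars, ">".
# Assumption: tags in the input never contain a ">" inside an attribute value.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(s):
    """Hypothetical helper: remove tag-like spans, then decode entities."""
    without_tags = TAG_RE.sub("", s)
    # Unescape after stripping so that an escaped "&lt;b&gt;" in the text
    # is kept as literal "<b>" instead of being removed as a tag.
    return html.unescape(without_tags)
```

Unlike a chain of str.replace("<b>", "") calls, this removes opening and closing tags alike, whatever their names are.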

Tags: python, html, string, replace, unicode

Alex B
