Best way to 'clean up' html text

Question

I have the following text:

"It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you..."

I want to remove the HTML tags from this text and store it as Unicode. I am currently doing:

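import re
TAG_RE = re.compile(r'<[^>]+>')  # assumed definition; TAG_RE is not shown in the question
def remove_tags(text):
    return TAG_RE.sub('', text)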

This only strips the tags; entities such as &#8211; are left as they are. How would I correctly encode the above for database storage?

mhawke · Accepted Answer

You could try passing your text through an HTML parser. Here is an example using BeautifulSoup:

from bs4 import BeautifulSoup

text = '''It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you...'''

soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

>>> soup.text
u"It's the show your only friend and pastor have been talking about! 
Wonder Showzen is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's Wonder Showzen tackles valuable life lessons like birth, 
nature, diversity, and history \u2013 all inside the prison of 
your mind! Where else can you..."

You now have a Unicode string with the HTML entities converted to the corresponding Unicode characters; for example, &#8211; was converted to \u2013 (an en dash).

This also removes the HTML tags.
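
If installing BeautifulSoup is not an option, roughly the same result can be sketched with the standard library alone. This is not the approach from the answer above but a swapped-in alternative, assuming Python 3; the names TextExtractor and clean_html are hypothetical and chosen only for illustration. It relies on html.parser.HTMLParser, which decodes character references such as &#8211; when convert_charrefs is enabled.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical helper, not from the answer)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8211; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.chunks.append(data)


def clean_html(text):
    """Return text with tags removed and HTML entities decoded."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Short excerpt of the text from the question
sample = "MTV2's<i> Wonder Showzen</i> tackles history &#8211; all inside"
print(clean_html(sample))
```

Like soup.text, this returns an ordinary Unicode string in which the entity has become the en dash character (U+2013). It does no cleanup beyond that, so badly malformed markup may still be handled better by BeautifulSoup.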

Tags:

python

David542
