 

Best way to 'clean up' HTML text

Tags: python

I have the following text:

"It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you..."

I want to remove the HTML tags and convert the text to Unicode, including entities such as &#8211;. I am currently doing:

def remove_tags(text):
    return TAG_RE.sub('', text)

This only strips the tags. How would I correctly convert the text above to Unicode for database storage?

David542 asked Oct 30 '22 21:10



1 Answer

You could pass your text through an HTML parser. Here is an example using BeautifulSoup:

from bs4 import BeautifulSoup

text = '''It's the show your only friend and pastor have been talking about! 
<i>Wonder Showzen</i> is a hilarious glimpse into the black 
heart of childhood innocence! Get ready as the complete first season of MTV2's<i> Wonder Showzen</i> tackles valuable life lessons like birth, 
nature, diversity, and history &#8211; all inside the prison of 
your mind! Where else can you...'''

# Name the parser explicitly so bs4 does not have to guess one (and warn about it).
soup = BeautifulSoup(text, 'html.parser')

>>> soup.text
u"It's the show your only friend and pastor have been talking about! \nWonder Showzen is a hilarious glimpse into the black \nheart of childhood innocence! Get ready as the complete first season of MTV2's Wonder Showzen tackles valuable life lessons like birth, \nnature, diversity, and history \u2013 all inside the prison of \nyour mind! Where else can you..."

You now have a Unicode string in which the HTML entities have been decoded to the characters they represent: &#8211; became the en dash (U+2013), which the repr above displays as \u2013.
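To make the point about \u2013 concrete, here is a small sketch using only the standard library (Python 3 is assumed; the answer's output above is from Python 2). It decodes the same entity with html.unescape, which is not part of the original answer, and shows that the result is a single en dash character that can be encoded as UTF-8 for storage.

```python
import html
import unicodedata

# Decode the numeric entity from the question's text.
dash = html.unescape("&#8211;")

print(len(dash))                # one character, not a six-character escape
print(ascii(dash))              # ASCII-safe repr, shows the \u2013 escape
print(unicodedata.name(dash))   # official Unicode name of the character
print(dash.encode("utf-8"))     # bytes a UTF-8 database column would hold
```

The \u2013 form is only how the character is displayed in a repr; the string itself holds the real character.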

This also removes the HTML tags.
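If adding BeautifulSoup as a dependency is not an option, a rough standard-library alternative is to keep the regex approach from the question and decode entities afterwards with html.unescape. This is only a sketch under assumptions: the question never shows how TAG_RE is defined, so the pattern below is a guess, clean_html is a hypothetical helper name, and a regex is less robust than a real parser on malformed markup.

```python
import html
import re

# Assumed pattern: the question does not show the real TAG_RE definition.
TAG_RE = re.compile(r'<[^>]+>')

def clean_html(text):
    """Strip tags first, then decode HTML entities into Unicode characters."""
    return html.unescape(TAG_RE.sub('', text))

sample = "MTV2's<i> Wonder Showzen</i> tackles history &#8211; all inside"
cleaned = clean_html(sample)
print(cleaned)
```

Tags are stripped before entities are decoded so that escaped text such as &lt;b&gt; is kept as literal characters rather than being removed as if it were markup.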

mhawke answered Nov 11 '22 19:11
