How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4?


I am processing HTML using Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it is converted to a Unicode non-breaking space character (U+00A0).

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit: I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.
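
For reference, here is a minimal sketch that reproduces the behavior described. It assumes Python 3 and the built-in html.parser backend, neither of which is stated in the question, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Parse a fragment containing the &nbsp; entity.
# "html.parser" is an assumption; the question does not name a parser.
soup = BeautifulSoup("<div>a&nbsp;b</div>", "html.parser")

# The entity is decoded to the Unicode non-breaking space, U+00A0.
text = soup.div.get_text()
print(repr(text))  # 'a\xa0b'

# A plain string replace on the extracted text yields a normal space.
cleaned = text.replace("\xa0", " ")
print(repr(cleaned))  # 'a b'
```

This only cleans the extracted text; the parsed tree still holds the non-breaking space.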

Richard Neish asked Feb 28 '13 14:02





1 Answer

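Pass a formatter function to prettify() that replaces the non-breaking space character (u'\xa0') with a regular space:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'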
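
If the goal is to change the parsed tree itself rather than only the serialized output, one alternative is to rewrite the text nodes in place. This is a sketch, not part of the original answer. It assumes Python 3, Beautiful Soup 4.4 or later (for the string= argument; older releases call it text=) and the html.parser backend. replace_nbsp is a hypothetical helper name, not a library function:

```python
from bs4 import BeautifulSoup

NBSP = "\xa0"  # Unicode non-breaking space, what &nbsp; is decoded to


def replace_nbsp(markup: str) -> str:
    """Return markup with every non-breaking space in text nodes
    replaced by a regular space (hypothetical helper)."""
    # "html.parser" is an assumption; other parsers may add <html>/<body>.
    soup = BeautifulSoup(markup, "html.parser")

    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        if NBSP in node:
            node.replace_with(node.replace(NBSP, " "))

    return str(soup)


print(replace_nbsp("<div>a&nbsp;b</div>"))  # <div>a b</div>
```

Because the replacement happens in the tree, the serializer's default escaping of characters such as & and < is left untouched.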
Fabian answered Sep 18 '22 17:09
