I am processing HTML using Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp;
with a space. Instead, it seems to be converted to a Unicode non-breaking space character.
Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?
Edit to add that I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES
option in Beautiful Soup 3 isn't available.
This should do what you're looking for:
function clean($string) {
    $string = str_replace(' ', '-', $string); // Replaces all spaces with hyphens.
    return preg_replace('/[^A-Za-z0-9\-]/', '', $string); // Removes special chars.
}
Hope it helps!
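You can use unescape() from the standard library. For Python 2.6-2.7 it's HTMLParser.unescape() in the HTMLParser module. For Python 3 it's in html (html.unescape() since Python 3.4). Note that unescape() also decodes &nbsp; to the non-breaking space character U+00A0, so you still need a replace afterwards if you want a regular space.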
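For illustration, here is a minimal sketch of the standard-library route on Python 3.4+. The helper name unescape_to_plain_spaces is made up for this example; html.unescape() is the only library call it relies on.

```python
import html

def unescape_to_plain_spaces(text):
    """Decode HTML entities, then map U+00A0 to a regular space.

    Hypothetical helper for illustration; not part of any library.
    """
    # html.unescape() turns &nbsp; into U+00A0, not into an ASCII space,
    # so the second step is what actually produces a normal space.
    return html.unescape(text).replace("\xa0", " ")

print(repr(html.unescape("a&nbsp;b")))             # 'a\xa0b'
print(repr(unescape_to_plain_spaces("a&nbsp;b")))  # 'a b'
```

This works on plain strings, so it applies to text you have already pulled out of the document rather than to the soup itself.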
Yes, &nbsp; is turned into a non-breaking space character (U+00A0). If you really want those to be regular space characters instead, you'll have to do a Unicode replace.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n <div>\n a b\n </div>\n </body>\n</html>'
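If you only need the text and not the re-serialized markup, the same replace can be applied to the output of get_text(). This is a rough sketch that assumes the built-in "html.parser" backend; the function name text_with_plain_spaces is invented for the example.

```python
from bs4 import BeautifulSoup

def text_with_plain_spaces(markup):
    """Return the document text with U+00A0 replaced by regular spaces.

    Hypothetical helper for illustration; "html.parser" is an assumed
    parser choice, any installed parser should behave the same here.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # get_text() returns a str in which &nbsp; has already become U+00A0.
    return soup.get_text().replace("\xa0", " ")

print(repr(text_with_plain_spaces("<div>a&nbsp;b</div>")))  # 'a b'
```

The soup itself is left untouched, so the non-breaking spaces are still there if you serialize it later without a formatter.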