I have a string of HTML stored in a database. Unfortunately it contains characters such as ® I want to replace these characters by their HTML equivalent, either in the DB itself or using a Find Replace in my Python / Django code.
Any suggestions on how I can do this?
In python, to remove non-ASCII characters in python, we need to use string. encode() with encoding as ASCII and error as ignore, to returns a string without ASCII character use string. decode().
By using encode and decode function we can easily remove non-ASCII characters from Pandas DataFrame. In Python, the encode() function is used to encode the string using a given encoding, and decoding means converting a string of bytes to a Unicode string.
You can use that the ASCII characters are the first 128 ones, so get the number of each character with ord
and strip it if it's out of range
# -*- coding: utf-8 -*-
def strip_non_ascii(string):
''' Returns the string without non ASCII characters'''
stripped = (c for c in string if 0 < ord(c) < 127)
return ''.join(stripped)
test = u'éáé123456tgreáé@€'
print test
print strip_non_ascii(test)
Result
éáé123456tgreáé@€
123456tgre@
Please note that @
is included because, well, after all it's an ASCII character. If you want to strip a particular subset (like just numbers and uppercase and lowercase letters), you can limit the range looking at a ASCII table
EDITED: After reading your question again, maybe you need to escape your HTML code, so all those characters appears correctly once rendered. You can use the escape
filter on your templates.
There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481
To remove non-ASCII characters from a string, s
, use:
s = s.encode('ascii',errors='ignore')
Then convert it from bytes back to a string using:
s = s.decode()
This all using Python 3.6
I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.
def unicode_escape(unistr):
"""
Tidys up unicode entities into HTML friendly entities
Takes a unicode string as an argument
Returns a unicode string
"""
import htmlentitydefs
escaped = ""
for char in unistr:
if ord(char) in htmlentitydefs.codepoint2name:
name = htmlentitydefs.codepoint2name.get(ord(char))
entity = htmlentitydefs.name2codepoint.get(name)
escaped +="&#" + str(entity)
else:
escaped += char
return escaped
Use it like this
>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as ® I want'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With