I have a unicode string with accented latin chars, e.g.
n=unicode('Wikipédia, le projet d’encyclopédie','utf-8')
I want to convert it to plain ASCII, i.e. 'Wikipedia, le projet dencyclopedie', so all acutes, accents, cedillas etc. should get removed.
What is the fastest way to do that? It needs to be done for matching against a long autocomplete dropdown list.
Conclusion: As one of my criteria is speed, Lennart's 'register your own error handler for unicode encoding/decoding' gives the best result (see Alex's answer); the speed difference increases further as more and more chars are latin.
Here is the translation table I am using, along with a modified error handler, as it needs to take care of the whole range of un-encoded chars from error.start to error.end:
# -*- coding: utf-8 -*-
import codecs
"""
This is more of a visual translation, also avoiding multi-character
translations, e.g. £ might otherwise be written as {pound}.
"""
latin_dict = {
u"¡": u"!", u"¢": u"c", u"£": u"L", u"¤": u"o", u"¥": u"Y",
u"¦": u"|", u"§": u"S", u"¨": u"`", u"©": u"c", u"ª": u"a",
u"«": u"<<", u"¬": u"-", u"": u"-", u"®": u"R", u"¯": u"-",
u"°": u"o", u"±": u"+-", u"²": u"2", u"³": u"3", u"´": u"'",
u"µ": u"u", u"¶": u"P", u"·": u".", u"¸": u",", u"¹": u"1",
u"º": u"o", u"»": u">>", u"¼": u"1/4", u"½": u"1/2", u"¾": u"3/4",
u"¿": u"?", u"À": u"A", u"Á": u"A", u"Â": u"A", u"Ã": u"A",
u"Ä": u"A", u"Å": u"A", u"Æ": u"Ae", u"Ç": u"C", u"È": u"E",
u"É": u"E", u"Ê": u"E", u"Ë": u"E", u"Ì": u"I", u"Í": u"I",
u"Î": u"I", u"Ï": u"I", u"Ð": u"D", u"Ñ": u"N", u"Ò": u"O",
u"Ó": u"O", u"Ô": u"O", u"Õ": u"O", u"Ö": u"O", u"×": u"*",
u"Ø": u"O", u"Ù": u"U", u"Ú": u"U", u"Û": u"U", u"Ü": u"U",
u"Ý": u"Y", u"Þ": u"p", u"ß": u"b", u"à": u"a", u"á": u"a",
u"â": u"a", u"ã": u"a", u"ä": u"a", u"å": u"a", u"æ": u"ae",
u"ç": u"c", u"è": u"e", u"é": u"e", u"ê": u"e", u"ë": u"e",
u"ì": u"i", u"í": u"i", u"î": u"i", u"ï": u"i", u"ð": u"d",
u"ñ": u"n", u"ò": u"o", u"ó": u"o", u"ô": u"o", u"õ": u"o",
u"ö": u"o", u"÷": u"/", u"ø": u"o", u"ù": u"u", u"ú": u"u",
u"û": u"u", u"ü": u"u", u"ý": u"y", u"þ": u"p", u"ÿ": u"y",
u"’":u"'"}
def latin2ascii(error):
    """
    error covers the portion of text from error.start to error.end; we
    convert only the first char, hence return error.start+1 instead of
    error.end.
    """
    return latin_dict[error.object[error.start]], error.start + 1

codecs.register_error('latin2ascii', latin2ascii)

if __name__ == "__main__":
    x = u"¼ éíñ§ÐÌëÑ » ¼ ö ® © ’"
    print x
    print x.encode('ascii', 'latin2ascii')
Why I return error.start + 1:
The error object passed to the handler can cover multiple characters, and we convert only the first of these. E.g. if I add print error.start, error.end to the error handler, the output is:
¼ éíñ§ÐÌëÑ » ¼ ö ® © ’
0 1
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
11 12
13 14
15 16
17 18
19 20
21 22
1/4 einSDIeN >> 1/4 o R c '
So in the second line we get chars from 2 to 10, but we convert only the 2nd one, hence return 3 as the continuation point. If we instead return error.end, the output is:
¼ éíñ§ÐÌëÑ » ¼ ö ® © ’
0 1
2 10
11 12
13 14
15 16
17 18
19 20
21 22
1/4 e >> 1/4 o R c '
As we can see, the whole 2-10 portion has been replaced by a single char. Of course it would be faster to just encode the whole range in one go and return error.end, but for demonstration purposes I have kept it simple.
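For illustration, here is a sketch of that faster whole-range variant (my own addition, not from the original answer); as an assumption on my part, anything missing from latin_dict is simply dropped:
def latin2ascii_range(error):
    """Convert the whole un-encodable slice in one go and resume at error.end."""
    chunk = error.object[error.start:error.end]
    # Characters with no entry in latin_dict fall back to u'' (i.e. are dropped).
    return u''.join(latin_dict.get(ch, u'') for ch in chunk), error.end
codecs.register_error('latin2ascii_range', latin2ascii_range)
# print u"¼ éíñ§ÐÌëÑ".encode('ascii', 'latin2ascii_range')  -> 1/4 einSDIeN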
See http://docs.python.org/library/codecs.html#codecs.register_error for more details.
The Latin1 charset (ISO-8859-1) is 100% compatible with being stored in a UTF-8 datastore: ASCII chars stay single-byte, and the extended 128-255 range is still representable (as two bytes each). Going the other way, from UTF-8 to the Latin1 charset, may or may not work: any chars beyond code point 255 cannot be stored in a Latin1 datastore.
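A quick way to see this from Python (a minimal sketch, assuming Python 2 as in the rest of this page):
# -*- coding: utf-8 -*-
# Any Latin-1 text survives a round trip through UTF-8...
s = u'Wikipédia £'                  # only code points <= 255
assert s.encode('utf-8').decode('utf-8').encode('latin-1').decode('latin-1') == s
# ...but text with code points above 255 cannot be stored as Latin-1:
try:
    u'price: \u20ac 5'.encode('latin-1')   # € is U+20AC, outside Latin-1
except UnicodeEncodeError:
    print 'not representable in Latin-1'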
ISO 8859 is an eight-bit extension to ASCII developed by ISO (the International Organization for Standardization). ISO 8859 includes the 128 ASCII characters along with an additional 128 characters, such as the British pound symbol and the American cent symbol.
As of August 2022, 1.3% of all (but only 8 of the top 1000) websites use ISO/IEC 8859-1. It is the most declared single-byte character encoding in the world on the web, but as web browsers interpret it as the superset Windows-1252 the documents may include characters from that set.
So here are three approaches, more or less as given or suggested in other answers:
# -*- coding: utf-8 -*-
import codecs
import unicodedata

x = u"Wikipédia, le projet d’encyclopédie"
xtd = {ord(u'’'): u"'", ord(u'é'): u'e', }

def asciify(error):
    return xtd[ord(error.object[error.start])], error.end
codecs.register_error('asciify', asciify)

def ae():
    return x.encode('ascii', 'asciify')

def ud():
    return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore')

def tr():
    return x.translate(xtd)

if __name__ == '__main__':
    print 'or:', x
    print 'ae:', ae()
    print 'ud:', ud()
    print 'tr:', tr()
Run as main, this emits:
or: Wikipédia, le projet d’encyclopédie
ae: Wikipedia, le projet d'encyclopedie
ud: Wikipedia, le projet dencyclopedie
tr: Wikipedia, le projet d'encyclopedie
showing clearly that the unicodedata-based approach, while it does have the convenience of not needing a translation map xtd, can't translate all characters properly in an automated fashion (it works for accented letters but not for the reverse-apostrophe), so it would also need some auxiliary step to deal explicitly with those (no doubt before what's now its body).
Performance is also interesting. On my laptop with Mac OS X 10.5 and system Python 2.5, quite repeatably:
$ python -mtimeit -s'import a' 'a.ae()'
100000 loops, best of 3: 7.5 usec per loop
$ python -mtimeit -s'import a' 'a.ud()'
100000 loops, best of 3: 3.66 usec per loop
$ python -mtimeit -s'import a' 'a.tr()'
10000 loops, best of 3: 21.4 usec per loop
translate is surprisingly slow (relative to the other approaches). I believe the issue is that the dict is looked into for every character in the translate case (and most are not there), but only for those few characters that ARE there with the asciify approach.
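A rough way to see how often the mapping is consulted (my own instrumentation sketch, reusing x, xtd and the codecs import from the listing above):
calls = [0]
def asciify_counting(error):
    calls[0] += 1
    return xtd[ord(error.object[error.start])], error.end
codecs.register_error('asciify_counting', asciify_counting)
x.encode('ascii', 'asciify_counting')
print calls[0], 'handler calls for a', len(x), 'character string'
# -> 3 handler calls (one each for é, ’, é) for the 35-character string,
#    whereas x.translate(xtd) probes the dict once per character.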
So for completeness, here's the "beefed-up unicodedata" approach:
specstd = {ord(u'’'): u"'", }
def specials(error):
    return specstd.get(ord(error.object[error.start]), u''), error.end
codecs.register_error('specials', specials)

def bu():
    return unicodedata.normalize('NFKD', x).encode('ASCII', 'specials')
this gives the right output, BUT:
$ python -mtimeit -s'import a' 'a.bu()'
100000 loops, best of 3: 10.7 usec per loop
...speed isn't all that good any more. So, if speed matters, it's no doubt worth the trouble of making a complete xtd translation dict and using the asciify approach. When a few extra microseconds per translation are no big deal, one might want to consider the bu approach simply for its convenience (it only needs a translation dict for, hopefully few, special characters that don't translate correctly with the underlying unicodedata idea).
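If one does go the full-coverage route, a complete xtd can be built mechanically from a char-to-char table such as the latin_dict in the accepted answer above (a sketch of my own, not part of the original answer):
# Build a full ordinal -> replacement map from the char -> char latin_dict.
xtd_full = dict((ord(k), v) for k, v in latin_dict.items())
def asciify_full(error):
    # Unknown characters fall back to u'' instead of raising KeyError.
    return xtd_full.get(ord(error.object[error.start]), u''), error.start + 1
codecs.register_error('asciify_full', asciify_full)
# print u"Wikipédia, le projet d’encyclopédie".encode('ascii', 'asciify_full')
# -> Wikipedia, le projet d'encyclopedie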