I am looking to replace all high Unicode characters in a large document, such as accented Es and left and right quotes, with "normal" counterparts in the low range, such as a regular 'E' and straight quotes. I need to do this on a very large document rather often. I see an example of this in what I think might be Perl here: http://www.designmeme.com/mtplugins/lowdown.txt
Is there a fast way of doing this in Python without using s.replace(...).replace(...).replace(...)...? I've tried this with just a few characters to replace and the document stripping became really slow.
EDIT, my version of unutbu's code that doesn't seem to work:
# -*- coding: iso-8859-15 -*-
import unidecode

def ascii_map():
    data={}
    for num in range(256):
        h=num
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)
        except ImportError:
            pass
        else:
            for l,val in enumerate(mod.data):
                i=h<<8
                i+=l
                if i >= 0x80:
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u'“fancy“fancy2'
    print(s.translate(ascii_map()))
# -*- encoding: utf-8 -*-
import unicodedata

def shoehorn_unicode_into_ascii(s):
    return unicodedata.normalize('NFKD', s).encode('ascii','ignore')

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(shoehorn_unicode_into_ascii(s))
    # eeeaucC
Note, as @Mark Tolonen kindly points out, the method above removes some characters like ß‘’“”. If the above code truncates characters that you wish translated, then you may have to use the string's translate method to manually fix these problems. Another option is to use unidecode (see J.F. Sebastian's answer).
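To make the caveat concrete, here is a minimal sketch of the same normalize-then-encode idea, assuming Python 3 (where every str is Unicode). The function name strip_to_ascii is made up for this illustration. It shows accented letters surviving as their base letters, while ß and the curly quotes, which have no decomposition, simply disappear.

```python
import unicodedata

def strip_to_ascii(text):
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Accented letters decompose into base letter + combining mark,
# so the base letter survives.
accented = strip_to_ascii("\u00e9\u00e8\u00ea\u00e0\u00f9\u00e7\u00c7")  # éèêàùçÇ

# U+00DF (sharp s) and the curly quotes have no decomposition,
# so they are silently removed.
lost = strip_to_ascii("Gau\u00df \u201cquoted\u201d")

print(accented)  # eeeaucC
print(lost)      # Gau quoted
```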
When you have a large unicode string, using its translate method will be much, much faster than using the replace method.
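As a rough sketch of the difference, assuming Python 3: chained replace calls scan the whole string once per pair, whereas translate makes a single pass with one table. The REPLACEMENTS pairs and both function names below are chosen only for illustration.

```python
# Illustrative replacement pairs; extend as needed.
REPLACEMENTS = {
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u00e9": "e",   # LATIN SMALL LETTER E WITH ACUTE
}

def with_replace(text):
    """One full pass over the text per replacement pair."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

# Build the table once; str.maketrans accepts a dict of
# single characters mapped to replacement strings.
TABLE = str.maketrans(REPLACEMENTS)

def with_translate(text):
    """A single pass over the text, whatever the size of the table."""
    return text.translate(TABLE)

sample = "\u201cCaf\u00e9\u201d isn\u2019t \u2018plain\u2019"
print(with_translate(sample))
```

Both functions return the same string here; the gap in running time grows with the number of pairs and the size of the document, so it is worth timing on your own data.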
Edit: unidecode has a more complete mapping of unicode codepoints to ascii. However, unidecode.unidecode loops through the string character-by-character (in a Python loop), which is slower than using the translate method.
The following helper function uses unidecode's data files, and the translate method to attain better speed, especially for long strings.
In my tests on 1-6 MB text files, using ascii_map is about 4-6 times faster than unidecode.unidecode.
# -*- coding: utf-8 -*-
import unidecode

def ascii_map():
    data={}
    for num in range(256):
        h=num
        filename='x{num:02x}'.format(num=num)
        try:
            mod = __import__('unidecode.'+filename,
                             fromlist=True)
        except ImportError:
            pass
        else:
            for l,val in enumerate(mod.data):
                i=h<<8
                i+=l
                if i >= 0x80:
                    data[i]=unicode(val)
    return data

if __name__=='__main__':
    s = u"éèêàùçÇ"
    print(s.translate(ascii_map()))
    # eeeaucC
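If unidecode is not available, the same "build the table once, then translate" pattern can be sketched with only the standard library. This is not the unidecode mapping: it assumes Python 3 and substitutes unicodedata's NFKD decompositions, so it covers far fewer characters. The name build_ascii_table and the 0x3000 cut-off are arbitrary choices for this illustration.

```python
import unicodedata

def build_ascii_table(limit=0x3000):
    """Map each non-ASCII code point below `limit` to the ASCII part of
    its NFKD decomposition. Code points with no ASCII part are skipped."""
    table = {}
    for codepoint in range(0x80, limit):
        decomposed = unicodedata.normalize("NFKD", chr(codepoint))
        ascii_form = decomposed.encode("ascii", "ignore").decode("ascii")
        if ascii_form:
            table[codepoint] = ascii_form
    return table

# Build once and reuse for every document.
ASCII_TABLE = build_ascii_table()

result = "\u00e9\u00e8\u00ea\u00e0\u00f9\u00e7\u00c7".translate(ASCII_TABLE)  # éèêàùçÇ
print(result)  # eeeaucC
```

Unlike the encode('ascii', 'ignore') approach, characters missing from the table (ß, for example) are left in place rather than deleted, so they can still be handled by a separate fix-up table.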
Edit2: Rhubarb, if # -*- encoding: utf-8 -*- is causing a SyntaxError, try # -*- encoding: cp1252 -*-. What encoding to declare depends on what encoding your text editor uses to save the file. Linux tends to use utf-8, and (it seems perhaps) Windows tends to use cp1252.
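One way to sidestep the declaration question entirely is to write non-ASCII test characters as \u escapes, so the source file itself stays pure ASCII whatever the editor does. A small sketch, assuming Python 3 (in Python 2 the same escapes work inside u'...' literals); the variable names are illustrative.

```python
import unicodedata

# The literal contains only ASCII bytes, so the file's declared
# encoding cannot change which characters end up in the string.
fancy = "\u201cfancy\u201d"

# Confirm which non-ASCII characters the string really holds.
names = [unicodedata.name(ch) for ch in fancy if ord(ch) > 0x7f]
print(names)
```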
There is no such thing as a "high ascii character". The ASCII character set is limited to ordinal in range(128).
That aside, this is a FAQ. Here's one answer. In general, you should familiarise yourself with str.translate() and unicode.translate() -- very handy for multiple substitutions of single bytes/characters. Beware of answers that mention only the unicodedata.normalize() gimmick; that's just one part of the solution.
Update: The currently-accepted answer blows away characters that don't have a decomposition, as pointed out by Mark Tolonen. There seems to be a lack of knowledge of what unicode.translate() is capable of. It CAN translate one character into multiple characters. Here is the output from help(unicode.translate):
S.translate(table) -> unicode
Return a copy of the string S, where all characters have been mapped through the given translation table, which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted.
Here's an example:
>>> u"Gau\xdf".translate({0xdf: u"ss"})
u'Gauss'
>>>
Here's a table of fix-ups from the solution that I pointed to:
CHAR_REPLACEMENT = {
    # latin-1 characters that don't have a unicode decomposition
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
}
This can be easily extended to cater for the fancy quotes and other non-latin-1 characters found in cp1252 and siblings.
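A minimal sketch of such an extension, assuming Python 3: a few entries from the table above are combined with the cp1252 punctuation, applied with translate first, and then the normalize step handles the ordinary accented letters. The FIXUPS name, the to_ascii helper and the particular ASCII stand-ins for the dashes and ellipsis are choices made for this illustration.

```python
import unicodedata

FIXUPS = {
    0xdf: "ss",      # LATIN SMALL LETTER SHARP S (from the table above)
    0xe6: "ae",      # LATIN SMALL LETTER AE (from the table above)
    0x2018: "'",     # LEFT SINGLE QUOTATION MARK
    0x2019: "'",     # RIGHT SINGLE QUOTATION MARK
    0x201c: '"',     # LEFT DOUBLE QUOTATION MARK
    0x201d: '"',     # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",     # EN DASH
    0x2014: "-",     # EM DASH
    0x2026: "...",   # HORIZONTAL ELLIPSIS
}

def to_ascii(text):
    """Apply the manual fix-ups, then strip accents via NFKD."""
    fixed = text.translate(FIXUPS)
    decomposed = unicodedata.normalize("NFKD", fixed)
    return decomposed.encode("ascii", "ignore").decode("ascii")

converted = to_ascii("\u201cGau\u00df\u201d \u2013 na\u00efve\u2026")
print(converted)  # "Gauss" - naive...
```

Running the fix-ups before the normalize step matters: otherwise the encode call would already have discarded ß and the quotes.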