I'm looking for a fast and, ideally, convenient way in Python 3 to convert strings containing non-ASCII letters into strings with only ASCII letters.
Examples:
żółw => zolw
móżdżek => mozdzek
łódź => lodz
and so on...
There are many letters in national alphabets that can be turned into ASCII letters (like ń to n). I can do it manually for my language (Polish), by specifying how to translate each letter. But is there any automated way to do that? Or some library which would do what I need?
Python's str.encode() won't do, because "żółw".encode('ascii', 'replace') == b'???w' and "żółw".encode('ascii', 'ignore') == b'w'
...
I can do such a translation for Polish letters, but I don't want to do it for every other language:
>>> utf8_letters = ['ą','ę','ć','ź','ż','ó','ł','ń','ś']
>>> ascii_letters = ['a','e','c','z','z','o','l','n','s']
>>> trans_dict = dict(zip(utf8_letters,ascii_letters))
>>> turtle = "żółw"
>>> out = []
>>> for l in turtle:
...     out.append(trans_dict[l] if l in trans_dict else l)
...
>>> result = ''.join(out)
>>> result
'zolw'
The above code does what I want with Polish letters, but it's ugly :< What is the best way to do this?
Of course such translations will change the meanings of some words, but that's OK.
The unicodedata module can be used for this.
It has two functions for working with Unicode character names: name returns the name of a character, and lookup returns the character with a given name.
Let's look at them more closely.
name('Ż') == 'LATIN CAPITAL LETTER Z WITH DOT ABOVE'
name('ł') == 'LATIN SMALL LETTER L WITH STROKE'
lookup('LATIN CAPITAL LETTER Z') == 'Z'
lookup('LATIN SMALL LETTER L') == 'l'
See a pattern? Let's make a function that utilizes it:
import unicodedata

def normalize_char(c):
    try:
        cname = unicodedata.name(c)
        cname = cname[:cname.index(' WITH')]
        return unicodedata.lookup(cname)
    except (ValueError, KeyError):
        return c
normalize_char('ę') == 'e'
normalize_char('Ę') == 'E'
normalize_char('ś') == 's'
It looks for the word WITH in the character name, removes it and everything after it, and feeds the result back to the lookup function.
If the name contains no ' WITH', ValueError is raised, and if no character has the shortened name, KeyError is raised. In both cases the function returns the character unchanged.
And here is a function that "translates" a string based on the previous function:
def normalize(s):
    return ''.join(normalize_char(c) for c in s)
normalize('Móżdżek') == 'Mozdzek'
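Since the question asks for something fast, here is a minimal sketch of the same idea with per-character caching, so that the name lookup runs only once for each distinct character. The names normalize_char_cached and normalize_cached are made up for this illustration, and whether the cache pays off depends on how much text you process.

```python
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def normalize_char_cached(c):
    """Same name-based lookup as normalize_char, memoized per character."""
    try:
        cname = unicodedata.name(c)
        # Keep only the part of the name before ' WITH', e.g.
        # 'LATIN SMALL LETTER L WITH STROKE' -> 'LATIN SMALL LETTER L'.
        cname = cname[:cname.index(' WITH')]
        return unicodedata.lookup(cname)
    except (ValueError, KeyError):
        # Unnamed character, no ' WITH' in the name, or no such base name.
        return c


def normalize_cached(s):
    return ''.join(normalize_char_cached(c) for c in s)


# The examples from the question.
for word in ('żółw', 'móżdżek', 'łódź'):
    print(word, '=>', normalize_cached(word))
```

This prints the three pairs from the question (zolw, mozdzek, lodz). Characters without ' WITH' in their name, such as plain ASCII letters, digits and spaces, pass through unchanged.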
This solution handles all the Polish letters above, including ł, but I'll leave my previous suggestions below.
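The unicodedata module also has a function that promises similar results: normalize with the 'NFKD' parameter (compatibility decomposition). On its own it is not enough, though. It only splits an accented letter into a base letter plus combining marks, which still have to be removed, and it leaves letters that have no decomposition, such as ł, unchanged.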
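To see what NFKD does and does not cover, here is a minimal sketch that decomposes the string and then drops the combining marks. The helper name strip_marks is made up for this illustration; unicodedata.normalize and unicodedata.combining are the standard library functions.

```python
import unicodedata


def strip_marks(s):
    # Split each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize('NFKD', s)
    # combining() returns 0 for ordinary characters and a non-zero
    # class for combining marks, which are dropped here.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks('żółw'))     # zołw
print(strip_marks('móżdżek'))  # mozdzek
print(strip_marks('łódź'))     # łodz
```

The ł survives because it is a letter in its own right in Unicode, with no decomposition into l plus a mark, which is why this approach alone does not cover Polish.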
If you have the character data, the code you provided can be improved.
letters = {'ł': 'l', 'ą': 'a', 'ń': 'n', 'ć': 'c', 'ó': 'o', 'ę': 'e', 'ś': 's', 'ź': 'z', 'ż': 'z'}
trans = str.maketrans(letters)
text = 'żółw'  # the string to convert
result = text.translate(trans)  # 'zolw'
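As a variation, str.maketrans also accepts two strings of equal length, which makes it easy to cover the capital letters too. This is only a sketch for Polish; the names POLISH, ASCII, TRANS and to_ascii_pl are made up for the example.

```python
# Lowercase Polish letters and their ASCII replacements, position by position.
POLISH = 'ąęćźżółńś'
ASCII = 'aeczzolns'

# Build the table once; upper() extends it to the capital letters.
TRANS = str.maketrans(POLISH + POLISH.upper(), ASCII + ASCII.upper())


def to_ascii_pl(text):
    # Characters missing from the table are left unchanged.
    return text.translate(TRANS)


print(to_ascii_pl('Móżdżek'))  # Mozdzek
print(to_ascii_pl('ŁÓDŹ'))     # LODZ
```

Building the table once and reusing it keeps the per-string cost to a single translate call.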
Here is a nice table with character data. It is JavaScript, but it can easily be adapted to Python.
And if you don't mind using external libraries, you might want to try Unidecode. It was made just for this.