
Translating letters not in 7bit ASCII to ASCII (like ń to n and ą to a)

I'm looking for a fast and, if possible, convenient way in Python 3 to translate strings with non-ASCII letters into words with only ASCII letters.

Examples!

żółw => zolw

móżdżek => mozdzek

łódź => lodz

and so on...

There are many letters in national alphabets that can be turned into ASCII letters (like ń to n). I can do it manually for my language (Polish), by specifying how to translate each letter. But is there any automated way to do that? Or some library which would do what I need?

Python's str.encode() won't do, because "żółw".encode('ascii', 'replace') == b'???w' and "żółw".encode('ascii', 'ignore') == b'w'...

I can do such a translation for Polish letters, but I don't want to do it for every other language:

>>> utf8_letters = ['ą','ę','ć','ź','ż','ó','ł','ń','ś']
>>> ascii_letters = ['a','e','c','z','z','o','l','n','s']
>>> trans_dict = dict(zip(utf8_letters,ascii_letters))
>>> turtle = "żółw"
>>> out = []
>>> for l in turtle:
...   out.append(trans_dict[l] if l in trans_dict else l)
>>> result = ''.join(out)
>>> result
'zolw'

The above code does what I want with Polish letters, but it's ugly :< What is the best way to do this?

Of course, such translations will change the meanings of some words, but that's OK.

Maciek asked Jan 19 '12 23:01


1 Answer

The unicodedata module can be used for this. It has functions to manipulate Unicode character names: name and lookup.

Let's look at them more closely:

from unicodedata import name, lookup

name('Ż') == 'LATIN CAPITAL LETTER Z WITH DOT ABOVE'
name('ł') == 'LATIN SMALL LETTER L WITH STROKE'
lookup('LATIN CAPITAL LETTER Z') == 'Z'
lookup('LATIN SMALL LETTER L') == 'l'

See a pattern? Let's make a function that utilizes it:

import unicodedata

def normalize_char(c):
    try:
        cname = unicodedata.name(c)
        cname = cname[:cname.index(' WITH')]
        return unicodedata.lookup(cname)
    except (ValueError, KeyError):
        return c

normalize_char('ę') == 'e'
normalize_char('Ę') == 'E'
normalize_char('ś') == 's'

It looks for the word WITH in the character name, removes it and everything after it, and feeds the result back to the lookup function.
If the name contains no ' WITH' (or the character has no name at all), ValueError is raised; if no character has the shortened name, KeyError is raised. In both cases the function returns the character unchanged.

And here is a function that "translates" a string based on the previous function:

def normalize(s):
    return ''.join(normalize_char(c) for c in s)

normalize('Móżdżek') == 'Mozdzek'
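
As a quick check, the two functions above can be run against the words from the question. This is only a minimal sketch: it repeats the definitions so that it runs on its own, the helper name is_plain_ascii is made up for illustration, and str.isascii() assumes Python 3.7 or later. It also shows a case the name trick does not cover, a letter such as ß whose name contains no WITH, so the result is not guaranteed to be pure ASCII.

```python
import unicodedata

def normalize_char(c):
    try:
        cname = unicodedata.name(c)
        # Cut the name at ' WITH', e.g. '... Z WITH DOT ABOVE' -> '... Z'.
        cname = cname[:cname.index(' WITH')]
        return unicodedata.lookup(cname)
    except (ValueError, KeyError):
        # No name, no ' WITH', or no character with the shortened name.
        return c

def normalize(s):
    return ''.join(normalize_char(c) for c in s)

def is_plain_ascii(s):
    # Hypothetical helper: reports whether the result really is pure ASCII.
    return s.isascii()

for word in ['żółw', 'móżdżek', 'łódź']:
    print(word, '=>', normalize(word))

# 'ß' is named LATIN SMALL LETTER SHARP S, so it is returned unchanged.
leftover = normalize('straße')
print(leftover, is_plain_ascii(leftover))
```

If pure ASCII output is a hard requirement, a check like this can flag the strings that still need a manual mapping or a dedicated library.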

This works well for Latin letters with diacritics, but I'll leave my earlier suggestions below.


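The unicodedata module also has a function that promises similar results: normalize with the 'NFKD' parameter (compatibility decomposition). However, it only splits a character into a base letter plus combining marks, which then still have to be removed, and it misses letters that have no decomposition, such as ł.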
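
For comparison, here is a minimal sketch of the NFKD approach: decompose each character, then drop the combining marks. The function name strip_marks is an assumption made for illustration. Letters that have no decomposition, such as ł, pass through unchanged.

```python
import unicodedata

def strip_marks(s):
    # Split characters such as 'ż' into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', s)
    # unicodedata.combining() returns 0 for non-combining characters,
    # so this keeps the base letters and drops the marks.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks('móżdżek'))  # every accented letter here decomposes
print(strip_marks('żółw'))     # 'ł' has no decomposition and is kept as-is
```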


If you have the character data, the code you provided can be improved.

letters = {'ł': 'l', 'ą': 'a', 'ń': 'n', 'ć': 'c', 'ó': 'o', 'ę': 'e', 'ś': 's', 'ź': 'z', 'ż': 'z'}
trans = str.maketrans(letters)  # build the translation table once
text = 'żółw'
result = text.translate(trans)  # 'zolw'
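
The table above only covers lowercase letters. As a hedged sketch, it could be extended with the uppercase variants and wrapped in a function; the names POLISH_TO_ASCII and to_ascii_polish are made up for illustration.

```python
letters = {'ł': 'l', 'ą': 'a', 'ń': 'n', 'ć': 'c', 'ó': 'o',
           'ę': 'e', 'ś': 's', 'ź': 'z', 'ż': 'z'}

# Add the uppercase variants so that 'Ł' maps to 'L', and so on.
letters.update({k.upper(): v.upper() for k, v in list(letters.items())})

# Build the table once; str.translate can then reuse it for every string.
POLISH_TO_ASCII = str.maketrans(letters)

def to_ascii_polish(text):
    # Characters that are not in the table are left untouched.
    return text.translate(POLISH_TO_ASCII)

print(to_ascii_polish('Łódź'))
```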

Here is a nice table with character data. It is written in JavaScript but can easily be adapted for Python.


And if you don't mind using external libraries, you might want to try Unidecode. It was made just for this.

Oleh Prypin answered Nov 15 '22 16:11