
Emacs Lisp: Translate characters to standard ASCII transcription

I am trying to write a function that translates a string containing Unicode characters into some default ASCII transcription. Ideally, I'd like e.g. Ångström to become Angstroem or, if that is not possible, Angstrom. Likewise, α=χ should become a=x (or a=c?) or similar.

Does Emacs have such built-in capabilities? I know I can get the names and similar properties of characters (via get-char-code-property), but I know of no built-in transcription table.
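One partial approach that uses only built-in pieces is to decompose the string canonically (NFD) and drop the combining marks, identified through the general-category property. The sketch below is only an illustration of that idea: the function name my-strip-diacritics is hypothetical, and it covers the Angstrom case only. It does not produce "oe" for ö, and characters without a Latin base letter, such as α, χ or ß, pass through unchanged.

```elisp
(require 'ucs-normalize)

(defun my-strip-diacritics (string)
  "Return STRING with combining marks removed after NFD normalization.
Hypothetical helper: characters that have no canonical decomposition
into a base letter plus marks are returned unchanged."
  (mapconcat
   (lambda (ch)
     ;; `Mn' is the Unicode general category of non-spacing marks.
     (if (eq (get-char-code-property ch 'general-category) 'Mn)
         ""
       (string ch)))
   (ucs-normalize-NFD-string string)
   ""))

;; Example call:
;; (my-strip-diacritics "Ångström")
```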

The purpose is to translate titles of entries into meaningfully readable filenames, avoiding problems with software that doesn't understand Unicode.

My current strategy is to build a translation table by hand, but this approach is fairly limited and requires a lot of maintenance.
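To make the hand-built approach concrete, a minimal sketch might look like the following. The names my-ascii-transcriptions and my-transcribe are hypothetical, and the table holds only the four characters from the examples above, which is exactly why it scales poorly: every further character needs its own entry.

```elisp
(defvar my-ascii-transcriptions
  '((?Å . "A") (?ö . "oe") (?α . "a") (?χ . "x"))
  "Hypothetical hand-maintained alist mapping characters to ASCII strings.")

(defun my-transcribe (string)
  "Replace each character of STRING that has an entry in the table.
Characters without an entry in `my-ascii-transcriptions' are kept as-is."
  (mapconcat
   (lambda (ch)
     ;; Characters are integers, so `assq' can look them up directly.
     (or (cdr (assq ch my-ascii-transcriptions))
         (string ch)))
   string
   ""))

;; Example calls:
;; (my-transcribe "Ångström")
;; (my-transcribe "α=χ")
```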

kdb asked Nov 02 '22 20:11


1 Answer

There is no built-in capability that I know of. I wrote a package, unidecode, specifically for this task. It uses the same approach as the Python library of the same name. To install it, add the MELPA repository to your package archive list:

(require 'package)  ; defines `package-archives'
(add-to-list 'package-archives
  '("melpa" . "https://melpa.org/packages/") t)  ; MELPA's current URL

Then run M-x package-install RET unidecode. unidecode provides two functions: unidecode-unidecode, which turns Unicode into ASCII, and unidecode-sanitize, which additionally discards non-alphanumeric characters and replaces spaces with hyphens (its output in the example below is also lowercased).

ELISP> (unidecode-unidecode "¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა")
"!Hola!, Gruss Gott, Hyvaa paivaa, Tere ohtust, Bongu Czesc!, Dobry den, Zdravstvuite!, Geia sas, lmsllmlllmckhmslmgll"
ELISP> (unidecode-sanitize "¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა")
"hola-gruss-gott-hyvaa-paivaa-tere-ohtust-bongu-czesc-dobry-den-zdravstvuite-geia-sas-lmsllmlllmckhmslmgll"
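For the filename use case from the question, unidecode-sanitize can be wrapped in a small helper. This is only a sketch: my-title-to-filename and its EXTENSION argument are hypothetical names that are not part of unidecode, and the code assumes the package has been installed as described above.

```elisp
;; Assumes the unidecode package from MELPA is installed.
(require 'unidecode)

(defun my-title-to-filename (title &optional extension)
  "Return an ASCII-only file name derived from TITLE.
If EXTENSION is non-nil, append it after a dot.
Hypothetical helper, not part of unidecode."
  (let ((base (unidecode-sanitize title)))
    (if extension
        (concat base "." extension)
      base)))

;; Example call:
;; (my-title-to-filename "Grüß Gott" "org")
```

Because the helper only calls unidecode-sanitize, any change to how that function treats punctuation or case carries over to the generated names.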
Mirzhan Irkegulov answered Nov 08 '22 04:11