Best way to translate UTF-8 to ISO8859-1 in Go

I'm trying to map UTF-8 characters to their "similar" ISO8859-1 representation. That means removing diacritics, but also replacing characters like Ł with L or ı with i.

Example: José Kakışır should become Jose Kakisir.

I'm aware that removing diacritics can be done this way:

// (From https://blog.golang.org/normalization#TOC_10.)
package main

import (
    "fmt"
    "unicode"

    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

func main() {
    // Mn: nonspacing marks (combining diacritics).
    isMn := func(r rune) bool {
        return unicode.Is(unicode.Mn, r)
    }
    // Decompose (NFD), drop the combining marks, then recompose (NFC).
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    result, _, err := transform.String(t, "José Kakışır")
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
}

This prints Jose Kakısır: ş is replaced with s, but ı is not replaced with i.

What's the best way to achieve that in Go?

derFunk asked Nov 08 '22 13:11


1 Answer

There are two ideas from the Unicode spec that might be used to identify "similar" characters.

The first is the decomposition of characters into a base character plus a combining mark. Your code takes advantage of this: it decomposes the string and then removes the combining marks, leaving the base characters.

Unfortunately, the "i" character does not decompose into a dotless "ı" plus a combining dot, so this approach cannot turn "ı" into "i" (if anybody understands why this decision was made, please comment!). This is also discussed here: Why do LATIN SMALL LETTER DOTLESS I, COMBINING DOT ABOVE not get normalized to "i" in NFC form?
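To make the decomposition idea concrete, here is a minimal standard-library sketch. It assumes the input is already in decomposed (NFD) form, so it only performs the "remove the combining mark" step; stripMarks is a hypothetical helper name, not something from golang.org/x/text.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripMarks drops every nonspacing mark (category Mn) from s.
// Assumption: s is already decomposed; no normalization is done here.
func stripMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// "e" + U+0301 COMBINING ACUTE ACCENT is the decomposed form of "é".
	fmt.Printf("%q\n", stripMarks("Jose\u0301"))
	// "s" + U+0327 COMBINING CEDILLA is the decomposed form of "ş".
	// U+0131 (dotless i) is a single code point with no mark to remove.
	fmt.Printf("%q\n", stripMarks("Kak\u0131s\u0327\u0131r"))
}
```

The accent and the cedilla disappear because they are separate Mn code points, while U+0131 passes through untouched.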

The second is the mapping of characters to "confusable" characters as defined in Unicode TR39. For example, you will find the following line in http://www.unicode.org/Public/security/latest/confusables.txt

0131 ; 0069 ; MA # ( ı → i ) LATIN SMALL LETTER DOTLESS I → LATIN SMALL LETTER I #

This mapping exists to identify strings that could be "confused" for other strings for security purposes (e.g. spoofing domains). It allows you to convert a string to its "skeleton": two strings with the same skeleton are potentially visually confusable. For example, the skeleton of "𝔭𝒶ỿ𝕡𝕒ℓ" is "paypal", and the skeleton of "José Kakışır" is "José Kakișir". You could try this for your purposes, but this is not recommended per the spec:

A skeleton is intended only for internal use for testing confusability of strings; the resulting text is not suitable for display to users, because it will appear to be a hodgepodge of different scripts. In particular, the result of mapping an identifier will not necessary be an identifier. Thus the confusability mappings can be used to test whether two identifiers are confusable (if their skeletons are the same), but should definitely not be used as a "normalization" of identifiers.

If you do choose to try this, here is a Go package: https://github.com/mtibben/confusables
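If you would rather inspect the data yourself, the confusables.txt line quoted above has a simple semicolon-separated layout. The following is a rough sketch of reading one such line with the standard library only. parseConfusable is a hypothetical helper, and it assumes the source field holds a single code point while the target may hold several.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConfusable reads one data line of confusables.txt and returns the
// source rune and the target string. Hypothetical helper for illustration.
func parseConfusable(line string) (rune, string, error) {
	// Drop the trailing comment, then split the semicolon-separated fields.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	fields := strings.Split(line, ";")
	if len(fields) < 2 {
		return 0, "", fmt.Errorf("malformed line: %q", line)
	}
	src, err := strconv.ParseUint(strings.TrimSpace(fields[0]), 16, 32)
	if err != nil {
		return 0, "", err
	}
	// The target field is a space-separated list of hex code points.
	var target strings.Builder
	for _, hex := range strings.Fields(fields[1]) {
		cp, err := strconv.ParseUint(hex, 16, 32)
		if err != nil {
			return 0, "", err
		}
		target.WriteRune(rune(cp))
	}
	return rune(src), target.String(), nil
}

func main() {
	line := "0131 ; 0069 ; MA # ( ı → i ) LATIN SMALL LETTER DOTLESS I → LATIN SMALL LETTER I #"
	src, target, err := parseConfusable(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("U+%04X -> %q\n", src, target)
}
```

A table built this way inherits the caveat from the spec: it is meant for comparing strings, not for producing display text.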

Another option is a custom mapping of characters to logically similar characters suitable for your application, based on some knowledgeable person's judgment about "similarity". I am not aware of any such mappings. Depending on your application, you might try to build one manually.
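A manual mapping could look roughly like this. The table entries are illustrative assumptions chosen for the example in the question, not taken from any standard, and applyFallback is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// fallback is a hand-picked, deliberately incomplete table.
// Every entry is an application-specific judgment call.
var fallback = map[rune]string{
	'\u0141': "L", // Ł
	'\u0142': "l", // ł
	'\u0131': "i", // ı
	'\u015E': "S", // Ş
	'\u015F': "s", // ş
}

// applyFallback replaces every rune found in the table and copies all
// other runes through unchanged.
func applyFallback(s string) string {
	var b strings.Builder
	for _, r := range s {
		if repl, ok := fallback[r]; ok {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// "José Kakışır" written with escapes so the code points are explicit.
	fmt.Println(applyFallback("Jos\u00e9 Kak\u0131\u015f\u0131r"))
}
```

Using string values rather than runes leaves room for multi-character replacements such as "ss" for ß, should your application want them.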

Also note: "é" and many other accented characters are supported by the ISO 8859-1 character set, so removing the accent is not necessary. Whatever you end up implementing, your code should first determine whether the rune is supported by the encoding before attempting to map it to a similar character.
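As a sketch of that "check first" order, the code below relies on the fact that the first 256 Unicode code points coincide with ISO 8859-1 (as registered with IANA, control codes included), so any rune up to U+00FF can be written as a single byte. toLatin1 and similarByte are hypothetical names, and the '?' placeholder for unmapped runes is an assumption; pick whatever suits your application.

```go
package main

import "fmt"

// similarByte is a stand-in for whatever "similar character" lookup you
// choose; these two entries are only examples.
func similarByte(r rune) (byte, bool) {
	switch r {
	case '\u0131': // ı
		return 'i', true
	case '\u015F': // ş
		return 's', true
	}
	return 0, false
}

// toLatin1 encodes s as ISO 8859-1 bytes. Runes the encoding supports are
// kept as they are; only the rest go through the similarity lookup.
func toLatin1(s string, similar func(rune) (byte, bool)) []byte {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r <= 0xFF {
			out = append(out, byte(r)) // representable: keep unchanged
			continue
		}
		if b, ok := similar(r); ok {
			out = append(out, b)
		} else {
			out = append(out, '?') // assumed placeholder for unmapped runes
		}
	}
	return out
}

func main() {
	out := toLatin1("Jos\u00e9 Kak\u0131\u015f\u0131r", similarByte)
	fmt.Printf("% x\n", out) // prints: 4a 6f 73 e9 20 4b 61 6b 69 73 69 72
}
```

Note that "é" survives as the single byte 0xE9, while ı and ş are replaced.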

Jonathan Warden answered Nov 15 '22 08:11