
Removing accents and diacritics in Kotlin

Tags:

string

kotlin

Is there any way to convert a string like 'Dziękuję' to 'Dziekuje', or 'šećer' to 'secer', in Kotlin? I have tried using java.text.Normalizer, but it doesn't seem to work the way I want.

Rebronja asked Aug 07 '18 16:08


2 Answers

Normalizer only does half the work. Here's how you could use it:

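import java.text.Normalizer

// Matches runs of characters from the Combining Diacritical Marks block (U+0300 to U+036F).
private val REGEX_UNACCENT = "\\p{InCombiningDiacriticalMarks}+".toRegex()

fun CharSequence.unaccent(): String {
    // NFD splits each accented letter into its base letter plus combining marks.
    val temp = Normalizer.normalize(this, Normalizer.Form.NFD)
    // Dropping the marks leaves only the base letters.
    return REGEX_UNACCENT.replace(temp, "")
}

assert("áéíóů".unaccent() == "aeiou")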

And here's how it works:

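First we call normalize() with the NFD form. If we pass à, the method returns a followed by the combining grave accent (U+0300). Then a regular expression strips the combining marks, so only the base letters remain. Letters that have no canonical decomposition, such as ł or ø, pass through unchanged.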

Source: http://www.rgagnon.com/javadetails/java-0456.html

Note that Normalizer is a Java class; this is not pure Kotlin and it will only work on the JVM.
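To see the decomposition step described above in isolation, here is a minimal sketch. The helper name utf16Units is hypothetical and only there for illustration; it prints each UTF-16 unit of a string in U+XXXX form, which is enough for the characters used here.

```kotlin
import java.text.Normalizer

// Hypothetical helper: formats each UTF-16 unit of a string as U+XXXX.
fun utf16Units(s: String): List<String> =
    s.map { "U+%04X".format(it.code) }

fun main() {
    val precomposed = "\u00E0" // à as a single code point
    val decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD)

    println(utf16Units(precomposed)) // [U+00E0]
    println(utf16Units(decomposed))  // [U+0061, U+0300]

    // ł (U+0142) has no canonical decomposition, so NFD leaves it as is.
    println(Normalizer.normalize("\u0142", Normalizer.Form.NFD) == "\u0142") // true
}
```

The last line shows why this approach cannot turn ł into l: there is no combining mark to strip, so such letters would need an explicit replacement table.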

Eugen Pechanec answered Sep 20 '22 09:09


TL;DR:

  1. Use Normalizer to canonically decompose the Unicode text.
  2. Remove non-spacing combining characters (\p{Mn}).

import java.text.Normalizer

// Decompose (NFD), then drop every non-spacing mark (\p{Mn}).
fun String.removeNonSpacingMarks() =
    Normalizer.normalize(this, Normalizer.Form.NFD)
        .replace("\\p{Mn}+".toRegex(), "")

Long answer:

Using Normalizer you can transform the original text into an equivalent composed or decomposed form.

  • NFD: Canonical decomposition.
  • NFC: Canonical decomposition, followed by canonical composition.

[Figure: Canonical Composites]
(More information about normalization can be found in Unicode® Standard Annex #15.)
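As a small illustration of the two forms listed above, the sketch below checks that NFD and NFC convert between the precomposed and decomposed spellings of the same letter. The sample character é is my own choice, not taken from the question.

```kotlin
import java.text.Normalizer

fun main() {
    val composed = "\u00E9"    // é as a single precomposed code point
    val decomposed = "e\u0301" // e followed by COMBINING ACUTE ACCENT

    // The two strings render the same but are not equal as character sequences.
    println(composed == decomposed) // false

    // NFD takes the composed form apart; NFC puts the decomposed form back together.
    println(Normalizer.normalize(composed, Normalizer.Form.NFD) == decomposed) // true
    println(Normalizer.normalize(decomposed, Normalizer.Form.NFC) == composed) // true
}
```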

In our case, we are interested in the NFD normalization form because it separates all the combining characters from their base characters.

After decomposing the text, we run a regex to remove the characters produced by the decomposition that are combining characters.

Combining characters are special characters intended to be positioned relative to an associated base character. The Unicode Standard distinguishes two types of combining characters: spacing and nonspacing.

We are only interested in non-spacing combining characters. Diacritics are the principal class (but not the only one) of this group used with Latin, Greek, and Cyrillic scripts and their relatives.

To remove non-spacing characters with a regex, we have to use \p{Mn}. This category covers every non-spacing mark (1,826 characters at the time of writing; the count grows with new Unicode versions).

Other answers use \p{InCombiningDiacriticalMarks}, which matches only the Combining Diacritical Marks block (U+0300 to U+036F). It is a subset of \p{Mn} that includes only 112 characters.
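A minimal sketch of the practical difference between the two patterns. The helper stripWith and the sample string are assumptions made for illustration; U+20D7 (COMBINING RIGHT ARROW ABOVE) is a non-spacing mark that lives outside the U+0300 to U+036F block.

```kotlin
import java.text.Normalizer

private val BLOCK_ONLY = "\\p{InCombiningDiacriticalMarks}+".toRegex()
private val ALL_NONSPACING = "\\p{Mn}+".toRegex()

// Hypothetical helper: decomposes the text, then removes whatever the regex matches.
fun stripWith(regex: Regex, text: String): String =
    regex.replace(Normalizer.normalize(text, Normalizer.Form.NFD), "")

fun main() {
    // e + combining acute (U+0301), then a + combining right arrow above (U+20D7).
    val sample = "e\u0301 a\u20D7"

    // The block-based pattern removes U+0301 but leaves U+20D7 in place.
    println(stripWith(BLOCK_ONLY, sample) == "e a\u20D7") // true

    // The category-based pattern removes both marks.
    println(stripWith(ALL_NONSPACING, sample) == "e a") // true
}
```

For the Latin-script examples in the question both patterns give the same result; the difference only shows up with marks from other blocks.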

David Miguel answered Sep 18 '22 09:09