Is there any way to convert string like 'Dziękuję' to 'Dziekuje' or 'šećer' to 'secer' in kotlin. I have tried using java.text.Normalizer but it doesn't seem to work the desired way.
Normalizer only does half the work. Here's how you could use it:
private val REGEX_UNACCENT = "\\p{InCombiningDiacriticalMarks}+".toRegex()
fun CharSequence.unaccent(): String {
val temp = Normalizer.normalize(this, Normalizer.Form.NFD)
return REGEX_UNACCENT.replace(temp, "")
}
assert("áéíóů".unaccent() == "aeiou")
And here's how it works:
We are calling the normalize(). If we pass à, the method returns a + ` . Then using a regular expression, we clean up the string to keep only valid US-ASCII characters.
Source: http://www.rgagnon.com/javadetails/java-0456.html
Note that Normalizer
is a Java class; this is not pure Kotlin and it will only work on JVM.
TL;DR:
Normalizer
to canonically decomposed the Unicode thext.\p{Mn}
).fun String.removeNonSpacingMarks() =
Normalizer.normalize(this, Normalizer.Form.NFD)
.replace("\\p{Mn}+".toRegex(), "")
Long answer:
Using Normalizer you can transform the original text into an equivalent composed or decomposed form.
.
(more info about normalization can be found in the Unicode® Standard Annex #15)
In our case, we are interested in NFD normalization form because it allows us to separate all the combined characters from the base character.
After decomposing the text, we have to run a regex to remove all the new characters resulting from the decomposition that correspond to combined characters.
Combined characters are special characters intended to be positioned relative to an associated base character. The Unicode Standard distinguishes two types of combining characters: spacing and nonspacing.
We are only interested in non-spacing combining characters. Diacritics are the principal class (but not the only one) of this group used with Latin, Greek, and Cyrillic scripts and their relatives.
To remove non-spacing characters with a regex we have to use \p{Mn}
. This group includes all the 1,826 non-spacing characters.
Other answers uses \p{InCombiningDiacriticalMarks}
, this block only includes combining diacritical marks. It is a subset of \p{Mn}
that includes only 112 characters.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With