I'm trying to convert all accented Latin Unicode characters into their [a-z]
representations:
ó --> o
í --> i
I can easily do this one character at a time, for example:
myString = myString.replaceAll("ó","o");
But since there are tons of variations, this approach is impractical.
Is there another way of doing it in Java, for example a regular expression
or a utility library?
USE CASE:
1- City names from other languages into English, e.g.
Espírito Santo --> Espirito Santo
This answer requires Java 1.6 or above, which added java.text.Normalizer.
String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
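To see why the two lines above work, it can help to look at the code points before and after normalization. The sketch below is only an illustration: the class name DecompositionDemo and the helper codePoints are made up for this example, and it assumes Java 8 or later for String.codePoints().

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Hypothetical helper: formats each code point of s as U+XXXX, space-separated.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String composed = "\u00F3"; // "ó" stored as one precomposed character
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(codePoints(composed));   // U+00F3
        System.out.println(codePoints(decomposed)); // U+006F U+0301
    }
}
```

After NFD, "ó" becomes a plain "o" (U+006F) followed by a combining acute accent (U+0301). The regular expression then deletes only the combining mark, which is why the base letter survives.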
Example:
import java.text.Normalizer;

public class Main {
    public static void main(String[] args) {
        String input = "Árvíztűrő tükörfúrógép";
        System.out.println("Input: " + input);

        // NFD splits each accented letter into a base letter plus combining marks
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
        System.out.println("Normalized: " + normalized);

        // Remove the combining marks, leaving only the base letters
        String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        System.out.println("Result: " + accentRemoved);
    }
}
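Output:
Input: Árvíztűrő tükörfúrógép
Normalized: Árvíztűrő tükörfúrógép
Result: Arvizturo tukorfurogep
The Normalized line looks the same as the input because the combining marks still render on top of their base letters.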
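For the city-name use case in the question, the same two steps could be wrapped in a small reusable method. This is a minimal sketch, not a library API: the class name AccentStripper and the method stripAccents are hypothetical, and the pattern is simply the one from the answer, compiled once so it is not rebuilt on every call.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentStripper {
    // Same pattern as in the answer, precompiled for reuse.
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Hypothetical helper: decomposes the input, then drops the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Espírito Santo")); // Espirito Santo
    }
}
```

Note that this only helps for letters that decompose into a base letter plus a mark. Letters such as ø or ł have no canonical decomposition, so they pass through unchanged and would need an explicit mapping if they matter for your data.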