How to replace Latin Unicode characters with [a-z] characters

I'm trying to convert all Latin Unicode characters into their [a-z] representations:

ó --> o
í --> i

I can easily do them one by one, for example:

myString = myString.replaceAll("ó","o");

But since there are tons of variations, this approach is impractical.

Is there another way of doing it in Java, for example a regular expression or a utility library?

USE CASE:

1- City names from other languages converted into English, e.g.

Espírito Santo --> Espirito Santo

nafas asked Sep 22 '15 13:09

1 Answer

This answer requires Java 1.6 or above, which added java.text.Normalizer.

    String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
    String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
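
If you need this in more than one place, the two lines above can be wrapped in a small helper. This is only a sketch: the class name AccentStripper and the method name stripAccents are made up for illustration, and the pattern is precompiled simply to avoid recompiling the regex on every call.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from any library.
class AccentStripper {
    // Matches runs of characters from the Combining Diacritical Marks block (U+0300..U+036F).
    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    static String stripAccents(String input) {
        // NFD splits each accented letter into a base letter followed by combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Dropping the marks leaves only the base letters.
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Espírito Santo")); // prints "Espirito Santo"
    }
}
```

For the city-name use case in the question, stripAccents("Espírito Santo") gives "Espirito Santo".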

Example:

import java.text.Normalizer;

public class Main {
    public static void main(String[] args) {
        String input = "Árvíztűrő tükörfúrógép";
        System.out.println("Input: " + input);
        // NFD splits each accented letter into a base letter plus combining marks
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFD);
        System.out.println("Normalized: " + normalized);
        // Remove the combining marks, leaving only the base letters
        String accentRemoved = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        System.out.println("Result: " + accentRemoved);
    }
}

Result (the "Normalized:" line is left out here, since it typically renders the same as the input):

Input: Árvíztűrő tükörfúrógép
Result: Arvizturo tukorfurogep
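
One caveat: this only works for letters that Unicode decomposes into a base letter plus combining marks. Letters such as ø, ł, ß and æ have no canonical decomposition, so they pass through unchanged. A rough sketch of one way to handle them is a small hand-written fallback table applied after the accent stripping. The class name AsciiFolder, the method name fold and the table contents are assumptions for illustration, and the table is deliberately incomplete.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the fallback table is illustrative and far from complete.
class AsciiFolder {
    // Letters with no canonical decomposition, written as Unicode escapes
    // so the table does not depend on the source file encoding.
    private static final Map<String, String> EXTRA = new LinkedHashMap<>();
    static {
        EXTRA.put("\u00f8", "o");   // ø
        EXTRA.put("\u00d8", "O");   // Ø
        EXTRA.put("\u0142", "l");   // ł
        EXTRA.put("\u0141", "L");   // Ł
        EXTRA.put("\u00df", "ss");  // ß
        EXTRA.put("\u00e6", "ae");  // æ
        EXTRA.put("\u00c6", "AE");  // Æ
    }

    static String fold(String input) {
        // Step 1: decompose and drop combining marks, as in the answer above.
        String s = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: replace the remaining letters listed in the fallback table.
        for (Map.Entry<String, String> e : EXTRA.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(fold("Tromsø")); // prints "Tromso"
        System.out.println(fold("Łódź"));   // prints "Lodz"
    }
}
```

Without the second step, "Tromsø" would come back unchanged and "Łódź" would become "Łodz". Which replacements are appropriate (for example "ss" for ß) depends on the language, so treat the table as a starting point.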
EpicPandaForce answered Sep 21 '22 23:09
