I am trying to find Unicode variants of user-entered characters in a text so that I can highlight them. E.g. if the user enters "Beyonce", I'd like to highlight all variants like "Beyoncé", "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is to create a regex by replacing each character of the input string with a character group, like this:
"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"
But this seems to be a very tedious and error-prone way of doing it. What I am basically looking for is a regex character group that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is there something like this available in Java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.
There are several ways an accented character can be represented. There's a good example in the Javadoc of java.text.Normalizer:
For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
The second form makes it relatively easy to get at the unaccented base character, and fortunately Normalizer can help you here:
Normalizer.normalize(text, Form.NFD); // NFD = "Canonical decomposition"
After decomposition the accents are separate combining characters outside the ASCII range, so you can then use a regex to ignore (or remove) any non-ASCII characters from the string, based on:
[^\p{ASCII}]
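Putting the two steps together, here is a minimal sketch of how this could look. The class name AccentFolding and its two methods are made up for illustration; only Normalizer and Pattern come from the JDK. The accented letters are written as \u escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and methods are illustrative, not part of the JDK.
final class AccentFolding {
    // After NFD decomposition the accents are separate combining marks,
    // which lie outside ASCII and are therefore matched by this pattern.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");

    // Decomposes the text and drops everything that is not ASCII.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return NON_ASCII.matcher(decomposed).replaceAll("");
    }

    // Compares two strings after folding both to their unaccented form.
    static boolean matchesIgnoringAccents(String candidate, String query) {
        return stripAccents(candidate).equals(stripAccents(query));
    }

    public static void main(String[] args) {
        String[] variants = {
            "Beyonc\u00e9",       // e with acute
            "Bey\u00f4nce",       // o with circumflex
            "B\u00e8y\u00f6nce"   // e with grave, o with diaeresis
        };
        for (String variant : variants) {
            System.out.println(matchesIgnoringAccents(variant, "Beyonce"));
        }
    }
}
```

Note that [^\p{ASCII}] also removes non-ASCII letters that have no decomposition (ø, for example), so if you only want to drop the accents, \p{M} after normalization is the narrower choice. Also, the folded string is shorter than the original whenever something was removed, so for highlighting you would still need to map match positions back to the original text.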