
Regex to find all variants of a certain character inside a text

Tags: java, regex

I am trying to find Unicode variants of a user-entered string in a text so that I can highlight them. E.g. if the user enters "Beyonce", I'd like to highlight all variants like "Beyoncé", "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is creating a regex by replacing each character of the input string with a character group, like this:

"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"

But this seems to be a very tedious and error-prone way of doing it. What I am basically looking for is a regex character group that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is there something like this available in Java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.

Jan Thomä asked Nov 05 '22 01:11


1 Answer

There are several ways an accented character can be represented. There's a good example in the Javadoc of java.text.Normalizer:

For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

or as two separate characters (the "decomposed" form):

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT 

The second form makes it relatively easy to get at the unaccented base character, and fortunately Normalizer can help you here:

String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD); // NFD = "Canonical decomposition"
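
To make the decomposition concrete, here is a minimal sketch that runs the A-acute example from the Javadoc through Normalizer. The class name NormalizeDemo and the helper decompose are made up for illustration; only java.text.Normalizer and Normalizer.Form.NFD come from the JDK.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Hypothetical helper: splits composed characters into
    // base letter + combining marks (canonical decomposition, NFD).
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = decompose(composed);

        // One char before, two chars after: 'A' followed by U+0301.
        System.out.println(composed.length());                 // 1
        System.out.println(decomposed.length());               // 2
        System.out.println(decomposed.equals("\u0041\u0301")); // true
    }
}
```

Since Java strings are immutable, normalize returns a new string, so the result has to be assigned or used directly.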

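After decomposition the accents are separate, non-ASCII combining characters, so you can then use a regex to ignore (or remove) them. The class below matches every non-ASCII character, including letters that have no decomposition (such as ß); \p{M} is narrower and matches only combining marks: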

[^\p{ASCII}]
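
Applied to the highlighting problem from the question, the idea could look roughly like the sketch below. It is only one possible approach under a few assumptions: it uses \p{M} rather than [^\p{ASCII}] so that only combining marks are touched, it matches case-sensitively, and the names VariantMatcher, stripAccents, variantPattern and findVariants are hypothetical. Both the query and the text are decomposed first; the pattern then allows any number of combining marks after each base character of the query.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VariantMatcher {
    // Removes accents: decompose (NFD), then drop the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Builds a pattern that accepts optional combining marks after
    // every code point of the (accent-stripped) query.
    static Pattern variantPattern(String query) {
        StringBuilder regex = new StringBuilder();
        stripAccents(query).codePoints().forEach(cp ->
            regex.append(Pattern.quote(new String(Character.toChars(cp))))
                 .append("\\p{M}*"));
        return Pattern.compile(regex.toString());
    }

    // Returns all variants of the query found in the text,
    // recomposed (NFC) for display.
    static List<String> findVariants(String query, String text) {
        String decomposedText = Normalizer.normalize(text, Normalizer.Form.NFD);
        Matcher m = variantPattern(query).matcher(decomposedText);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(Normalizer.normalize(m.group(), Normalizer.Form.NFC));
        }
        return hits;
    }
}
```

Calling VariantMatcher.findVariants("Beyonce", text) returns the accented spellings as well as the plain one. Note that the match offsets (m.start(), m.end()) refer to the decomposed text, so for highlighting you would either display the decomposed string or map the offsets back to the original.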
Chris Lercher answered Nov 09 '22 15:11