
regex match character with specific diacritic

Is there any way in a regex to specify a match for a character with a specific diacritic? Let's say a grave accent for example. The long way to do this is to go to the Wikipedia page on the grave accent, copy all of the characters it shows, then make a character class out of them:

/[àầằèềḕìǹòồṑùǜừẁỳ]/i

That's quite tedious. I was hoping for a Unicode property like \p{hasGraveAccent}, but I can't find anything like that. Searching only turns up questions from people who want to match characters while ignoring diacritics; that involves some kind of normalization and is not what I want.

Nate Glenn asked Feb 13 '16 02:02




1 Answer

It's possible, with some limitations: decompose the string to NFD, then match a base letter followed by a combining grave accent.

#!perl

use strict;
use warnings;

use Unicode::Normalize;    # provides NFD()
use charnames ':full';     # named characters in \N{...} (loaded automatically since Perl 5.16)
use utf8;  # source is utf-8

binmode(STDOUT, ":utf8"); # print in utf-8

my $utf8_string = 'xàaâèaêòͤ';

my $nfd_string = NFD($utf8_string); # decompose

my @chars_with_grave = $nfd_string =~
  m/
    (
      \p{L}           # one letter
      \p{M}*          # 0 or more marks
      \N{COMBINING GRAVE ACCENT}
      \p{M}*          # 0 or more marks
    )
  /xmsg;

print join(', ',@chars_with_grave), "\n";

This prints

$ perl utf_match_grave.pl 
à, è, òͤ

NOTE: The characters in the edit area are displayed correctly as combined, but Stack Overflow renders them wrongly separated.

The pattern requires a letter (\p{L}) as the base character; change the regex if you need other base characters. \p{M} matches any mark, which may be broader than what you want, so that part could be improved.
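One way to relax both limitations is to split the string into extended grapheme clusters with \X and then check each cluster's decomposition for the mark. The following is only a sketch under that assumption, not part of the original answer: the helper name clusters_with_mark and the sample string are made up for illustration, and the mark is passed as a code point (U+0300) so the example does not depend on charnames.

```perl
#!perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':encoding(UTF-8)');   # print in UTF-8

# Hypothetical helper (an assumption, not from the answer above):
# return every grapheme cluster of $string whose canonical
# decomposition contains the combining character $mark.
sub clusters_with_mark {
    my ($string, $mark) = @_;
    my @hits;
    for my $cluster ($string =~ /\X/g) {          # one grapheme cluster at a time
        push @hits, NFC($cluster)                 # recompose for output
            if index(NFD($cluster), $mark) >= 0;  # mark present after decomposition?
    }
    return @hits;
}

my $grave = "\x{0300}";   # COMBINING GRAVE ACCENT

# x, a-grave, a, a-circumflex, e-grave, u-diaeresis-grave, digit 1 + grave
my $sample = "x\x{E0}a\x{E2}\x{E8}\x{1DC}1\x{300}";

print join(', ', clusters_with_mark($sample, $grave)), "\n";
```

With this sample the sketch should report à, è, ǜ and the digit 1 carrying a grave accent. The base character no longer has to be a letter, and a stacked mark (as in ǜ) is found without spelling out \p{M}* on either side. The trade-off is that the test runs per cluster in Perl code rather than inside a single regex.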

Helmut Wollmersdorfer answered Oct 19 '22 21:10