Is there any way in a regex to specify a match for a character with a specific diacritic? Let's say a grave accent for example. The long way to do this is to go to the Wikipedia page on the grave accent, copy all of the characters it shows, then make a character class out of them:
/[àầằèềḕìǹòồṑùǜừẁỳ]/i
That's quite tedious. I was hoping for a Unicode property like \p{hasGraveAccent}, but I can't find anything like that. Searching for a solution only turns up questions from people trying to match characters while ignoring diacritics, which involves some kind of normalization and is not what I want.
It's possible with some limitations.
#!perl
use strict;
use warnings;
use Encode;
use Unicode::Normalize;
use charnames ':full'; # enables \N{...} character names (loaded implicitly since Perl 5.16)
use utf8; # source is utf-8
binmode(STDOUT, ":utf8"); # print in utf-8
my $utf8_string = 'xàaâèaêòͤ';
my $nfd_string = NFD($utf8_string); # decompose
my @chars_with_grave = $nfd_string =~
    m/
        (
            \p{L}                       # one letter
            \p{M}*                      # 0 or more marks
            \N{COMBINING GRAVE ACCENT}
            \p{M}*                      # 0 or more marks
        )
    /xmsg;
print join(', ',@chars_with_grave), "\n";
This prints
$ perl utf_match_grave.pl
à, è, òͤ
NOTE: The characters are displayed correctly as combined in the edit area, but Stack Overflow renders them incorrectly as separated.
The pattern requires a letter as the base character; change \p{L} to match other base characters. The mark class \p{M} is probably broader than what you want and could be tightened.
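One possible way to relax the base-letter restriction, offered only as a sketch: decompose the text, split it into extended grapheme clusters with \X, and keep the clusters that contain the combining mark. The helper name clusters_with_mark is hypothetical and not part of the code above, and the approach still assumes the accented characters have a canonical decomposition.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (an assumption, not from the answer above):
# returns every grapheme cluster of $text whose decomposed form
# contains the combining mark $mark, whatever the base character is.
sub clusters_with_mark {
    my ($text, $mark) = @_;
    # \X matches one extended grapheme cluster (base plus its marks).
    return grep { index($_, $mark) >= 0 } (NFD($text) =~ /\X/g);
}

binmode(STDOUT, ':encoding(UTF-8)');

my $grave = "\x{0300}";    # COMBINING GRAVE ACCENT
# Input: x, a-grave, a, a-circumflex, e-grave, then the digit 1 with a grave.
my @hits = clusters_with_mark("x\x{E0}a\x{E2}\x{E8}1\x{0300}", $grave);

print scalar(@hits), " clusters carry a grave accent\n";
print join(', ', @hits), "\n";
```

Here the digit with a grave accent is reported alongside the two letters, because the check no longer depends on \p{L}. The returned clusters are in decomposed (NFD) form; pass them through NFC if composed output is preferred.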