regex match character with specific diacritic

Tags:

regex

unicode

perl

diacritics

Is there any way in a regex to specify a match for a character with a specific diacritic? Let's say a grave accent for example. The long way to do this is to go to the Wikipedia page on the grave accent, copy all of the characters it shows, then make a character class out of them:

/[àầằèềḕìǹòồṑùǜừẁỳ]/i

That's quite tedious. I was hoping for a Unicode property like \p{hasGraveAccent}, but I can't find anything like that. Searching for a solution only comes up with questions from people trying to match characters while ignoring diacritics, which involves performing a normalization of some kind, which is not what I want.


asked Feb 13 '16 02:02

Nate Glenn

1 Answer

It's possible with some limitations.

#!perl

use strict;
use warnings;

use Unicode::Normalize;  # provides NFD
use charnames ':full';   # enables \N{...} character names on Perl older than 5.16
use utf8;  # source is utf-8

binmode(STDOUT, ':encoding(UTF-8)'); # print in utf-8

my $utf8_string = 'xàaâèaêòͤ';

my $nfd_string = NFD($utf8_string); # decompose

my @chars_with_grave = $nfd_string =~
  m/
    (
      \p{L}           # one letter
      \p{M}*          # 0 or more marks
      \N{COMBINING GRAVE ACCENT}
      \p{M}*          # 0 or more marks
    )
  /xmsg;

print join(', ',@chars_with_grave), "\n";

This prints

$ perl utf_match_grave.pl 
à, è, òͤ

NOTE: The characters in the edit area are correctly displayed as combined, but Stack Overflow renders them wrongly separated.

The pattern requires a letter (\p{L}) as the base character; change that part of the regex to match other base characters. \p{M} matches any kind of mark, which may be broader than what you want; a narrower class such as \p{Mn} (nonspacing marks) may be a better fit.
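One possible way to address the limitations above, offered as a sketch rather than a definitive fix: accept any non-mark character as the base (\P{M}), restrict the surrounding marks to nonspacing marks (\p{Mn}), and recompose each match with NFC so the results look like the input. The helper name chars_with_mark and its two-argument interface are assumptions made for illustration; they are not part of the original answer.

```perl
#!perl
use strict;
use warnings;
use utf8;                            # source is utf-8
use charnames ':full';               # enables \N{...} names on older Perls
use Unicode::Normalize qw(NFD NFC);

binmode(STDOUT, ':encoding(UTF-8)'); # print in utf-8

# Hypothetical helper (not from the original answer): return every
# base character, together with its marks, that carries the given
# combining mark. The mark is passed as a one-character string.
sub chars_with_mark {
    my ($text, $mark) = @_;

    my @found = NFD($text) =~ m/
        (
          \P{M}        # one base character: anything that is not a mark
          \p{Mn}*      # 0 or more nonspacing marks
          \Q$mark\E    # the requested combining mark
          \p{Mn}*      # 0 or more nonspacing marks
        )
    /xg;

    # Recompose so that precomposed input comes back precomposed.
    return map { NFC($_) } @found;
}

my $grave = "\N{COMBINING GRAVE ACCENT}";
print join(', ', chars_with_mark('xàaâèaêòͤ', $grave)), "\n";
```

For the sample string from the answer this should return the same three matches, only in composed (NFC) form. Because \P{M} accepts any non-mark character, a digit or punctuation character followed by the mark also matches; use \p{L} again if only letters are wanted.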


answered Oct 19 '22 21:10

Helmut Wollmersdorfer
