
How to combine multiple Unicode properties in perl regex?

Tags: unicode, perl

I have this script:

use 5.014;
use warnings;

use utf8;    
binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

my @greek = ($str =~ /\p{Greek}/g);
say "Greek: @greek";

my @upper = ($str =~ /\p{Upper}/g);
say "Upper: @upper";

#my @upper_greek = ($str =~ /\p{Upper+Greek}/); #wrong.
#say "Upper+Greek: @upper_greek";

Is it possible to combine multiple Unicode properties? E.g., how can I select only the characters that are both Upper and Greek, to get the wanted output:

Greek: Φ Ψ Ω φ ψ ω
Upper: X Y Z Φ Ψ Ω
Upper+Greek: Φ Ψ Ω      #<-- how to get this?
Asked by jm666 on Apr 05 '17 at 18:04


2 Answers

We want to perform an AND operation, so we can't use

/(?:\p{Greek}|\p{Upper})/         # Greek OR Upper

or

/[\p{Greek}\p{Upper}]/            # Greek OR Upper
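
To see why neither OR form is enough, here is a small sketch that runs the character-class version against the string from the question (the variable name @either is only illustrative). It returns every character that is Greek or uppercase, so the Latin capitals and the lowercase Greek letters come along too.

```perl
use 5.014;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $str = "XYZ ΦΨΩ zyz φψω";

# A bracketed class is a union: a character matches if it has EITHER property.
my @either = $str =~ /[\p{Greek}\p{Upper}]/g;
say "Greek OR Upper: @either";    # prints: Greek OR Upper: X Y Z Φ Ψ Ω φ ψ ω
```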

Since 5.18, one can use regex sets.

/(?[ \p{Greek} & \p{Upper} ])/    # Greek AND Upper

This requires use experimental qw( regex_sets ); before 5.36. It's safe to add this and use the feature as far back as 5.18, when it was introduced as an experimental feature, because the feature hasn't changed since then.
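
As a minimal sketch, the regex-set form can be dropped into the script from the question like this. It assumes a Perl of at least 5.18 with the experimental module available (it ships with Perl from 5.20 on); the variable name @upper_greek is taken from the question.

```perl
use 5.018;
use warnings;
use utf8;
use experimental qw( regex_sets );    # only needed before 5.36
binmode STDOUT, ':encoding(UTF-8)';

my $str = "XYZ ΦΨΩ zyz φψω";

# Inside (?[ ... ]) the & operator intersects two classes;
# whitespace inside the construct is ignored.
my @upper_greek = $str =~ /(?[ \p{Greek} & \p{Upper} ])/g;
say "Upper+Greek: @upper_greek";      # prints: Upper+Greek: Φ Ψ Ω
```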


There are some other approaches that can be used in older versions of Perl, but they are indisputably harder to read.

One way of achieving AND in a regex is using lookarounds.

/\p{Greek}(?<=\p{Upper})/         # Greek AND Upper

Another way of getting an AND is to negate an OR. De Morgan's laws tell us

NOT( Greek AND Upper )  ⇔  NOT(Greek) OR NOT(Upper)

so

Greek AND Upper  ⇔  NOT( NOT(Greek) OR NOT(Upper) )

This gives us

/[^\P{Greek}\P{Upper}]/           # Greek AND Upper

This is more efficient than using a lookbehind.
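
A quick way to convince yourself that the lookbehind and the negated class select the same characters is to run both against the string from the question. This is only a sketch: the variable names are illustrative, and it checks equivalence on this one string rather than measuring speed.

```perl
use 5.014;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $str = "XYZ ΦΨΩ zyz φψω";

# Match a Greek character, then look back at it and require Upper as well.
my @lookbehind = $str =~ /\p{Greek}(?<=\p{Upper})/g;

# Exclude everything that is not Greek or not Upper (De Morgan).
my @negated = $str =~ /[^\P{Greek}\P{Upper}]/g;

say "Lookbehind:    @lookbehind";    # prints: Lookbehind:    Φ Ψ Ω
say "Negated class: @negated";       # prints: Negated class: Φ Ψ Ω
```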

Answered by ikegami on Nov 12 '22 at 11:11


This works in 5.14.0 as well:

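# User-defined property; the name must begin with "In" or "Is".
sub InUpperGreek {
    # "+" includes a property, "&" intersects with what has been built so far.
    return <<'END';
+utf8::Greek
&utf8::Upper
END
}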

my @upper_greek = ($str =~ /\p{InUpperGreek}/g);
say "Upper Greek: @upper_greek";

Not sure if that's simpler. :) For more information on how this works, see the perlunicode documentation on user-defined character properties.
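
For completeness, here is a self-contained sketch of the same idea. It assumes the property sub lives in package main, and the names $run and $rest are only illustrative. Once defined, the property can be used wherever a built-in one can, for example with a quantifier or in a substitution.

```perl
use 5.014;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Same property as above: Greek intersected with Upper.
sub InUpperGreek {
    return <<'END';
+utf8::Greek
&utf8::Upper
END
}

my $str = "XYZ ΦΨΩ zyz φψω";

# With a quantifier, the property grabs the whole run of upper-case Greek.
my ($run) = $str =~ /(\p{InUpperGreek}+)/;
say "Run: $run";                      # prints: Run: ΦΨΩ

# In a substitution, it strips those characters and leaves the rest alone.
(my $rest = $str) =~ s/\p{InUpperGreek}//g;
say "Without upper Greek: $rest";
```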

Answered by Tanktalus on Nov 12 '22 at 11:11