How to combine multiple Unicode properties in perl regex?

Question

I have this script:

use 5.014;
use warnings;

use utf8;    
binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

my @greek = ($str =~ /\p{Greek}/g);
say "Greek: @greek";

my @upper = ($str =~ /\p{Upper}/g);
say "Upper: @upper";

#my @upper_greek = ($str =~ /\p{Upper+Greek}/); #wrong.
#say "Upper+Greek: @upper_greek";

Is it possible to combine multiple Unicode properties? E.g., how can I select only the characters that are both Upper and Greek, to get the wanted output:

Greek: Φ Ψ Ω φ ψ ω
Upper: X Y Z Φ Ψ Ω
Upper+Greek: Φ Ψ Ω      #<-- how to get this?

ikegami · Accepted Answer

We want to perform an AND operation, so we can't use

/(?:\p{Greek}|\p{Upper})/         # Greek OR Upper

or

/[\p{Greek}\p{Upper}]/            # Greek OR Upper

Since 5.18, one can use regex sets.

/(?[ \p{Greek} & \p{Upper} ])/    # Greek AND Upper

Before 5.36, this requires use experimental qw( regex_sets ); to silence the experimental warning. It's safe to add this and use the feature as far back as 5.18, where it was introduced as an experimental feature, since no change has been made to the feature since then.
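For completeness, here is a minimal sketch of the question's script using a regex set. It assumes the experimental module is available (it ships with Perl from 5.20 on; on 5.18 it would have to be installed from CPAN).

```perl
use 5.018;
use warnings;
use utf8;
use experimental qw( regex_sets );   # only needed before 5.36, harmless later
binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

# Inside (?[ ... ]) whitespace is ignored and & means set intersection.
my @upper_greek = $str =~ /(?[ \p{Greek} & \p{Upper} ])/g;
say "Upper+Greek: @upper_greek";
```

Regex sets also support + (union), - (difference) and ! (complement), so more involved combinations can be written the same way.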

There are some other approaches that can be used in older versions of Perl, but they are indisputably harder to read.

One way of achieving AND in a regex is using lookarounds.

/\p{Greek}(?<=\p{Upper})/         # Greek AND Upper
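As a rough illustration, the lookbehind above can be dropped into the question's script like this. The lookahead variant is an addition of this sketch, not part of the answer; it checks the second property before consuming the character instead of after.

```perl
use 5.014;
use warnings;
use utf8;
binmode STDOUT, ':utf8';

my $str = "XYZ ΦΨΩ zyz φψω";

# Lookbehind: consume a Greek character, then look back at it and require Upper.
my @behind = $str =~ /\p{Greek}(?<=\p{Upper})/g;

# Lookahead: require Upper at the current position, then consume a Greek character.
my @ahead = $str =~ /(?=\p{Upper})\p{Greek}/g;

say "lookbehind: @behind";
say "lookahead:  @ahead";
```

Both patterns test the same single character against the two properties, so they should select the same characters.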

Another way of getting an AND is to negate an OR. De Morgan's laws tell us

NOT( Greek AND Upper )  ⇔  NOT(Greek) OR NOT(Upper)

so

Greek AND Upper  ⇔  NOT( NOT(Greek) OR NOT(Upper) )

This gives us

/[^\P{Greek}\P{Upper}]/           # Greek AND Upper

This is more efficient than using a lookbehind.
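One way to convince yourself that the negated class really is an AND is to compare it, character by character, with testing the two properties separately. The sketch below does that; the sample range U+0000..U+05FF is an arbitrary choice made for this illustration.

```perl
use 5.014;
use warnings;

# Count characters for which the negated class disagrees with
# "matches \p{Greek} and also matches \p{Upper}".
my $mismatches = 0;
my @selected;
for my $cp (0 .. 0x5FF) {
    my $ch = chr $cp;
    my $both    = ($ch =~ /\p{Greek}/ && $ch =~ /\p{Upper}/) ? 1 : 0;
    my $negated = ($ch =~ /[^\P{Greek}\P{Upper}]/)           ? 1 : 0;
    $mismatches++ if $both != $negated;
    push @selected, $cp if $negated;
}

say "mismatches: $mismatches";
say "characters selected: ", scalar @selected;
```

The class [^\P{Greek}\P{Upper}] rejects anything that is non-Greek or non-Upper, so what remains is exactly the set of characters that have both properties.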

Tanktalus · Answer

This works in 5.14.0 as well:

sub InUpperGreek {
    return <<'END'
+utf8::Greek
&utf8::Upper
END
}

my @upper_greek = ($str =~ /\p{InUpperGreek}/g);
say "Upper Greek: @upper_greek";

Not sure if that's simpler. :) For more information on how this works, see the perlunicode documentation on user-defined character properties.
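Put together with the question's string, the fragment above might look like the following self-contained sketch. Two points from perlunicode are worth keeping in mind: the sub name has to start with In or Is, and in the returned list a leading + adds a property while a leading & intersects with it.

```perl
use 5.014;
use warnings;
use utf8;
binmode STDOUT, ':utf8';

# User-defined property: start with everything in Greek (+),
# then keep only what is also in Upper (&).
sub InUpperGreek {
    return <<'END';
+utf8::Greek
&utf8::Upper
END
}

my $str = "XYZ ΦΨΩ zyz φψω";
my @upper_greek = $str =~ /\p{InUpperGreek}/g;
say "Upper Greek: @upper_greek";
```

Because the sub is defined in package main here, \p{InUpperGreek} is found without a package qualifier when the regex is compiled in the same package.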
