
Raku/Perl6: How to restrict match method to capture group?

Tags: regex, raku

I am attempting to match three letters, and three letters only, from file names in the 1000 Genomes project. From a string like ethnicity_lists/PEL.txt I should get only PEL; the rest of the string is irrelevant.

my $p1-label = @populations[$p1-index].match(/^ethnicity_lists\/(<[A..Y]>)**3\.txt$/);

The problem is that $p1-label includes the entire string beyond the capture group.

I have put the parentheses around <[A..Y]> to emphasize that I only want that group.

I have looked through https://docs.perl6.org/routine/match.

I try to be as specific as possible to prevent any possible errors, which is why I include the entire string.

If I do the Perl5-style match, I have to join the captures to get the string:

if @populations[$p1-index] ~~ /^ethnicity_lists\/(<[A..Y]>)**3\.txt$/ {
    put $0.join(''); # strange that this outputs an array instead of a string
}

I've tried all of the adverbs for the match method but none do the necessary job.

How can I restrict a match method to only the capture group in the regex?

con asked Dec 01 '22 13:12


2 Answers

The match method returns a Match object that comprises all the information about your match. If you do:

my $p1-label = @populations[$p1-index].match(/^ethnicity_lists\/(<[A..Y]>)**3\.txt$/);
say $p1-label;

You'll see it includes three items labelled 0, because the **3 quantifier sits outside the parentheses:

「ethnicity_lists/PEL.txt」
 0 => 「P」
 0 => 「E」
 0 => 「L」

Getting the Str representation of the Match object gives you the complete match. But you can also ask for its [0] index:

say $p1-label[0];
[「P」 「E」 「L」]
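To make the list behaviour concrete, here is a minimal self-contained sketch. The `$path` literal is a hypothetical stand-in for `@populations[$p1-index]`; everything else follows the regex from the question, with whitespace added for readability (whitespace is not significant inside a Raku regex by default).

```raku
# Hypothetical sample input standing in for @populations[$p1-index].
my $path = 'ethnicity_lists/PEL.txt';

# The quantifier applies to the whole group, so the group matches three
# times and position 0 holds a list of three one-letter Match objects.
my $m = $path.match(/ ^ ethnicity_lists \/ (<[A..Y]>) ** 3 \.txt $ /);

say $m[0].elems;    # 3
say $m[0].join;     # PEL
say $m[0][1].Str;   # E
```

This is why `$0.join('')` was needed in the question: `$0` is a list of matches rather than a single one.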

Let's fix the regular expression by putting the quantifier inside the parentheses and see what we get.

my $p1-label = @populations[$p1-index].match(/^ethnicity_lists\/(<[A..Y]>**3)\.txt$/);
say $p1-label;
「ethnicity_lists/PEL.txt」
 0 => 「PEL」

Looking better. Now if you only want the PEL bit you've got a few options. You can just get the Str representation of the first item in the match:

my $p1-label = @populations[$p1-index].match(/^ethnicity_lists\/(<[A..Y]>**3)\.txt$/)[0].Str;
say $p1-label;
PEL

Note that if I don't coerce it to a Str, I get the Match object of the submatch (which can be useful, but is not what you need here).
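The difference between the submatch and its string form can be seen in a short sketch. The `$path` literal and the variable names `$sub` and `$label` are assumptions made for illustration only.

```raku
# Hypothetical sample input standing in for @populations[$p1-index].
my $path  = 'ethnicity_lists/PEL.txt';
my $match = $path.match(/ ^ ethnicity_lists \/ (<[A..Y]> ** 3) \.txt $ /);

my $sub   = $match[0];       # still a Match object, with position info
my $label = $match[0].Str;   # a plain string

say $sub.^name;              # Match
say $label.^name;            # Str
say $sub.from, ' ', $sub.to; # 16 19 (offsets of PEL in the full string)
```

Keeping the Match is handy when you need offsets; coercing with .Str is simpler when you only want the label.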

Or you can use zero-width assertions and skip the capturing altogether:

my $p1-label = @populations[$p1-index].match(/<?after ^ethnicity_lists\/><[A..Y]>**3<?before \.txt$>/).Str;
say $p1-label;
PEL

Here we are matching 3 upper case letters that occur after the expression ^ethnicity_lists\/ and before \.txt$ but they aren't included in the match itself.

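Or, as pointed out by @raiph, you can use the capture markers <( ... )> to tell the system this is the only bit you want. They narrow the overall match rather than creating a positional capture, so no [0] is needed:

my $p1-label = @populations[$p1-index].match(/^ethnicity_lists\/<(<[A..Y]>**3)>\.txt$/).Str;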
say $p1-label;
PEL

This last one is probably best.

Hope that helps.

Scimon Proctor answered Dec 16 '22 00:12


@Holli's answer makes a key point and @Scimon's digs in deeper about why you got the result you got but...

If you doubly emphasize what part you want with <( ... )> instead of just ( ... ), just that part becomes the overall match object.

And if you use put instead of say you get the machine-friendly stringification (same as .Str, so in this case PEL) instead of the human-friendly stringification (same as .gist, so in this case it would have been 「PEL」):

put 'fooPELbar' ~~ / foo  ( ... )  bar /; # fooPELbar
put 'fooPELbar' ~~ / foo <( ... )> bar /; # PEL
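Applied to the path from the question, this could look something like the sketch below. The sub name `population-label` and the sample paths are hypothetical, and returning Nil for a non-matching path is just one possible design choice.

```raku
# Hypothetical helper: returns the three-letter label, or Nil when the
# path does not have the expected shape.
sub population-label(Str $path) {
    # <( and )> mark the part that becomes the overall match;
    # the anchors and surrounding text must still match.
    my $m = $path.match(/ ^ ethnicity_lists \/ <( <[A..Y]> ** 3 )> \.txt $ /);
    return $m ?? $m.Str !! Nil;
}

put population-label('ethnicity_lists/PEL.txt');          # PEL
say population-label('ethnicity_lists/pel.txt').defined;  # False
```

Because the whole string is still anchored with ^ and $, the strictness the question asked for is kept while only the label is returned.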
raiph answered Dec 15 '22 22:12