
Count the number of matches of a particular character in a string matched by a regex wildcard

Can I keep a count of each different character matched, within the regex itself?

Suppose the regex looks like />(.*)[^a]+/

Can I keep a count of the occurrences of, say, the letter p in the string captured by the group (.*)?

asked Aug 10 '12 14:08 by Gil


2 Answers

You would have to capture the string matched and process it separately.

This code demonstrates the approach:

use strict;
use warnings;

my $str = '> plantagenetgoosewagonattributes';

if ($str =~ />(.*)[^a]+/) {
  my $substr = $1;
  my %counts;
  $counts{$_}++ for $substr =~ /./g;
  print "'$_' - $counts{$_}\n" for sort keys %counts;
}

Output:

' ' - 1
'a' - 4
'b' - 1
'e' - 4
'g' - 3
'i' - 1
'l' - 1
'n' - 3
'o' - 3
'p' - 1
'r' - 1
's' - 1
't' - 5
'u' - 1
'w' - 1
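
If only one or two specific characters matter, a full tally is not needed. The following is a minimal sketch, reusing the sample string from the code above, that counts single characters in the captured text with tr///. The variable names $p_count and $t_count are illustrative.

```perl
use strict;
use warnings;

my $str = '> plantagenetgoosewagonattributes';
my ($p_count, $t_count) = (0, 0);

if ($str =~ />(.*)[^a]+/) {
  my $substr = $1;   # copy the capture before working on it

  # With an empty replacement list and no /d modifier, tr/// leaves
  # the string unchanged and returns how many characters it matched.
  $p_count = ($substr =~ tr/p//);
  $t_count = ($substr =~ tr/t//);
}

print "p: $p_count, t: $t_count\n";   # prints "p: 1, t: 5"
```

The counts agree with the 'p' and 't' rows of the output listed above.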
answered Oct 12 '22 23:10 by Borodin


Outside of the regex:

my $p_count = map /p/g, />(.*)[^a]/;
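
To unpack that one-liner: the match runs against $_ and, in list context, returns the text captured by (.*). map then applies /p/g to that text, producing one element per p found, and assigning the map to a scalar yields the number of elements. A minimal sketch, assuming a made-up sample string in $_:

```perl
use strict;
use warnings;

# Hypothetical sample input; both matches below operate on $_.
$_ = '> apple pie toppings';

# Outer match: list context, returns the capture ' apple pie topping'
# (the final 's' is consumed by [^a]).
# Inner /p/g: list context inside map, returns one 'p' per occurrence.
# Scalar assignment: map returns the total number of elements generated.
my $p_count = map /p/g, />(.*)[^a]/;

print "$p_count\n";   # prints 5
```

If the outer match fails, it returns an empty list, so $p_count ends up as 0.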

Self-contained:

local our $p_count;
/
   (?{ 0 })
   >
   (?: p (?{ $^R + 1 })
   |   [^p]
   )*
   [^a]
   (?{ $p_count = $^R; })
/x;
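
For a runnable view of the self-contained version, here is a minimal sketch that applies the same pattern to a made-up sample string in $_. It relies on $^R holding the result of the most recent (?{ }) block and on Perl restoring that value when the engine backtracks.

```perl
use strict;
use warnings;

# Hypothetical sample input; the match below operates on $_.
$_ = '> apple pie toppings';

local our $p_count;
/
   (?{ 0 })                 # start the running count at 0
   >
   (?: p (?{ $^R + 1 })     # each p adds one to the running count
   |   [^p]                 # any other character leaves it unchanged
   )*
   [^a]
   (?{ $p_count = $^R; })   # copy the final count out of the regex
/x;

print "$p_count\n";
```

Here the group consumes ' apple pie topping' and the trailing [^a] takes the final s, so the count should come out as 5, matching the result of the map approach on the same input.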

In both cases, you can easily expand this to count all letters. For example,

my %counts;
if (my ($seq) = />(.*)[^a]/) {
   ++$counts{$_} for split //, $seq;
}

my $p_count = $counts{'p'};
answered Oct 12 '22 23:10 by ikegami