
Regular expression to match smallest only

Tags: regex, perl, pcre

I have an expression like c.{0,2}?m and a string like "abcemtcmncefmf". Currently it matches three substrings: cem, cm and cefm (see here). But I would like to match only the smallest of these, in this case cm.
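To make the starting point concrete, here is a minimal sketch in Python's `re` module (an assumption for illustration only; the question itself targets PCRE through MariaDB). It lists every match of the pattern, shows what a first-match-only function would return, and picks the shortest candidate by hand. The variable names are hypothetical.

```python
import re

# Pattern and subject string taken from the question.
pattern = re.compile(r"c.{0,2}?m")
subject = "abcemtcmncefmf"

# With global matching, all three candidates are found, left to right.
all_matches = pattern.findall(subject)

# A first-match-only function (like REGEXP_SUBSTR) stops at the leftmost one.
first_match = pattern.search(subject).group(0)

# The desired result is the shortest candidate.
shortest = min(all_matches, key=len)

print(all_matches)  # ['cem', 'cm', 'cefm']
print(first_match)  # cem
print(shortest)     # cm
```

The leftmost match is returned even though a shorter one exists later in the string, which is the behavior the question wants to avoid.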

My problem is that I don't have global match support, only the first match, because I'm using the MariaDB REGEXP_SUBSTR() function. My current solution is a stored procedure that I wrote for this, but it is 10 times slower than a plain regular expression in simple cases.

I also tried something like (cm|c.{0,1}?m|c.{0,2}?m), but it didn't work: at each starting position the engine takes the first alternative that matches, instead of trying each alternative in turn over the whole subject string.
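The failure of the plain alternation can be reproduced with a short sketch, again using Python's `re` as a stand-in for PCRE (an assumption; both are backtracking engines that try every alternative at one position before advancing). The names below are illustrative.

```python
import re

subject = "abcemtcmncefmf"

# Unanchored alternation from the question, shortest form listed first.
alternation = re.compile(r"(cm|c.{0,1}?m|c.{0,2}?m)")

# At the first 'c' (index 2), 'cm' fails but the second alternative
# matches 'cem', so the search ends there and never reaches 'cm' at index 6.
result = alternation.search(subject).group(1)

print(result)  # cem
```

Ordering the alternatives by length does not help here, because the order of alternatives is only consulted after the starting position has been fixed.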

I know that regular expressions (PCRE) have some black magic features, but I haven't found anything that solves my problem.


  • Note: I'm already using a non-greedy quantifier (.{0,2}?) in my current pattern;
  • The question "Regular expression to find smallest possible match" does not cover my problem.
David Rodrigues asked Jan 29 '16 18:01




1 Answer

You can simply use an alternation in a branch reset group:

/^(?|.*(cm)|.*(c.m)|.*(c..m))/s

(The result is in group 1)

or like this:

/^.*\Kcm|^.*\Kc.m|^.*\Kc..m/s

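The first successful branch wins. Because every branch is anchored at the start of the string, the engine tries the shortest form (cm) against the whole string before falling back to c.m and then c..m.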
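For readers who want to experiment outside PCRE, the same idea can be sketched in Python. This is an emulation under stated assumptions, not the answer's exact pattern: Python's standard `re` module supports neither branch reset groups `(?|...)` nor `\K`, so the sketch uses three ordinary capture groups and returns whichever one participated in the match. The helper name `shortest_match` is hypothetical.

```python
import re

# Anchored alternation: each branch scans the whole string (DOTALL mirrors /s)
# before the next, longer branch is tried.
anchored = re.compile(r"^(?:.*(cm)|.*(c.m)|.*(c..m))", re.DOTALL)


def shortest_match(text):
    """Return the shortest c...m match (0 to 2 characters between), or None."""
    m = anchored.search(text)
    if m is None:
        return None
    # Exactly one group is set: the one belonging to the branch that succeeded.
    return next(g for g in m.groups() if g is not None)


print(shortest_match("abcemtcmncefmf"))  # cm
print(shortest_match("abcemt"))          # cem
print(shortest_match("xyz"))             # None
```

Since `.*` is greedy, each branch finds the last occurrence of its form in the string; when several equally short matches exist, the rightmost one is returned.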

Casimir et Hippolyte answered Sep 22 '22 07:09