Regular expressions: Ensuring b doesn't come between a and c

Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.

I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.

Is this possible with regular expressions?

Ram Rachum asked May 15 '16 15:05



3 Answers

When your left- and right-hand delimiters are single characters, this is easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use:

a[^abc]*c

This is the same technique you use when you want to make sure there is a b between the closest a and c:

a[^abc]*b[^ac]*c
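As a quick check of both single-character patterns, here is a minimal Python sketch. The sample string is made up for illustration and is not part of the question.

```python
import re

# Made-up sample text, not from the question.
text = "a--c a-b-c axxa..c"

# a ... c with no a, b or c in between.
no_b = re.findall(r"a[^abc]*c", text)

# Closest a ... c pair with at least one b in between.
with_b = re.findall(r"a[^abc]*b[^ac]*c", text)

print(no_b)    # ['a--c', 'a..c']
print(with_b)  # ['a-b-c']
```

Note that "axxa..c" yields only "a..c" in the first case: the class excludes a, so a match always starts at the a closest to the c.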

When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:

abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz

To make sure it matches across lines, pass the re.DOTALL flag when compiling the regex, so that . also matches newline characters.
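The effect of the flag can be seen with a small sketch. The three-line sample string is an assumption made for illustration only.

```python
import re

text = "abc\n123\nxyz"  # made-up sample spanning three lines
pattern = r"abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz"

single_line = re.findall(pattern, text)            # . stops at newlines
multi_line = re.findall(pattern, text, re.DOTALL)  # . also matches \n

print(single_line)  # []
print(multi_line)   # ['abc\n123\nxyz']
```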

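Note that for better performance with such a heavy pattern, you should consider unrolling it. The tempered token runs a lookahead before every single character, whereas an unrolled pattern consumes runs of harmless characters with a negated character class and only runs the negative lookahead at characters that could start one of the delimiters.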
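One possible unrolling is sketched below. It is an assumption about what such a pattern could look like, not necessarily the exact pattern this answer had in mind. Because negated character classes match newlines on their own, this version does not need re.DOTALL.

```python
import re

tempered = re.compile(
    r"abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz", re.DOTALL
)

# One possible unrolling (an assumption, not the answer's own pattern):
# [^ax1]* consumes characters that cannot start abc, xyz or 123,
# so the lookahead only runs when the next character is a, x or 1.
unrolled = re.compile(
    r"abc[^ax1]*(?:(?!abc|xyz|123)[ax1][^ax1]*)*"
    r"123[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*xyz"
)

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(unrolled.findall(s))
# ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
print(unrolled.findall(s) == tempered.findall(s))  # True
```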

Pattern details:

  • abc - a literal string abc
  • (?:(?!abc|xyz|123).)* - zero or more characters, none of which is the starting point of an abc, xyz or 123 character sequence
  • 123 - a literal string 123
  • (?:(?!abc|xyz).)* - zero or more characters, none of which is the starting point of an abc or xyz character sequence
  • xyz - a trailing substring xyz
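The same structure carries over to other delimiters. The sketch below builds the pattern from three arbitrary literal strings; the helper name tempered_pattern and the sample text are hypothetical and only serve as illustration.

```python
import re

def tempered_pattern(start, middle, end):
    """Hypothetical helper: build the tempered-token pattern for literal strings."""
    # re.escape keeps any regex metacharacters in the inputs literal.
    s, m, e = (re.escape(t) for t in (start, middle, end))
    return re.compile(
        rf"{s}(?:(?!{s}|{e}|{m}).)*{m}(?:(?!{s}|{e}).)*{e}",
        re.DOTALL,
    )

p = tempered_pattern("<<", "KEY", ">>")
found = p.findall("<< << KEY >> << none >> <<\nKEY>>")
print(found)  # ['<< KEY >>', '<<\nKEY>>']
```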

[Diagram of the pattern. If re.S is used, . matches any character, including a newline.]

See the Python demo:

import re
# re.DOTALL lets . match newlines, so a match may span several lines
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
Wiktor Stribiżew answered Oct 11 '22 17:10


Using PCRE, a solution would be the following. It uses the m flag; if you want to match only whole lines, add ^ and $ at the beginning and end respectively:

abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
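To see how this pattern behaves, here is a hedged sketch that runs it through Python's re module instead of PCRE (the syntax used is the same in both). The m flag only changes ^ and $, so it is left out here because the pattern has no anchors. The helper name matches and the one-line sample are assumptions for illustration.

```python
import re

# The pattern from this answer, unchanged.
p = re.compile(r"abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz")

def matches(text):
    # finditer + group(0): findall would return the capture groups instead.
    return [m.group(0) for m in p.finditer(text)]

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(matches(s))  # ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']

# Two valid spans on one line: the first lookahead sees "xyz ... 123"
# further along the line and rejects the first abc.
print(matches("abc 123 xyz abc 123 xyz"))  # ['abc 123 xyz']
```

Since . does not match newlines here, the lookaheads only scan to the end of the current line, so the pattern is best suited to input with at most one wanted span per line.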

Jorge Campos answered Oct 11 '22 18:10


The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:

where val like 'abc%123%xyz' and
      val not like 'abc%abc%' and
      val not like '%xyz%xyz'

I imagine something quite similar is simple to do in other environments.
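As a rough illustration of the same idea outside SQL, the three LIKE conditions can be written with plain string operations in Python. The function name is_clean_span is hypothetical. Like the SQL version, it validates a candidate string and does not search a larger text for substrings.

```python
def is_clean_span(val):
    """Hypothetical helper mirroring the three LIKE conditions above."""
    return (
        val.startswith("abc")
        and val.endswith("xyz")
        and "123" in val[3:-3]        # like 'abc%123%xyz'
        and "abc" not in val[3:]      # not like 'abc%abc%'
        and "xyz" not in val[:-3]     # not like '%xyz%xyz'
    )

print(is_clean_span("abc 123 xyz"))       # True
print(is_clean_span("abc abc 123 xyz"))   # False
print(is_clean_span("abc text xyz xyz"))  # False
```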

Gordon Linoff answered Oct 11 '22 17:10