Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
When your left- and right-hand delimiters are single characters, it can be easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use (demo):
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c (demo):
a[^abc]*b[^ac]*c
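The two single-character patterns above can be checked with a short Python sketch. The sample string is made up for illustration and is not from the original demos.

```python
import re

# Closest a...c pair with no a, b, or c in between.
no_b = re.compile(r'a[^abc]*c')
# Closest a...c pair that must contain at least one b.
with_b = re.compile(r'a[^abc]*b[^ac]*c')

sample = "a--c a-b-c aa-b-c a-b-b-c"  # hypothetical test input

print(no_b.findall(sample))    # ['a--c']
print(with_b.findall(sample))  # ['a-b-c', 'a-b-c', 'a-b-b-c']
```

Note that in "aa-b-c" the match starts at the second a, because the negated class refuses to step over another a.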
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
See the regex demo
To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.
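To see what the flag changes, here is a minimal sketch with a made-up three-line input:

```python
import re

pattern = r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz'
text = "abc\n123\nxyz"  # hypothetical input spanning three lines

# Without re.DOTALL, "." does not match "\n", so nothing is found.
without_flag = re.findall(pattern, text)
# With re.DOTALL, "." also matches "\n".
with_flag = re.findall(pattern, text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['abc\n123\nxyz']
```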
Note that to achieve a better performance with such a heavy pattern, you should consider unrolling it. It can be done with negated character classes and negative lookaheads.
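One possible unrolled form is sketched below. It is an illustration of the idea rather than the exact pattern from the original answer. It assumes the delimiters are exactly abc, 123 and xyz, so only the characters a, x and 1 can start a delimiter and need a lookahead.

```python
import re

tempered = re.compile(
    r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled sketch: consume runs of characters that cannot start a
# delimiter, and only run the lookahead when the next char is a, x, or 1.
# Negated classes match newlines, so re.DOTALL is not needed here.
unrolled = re.compile(
    r'abc[^ax1]*(?:(?!abc|xyz|123)[ax1][^ax1]*)*123'
    r'[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*xyz')

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"

# Both patterns should agree on this sample string.
assert tempered.findall(s) == unrolled.findall(s)
print(unrolled.findall(s))
```

Whether the unrolled version is actually faster depends on the input and the engine, so it is worth timing on real data before adopting it.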
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point for an abc, xyz or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point for an abc or xyz character sequence
xyz - a trailing substring xyz
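The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows # comments inside the regex. This is just one way to lay it out; the test string is made up.

```python
import re

pattern = re.compile(r"""
    abc                      # literal start delimiter
    (?:(?!abc|xyz|123).)*    # any char that does not start abc, xyz, or 123
    123                      # literal middle string
    (?:(?!abc|xyz).)*        # any char that does not start abc or xyz
    xyz                      # literal end delimiter
""", re.VERBOSE | re.DOTALL)

result = pattern.findall("abc text 123 xyz")
print(result)  # ['abc text 123 xyz']
```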
Note that if re.S (an alias for re.DOTALL) is used, . will match any character, including line breaks.
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
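If the three strings are not fixed, the same pattern can be built from arbitrary literals. The helper below is a hypothetical sketch (the name between and the sample delimiters are made up); it relies on re.escape so that special characters in the delimiters are treated literally.

```python
import re

def between(start, middle, end):
    """Build a tempered-greedy-token regex for arbitrary literal strings.

    Hypothetical helper, not part of the original answer.
    """
    s, m, e = (re.escape(part) for part in (start, middle, end))
    return re.compile(
        rf'{s}(?:(?!{s}|{e}|{m}).)*{m}(?:(?!{s}|{e}).)*{e}', re.DOTALL)

p = between("<<", "id=7", ">>")
found = p.findall("<< << id=7 >> id=7 >> << x >>")
print(found)  # ['<< id=7 >>']
```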
Using PCRE, a solution would be the following, which uses the m flag. If you want to match only whole lines, add ^ and $ at the beginning and end respectively:
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
Debuggex Demo
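The lookahead-based pattern also runs unchanged in Python's re module, so it can be compared with the tempered greedy token on single lines. The sketch below uses made-up inputs. Because the lookaheads scan to the end of the line, the two patterns can disagree when a line holds more than one candidate.

```python
import re

lookahead = re.compile(
    r'abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz')
tempered = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz')

def matches(regex, text):
    # Use finditer: findall would return the capture groups instead
    # of the full match for the lookahead pattern.
    return [m.group(0) for m in regex.finditer(text)]

for line in ["abc abc 123 xyz", "abc 123 xyz abc 123 xyz"]:
    print(repr(line), matches(lookahead, line), matches(tempered, line))
```

On the second line the lookahead version reports only the last abc 123 xyz, while the tempered version reports both, so which one fits depends on whether each line is expected to hold a single candidate.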
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
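As a rough illustration of the same idea outside SQL, the three LIKE conditions can be mirrored with plain string checks in Python. The function name is_clean_span is made up. Like the SQL, it tests a whole value and does not search for substrings inside a larger file.

```python
def is_clean_span(val, start="abc", middle="123", end="xyz"):
    """Whole-string check mirroring the three LIKE conditions.

    Hypothetical helper, not part of the original answer.
    """
    inner = val[len(start):len(val) - len(end)]
    return (
        len(val) >= len(start) + len(middle) + len(end)
        and val.startswith(start) and val.endswith(end)
        and middle in inner                       # like 'abc%123%xyz'
        and start not in val[len(start):]         # not like 'abc%abc%'
        and end not in val[:len(val) - len(end)]  # not like '%xyz%xyz'
    )

print(is_clean_span("abc text 123 xyz"))  # True
print(is_clean_span("abc abc 123 xyz"))   # False
```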