Regular expressions: Ensuring b doesn't come between a and c

Tags:

regex

python-2.7

Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.

I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.

Is this possible with regular expressions?

asked May 15 '16 15:05

Ram Rachum

3 Answers

When your left- and right-hand delimiters are single characters, the problem is easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use:

a[^abc]*c

The same technique applies when you want to make sure there is a b between the closest a and c:

a[^abc]*b[^ac]*c
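As a quick check of the two single-character patterns above, here is a minimal Python sketch. The sample text is made up for illustration and is not from the original question.

```python
import re

# Made-up sample text: three candidate spans between "a" and "c".
text = "a--c a-b-c aa..c"

# Closest a...c pairs with no "b" (and no extra "a" or "c") in between.
no_b = re.findall(r'a[^abc]*c', text)

# Closest a...c pairs that do contain a "b" somewhere in between.
with_b = re.findall(r'a[^abc]*b[^ac]*c', text)

print(no_b)    # ['a--c', 'a..c']
print(with_b)  # ['a-b-c']
```

Note that in "aa..c" the match starts at the second a, because the negated class refuses to step over another a.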

When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:

abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz

To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.

Note that to get better performance from such a heavy pattern, you should consider unrolling it. This can be done with negated character classes and negative lookaheads.
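One possible way to unroll the tempered greedy token is sketched below; it is an illustration, not necessarily the exact form the answer had in mind. Each `(?:(?!...).)*` becomes a run of "safe" characters (those that cannot start a forbidden string), optionally followed by a single a, x or 1 that a lookahead confirms does not begin abc, xyz or 123. The names TEMPERED, UNROLLED and sample are chosen here for the example.

```python
import re

# The original tempered greedy token pattern, for comparison.
TEMPERED = re.compile(
    r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled sketch: negated classes consume safe characters in bulk, and
# lookaheads are only evaluated at an "a", "x" or "1".
# No DOTALL is needed: negated classes already match newlines.
UNROLLED = re.compile(
    r'abc'
    r'[^ax1]*(?:(?:a(?!bc)|x(?!yz)|1(?!23))[^ax1]*)*'  # no abc, xyz, 123
    r'123'
    r'[^ax]*(?:(?:a(?!bc)|x(?!yz))[^ax]*)*'            # no abc, xyz
    r'xyz'
)

sample = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(UNROLLED.findall(sample) == TEMPERED.findall(sample))  # True
```

Whether this is actually faster depends on the input and the regex engine, so it is worth timing on real data before switching.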

Pattern details:

abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence, zero or more times
123 - a literal string 123
(?:(?!abc|xyz).)* - match any character that is not the starting point of an abc or xyz character sequence, zero or more times
xyz - a trailing substring xyz
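The breakdown above can also be kept next to the pattern itself. This is a small sketch using Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments; the variable name pattern is chosen here for illustration.

```python
import re

pattern = re.compile(r'''
    abc                      # leading delimiter
    (?:(?!abc|xyz|123).)*    # any chars, but never start abc, xyz or 123
    123                      # the required middle string
    (?:(?!abc|xyz).)*        # any chars, but never start abc or xyz
    xyz                      # trailing delimiter
''', re.VERBOSE | re.DOTALL)

print(pattern.findall("abc 123 xyz\nabc abc 123 xyz"))
# ['abc 123 xyz', 'abc 123 xyz']
```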

If re.S (an alias for re.DOTALL) is used, . matches any character, including a newline.

See the Python demo:

import re

# re.DOTALL lets . match newlines, so a match may span several lines
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']

answered Oct 11 '22 17:10

Wiktor Stribiżew

Using PCRE, a solution would be the pattern below, used with the m flag. To match only whole lines, add ^ and $ at the beginning and end respectively:

abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz

answered Oct 11 '22 18:10

Jorge Campos

The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:

where val like 'abc%123%xyz' and
      val not like 'abc%abc%' and
      val not like '%xyz%xyz'

I imagine something quite similar is simple to do in other environments.
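For instance, the same three checks could be written in Python with plain string operations. This is only a sketch: the function name is_clean_span and its parameters are hypothetical, and, like the SQL above, it tests a whole candidate string rather than searching inside a larger file.

```python
def is_clean_span(val, start="abc", middle="123", end="xyz"):
    """Mirror the three SQL LIKE conditions with string operations."""
    if len(val) < len(start) + len(end):
        return False
    inner = val[len(start):len(val) - len(end)]
    return (
        # like 'abc%123%xyz'
        val.startswith(start) and val.endswith(end) and middle in inner
        # not like 'abc%abc%': no second start string after the first
        and start not in val[len(start):]
        # not like '%xyz%xyz': no end string before the final one
        and end not in val[:len(val) - len(end)]
    )


print(is_clean_span("abc 123 xyz"))       # True
print(is_clean_span("abc abc 123 xyz"))   # False
print(is_clean_span("abc text xyz xyz"))  # False
```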

answered Oct 11 '22 17:10

Gordon Linoff
