
Python regex negation within regex

Tags:

python

regex

Given:

ABC
content 1
123
content 2
ABC
content 3
XYZ

Is it possible to create a regex that matches the shortest version of "ABC[\W\w]+?XYZ"?

Essentially, I'm looking for "ABC followed by any characters terminating with XYZ, but don't match if I encounter ABC in between" (but think of ABC as a potential regex itself, as it would not always be a set length...so ABC or ABcC could also match)

So, more generally: REGEX1 followed by any character and terminated by REGEX2, not matching if REGEX1 occurs in between.

In this example, I would not want the first 4 lines.
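To make the problem concrete, here is a small sketch of what the lazy pattern does on the sample above. The variable names are only for illustration; the point is that a lazy quantifier shortens the match from wherever it starts, but the search still starts at the first ABC.

```python
import re

# The sample text from the question, one item per line.
text = "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ\n"

# Lazy quantifier: expands as little as possible, but only from the
# leftmost position where a match can begin (the first ABC).
match = re.search(r"ABC[\W\w]+?XYZ", text)

print(match.start())           # 0
print(repr(match.group(0)))    # includes the first four lines as well
```

So the result still contains "content 1", "123" and "content 2", which is exactly the part that should be skipped.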

(I'm sure this explanation could potentially need...further explanation haha)

EDIT:

Alright, I see the need for further explanation now! Thanks for the suggestions thus far. I'll at least give you all more to think about while I start looking into how each of your proposed solutions can be applied to my problem.

Proposal 1: Reverse the string contents and the regex.

This is certainly a very fun hack that solves the problem based on what I explained. In simplifying the issue, I failed to also mention that the same thing could happen in reverse because the ending signature could exist later on also (and has proven to be in my specific situation). That introduces the problem illustrated below:

ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
MNO
content 5
XYZ

In this instance, I would check for something like "ABC through XYZ", meaning to catch [ABC, content 3, XYZ], but would accidentally catch [ABC, content 1, 123, content 2, ABC, content 3, XYZ]. Reversing that would catch [ABC, content 3, XYZ, content 4, MNO, content 5, XYZ] instead of the [ABC, content 3, XYZ] that we want. The point is to make it as generalized as possible, because I will also be searching for things that could have the same starting signature (regex "ABC" in this case) and different ending signatures.

If there is a way to build the regexes so that they encapsulate this sort of limitation, it could prove much easier to just reference that any time I build a regex to search for in this type of string, rather than creating a custom search algorithm that deals with it.

Proposal 2: A+B+C+[^A]+[^B]+[^C]+XYZ with IGNORECASE flag

This seems nice in the case that ABC is finite. Think of it as a regex in itself though. For example:

Hello!GoodBye!Hello.Later.

VERY simplified version of what I'm trying to do. I would want "Hello.Later." given the start regex Hello[!.] and the end regex Later[!.]. Running something as simple as Hello[!.].*?Later[!.] would grab the entire string, but I'm looking to say that if the start regex Hello[!.] occurs again between the first start match and the first end match, the earlier start should be dropped and the match should begin at the later one.
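A quick check of that behaviour, assuming a lazy `.*?` is what sits between the start and end patterns (the names below are illustrative only):

```python
import re

text = "Hello!GoodBye!Hello.Later."

# Start regex, lazy filler, end regex. The search begins at the first
# "Hello!", so the lazy filler has to swallow "GoodBye!Hello." to reach
# the only "Later." in the string.
naive = re.search(r"Hello[!.].*?Later[!.]", text)

print(naive.group(0))  # Hello!GoodBye!Hello.Later.
```

The desired result is only the trailing "Hello.Later.", so the lazy quantifier alone is not enough here.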

The convo below this proposal indicates that I might be limited by regular language limitations similar to the parentheses matching problem (Google it, it's fun to think about). The purpose of this post is to see if I do in fact have to resort to creating an underlying algorithm that handles the issue I'm encountering. I would very much like to avoid it if possible (in the simple example that I gave you above, it's pretty easy to design a finite state machine for...I hope that holds as it grows slightly more complex).

Proposal 3: ABC(?:(?!ABC).)*?XYZ with DOTALL flag

I like the idea of this if it actually allows ABC to be a regex. I'll have to explore this when I get into the office tomorrow. Nothing looks too out of the ordinary at first glance, but I'm entirely new to Python regex (and new to actually applying regexes in code instead of just in theory homework).

aheuertz asked Feb 21 '23 04:02


1 Answer

A regex solution would be ABC(?:(?!ABC).)*?XYZ with the DOTALL flag.
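As a rough sketch of how that pattern behaves, and of how it might be generalized when the start and end are themselves regexes: the helper name `shortest_between` is made up for this example, and it assumes the start pattern is safe to repeat inside a negative lookahead (for instance, no backreferences that depend on group numbering).

```python
import re

def shortest_between(start, end, text):
    """Hypothetical helper: return spans that begin with a match of `start`,
    end with a match of `end`, and contain no other `start` in between."""
    pattern = re.compile(
        # (?:(?!START).)*? consumes one character at a time, lazily,
        # and refuses to step onto a position where START matches again.
        rf"(?:{start})(?:(?!{start}).)*?(?:{end})",
        re.DOTALL,  # let "." cross newlines, as the answer suggests
    )
    return [m.group(0) for m in pattern.finditer(text)]

sample = (
    "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\n"
    "XYZ\ncontent 4\nMNO\ncontent 5\nXYZ\n"
)

print(shortest_between("ABC", "XYZ", sample))
# ['ABC\ncontent 3\nXYZ']

print(shortest_between(r"Hello[!.]", r"Later[!.]", "Hello!GoodBye!Hello.Later."))
# ['Hello.Later.']
```

The lookahead is what rules out the earlier ABC: a match attempted from the first ABC fails when it reaches the second one, so the engine moves on and the match begins at the second ABC. The lazy `*?` then stops at the first XYZ, which handles the case where the ending signature shows up again later.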

MRAB answered Feb 25 '23 14:02