
How can I match any text between two regular expressions, except text that matches a certain regex, in Python?

Tags: python, regex

I'm trying to write a regular expression to sift through 3 MB of text and find certain strings. Right now it works relatively well, except for one problem.

The current expression I'm using is

pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')

This effectively searches through the enormous string and finds all occurrences of 4 uppercase alpha characters, followed by a space, followed by 3 digits, followed by any 4-40 characters, followed by a space, followed by (n) where n is a single digit.

What I'm looking for is something like ACCT 220 Principles of Accounting I (3)

This is exactly what I want, except that it sometimes starts matching too early. In some places in the document, another class code immediately precedes the class where the match is supposed to start. For example, I'll end up with BMGT 310.ACCT 220 Principles of Accounting I (3)
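A minimal reproduction of the problem, assuming a short sample string in place of the real 3 MB document (the variable names here are only illustrative):

```python
import re

# Pattern from the question.
pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')

# Hypothetical sample standing in for the real document text.
sample = "BMGT 310.ACCT 220 Principles of Accounting I (3)"

# The match starts at BMGT because ".ACCT 220 Principles of Accounting I"
# is 36 characters long, which fits inside .{4,40}.
matches = pattern.findall(sample)
print(matches)
```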

I figured one way to get around this would be to not allow patterns to contain 4 upper case letters in the .{4,40} portion of the regular expression. I've tried using ^ to no avail.

For example I tried something along the lines of [A-Z]{4} \d{3}([^A-Z]{4}){4,40} \(\d\) but then I end up with an empty list since the expression didn't find anything.
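A small sketch of why that attempt comes back empty, run on an assumed sample string: [^A-Z]{4} demands four characters that are all non-uppercase, and the group repeats that 4 to 40 times, so a single capital letter in the title (the P of Principles, for instance) is enough to rule the match out.

```python
import re

# The attempted pattern from the question.
attempt = re.compile(r'[A-Z]{4} \d{3}([^A-Z]{4}){4,40} \(\d\)')

# Hypothetical sample: a line that should have matched.
sample = "ACCT 220 Principles of Accounting I (3)"

# After "ACCT 220" the pattern needs at least 16 characters with no
# uppercase letter at all, but "P" appears as the second character.
attempt_matches = attempt.findall(sample)
print(attempt_matches)
```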

I'm thinking that I just don't understand regex syntax well enough yet. If anyone knows how to fix my expression so that it will find all instances of 4 uppercase letters, followed by a space, followed by three digits, followed by any 4-40 characters that do NOT contain 4 capital letters in a row, followed by a space, followed by (n) where n is a digit, that would be awesome and greatly appreciated.

I understand this question might be rather confusing. If you need any more information from me, please let me know.

Troy Kent asked Oct 02 '22 00:10


1 Answer

If you don't want to match 4 uppercase letters in a row, you can use a negative lookahead and match 1 character at a time, repeated with {4,40}:

Piece of your current working regex:

.{4,40}

To be changed to:

(?:(?![A-Z]{4}).){4,40}

regex101 demo

A negative lookahead (?! ... ) makes the match fail at the current position if what's inside it matches there. Since we have (?![A-Z]{4}), the match will fail wherever 4 uppercase letters in a row begin. Lookaheads are zero-width assertions: they check the text without consuming any of it, so the final match isn't affected, which is also why I'm still using a . to do the actual matching.


A simple example that might help explain how a negative lookahead works, and what a zero-width assertion is, is this:

w(?!o)

This regex will match the w in way, whole and below, but not the w in word. Note that only the w itself is matched: the character after it is checked but never becomes part of the match.
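A quick way to check this in Python, as a small sketch using the four words mentioned above (the variable names are illustrative only):

```python
import re

# w that is NOT immediately followed by o.
lookahead = re.compile(r'w(?!o)')

words = ["way", "whole", "below", "word"]

# Keep only the words in which the pattern finds a match.
matched = [word for word in words if lookahead.search(word)]
print(matched)

# The lookahead is zero-width: the match is just the single character "w".
first = lookahead.search("way")
print(first.group(), first.span())
```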

(?![A-Z]{4}). will thus match any single character, unless that character is an uppercase letter followed by 3 more uppercase letters (making 4 consecutive uppercase letters).

To repeat this . now, you cannot just use (?![A-Z]{4}).{4,40} because the negative lookahead will only apply to the first . and not the others. The trick is thus to put (?![A-Z]{4}). in a group and then repeat:

((?![A-Z]{4}).){4,40}
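To make the difference concrete, here is a small sketch comparing the two forms on an assumed test string, ".ACCT", where the run of 4 uppercase letters starts at the second character:

```python
import re

# Lookahead checked once, at the first position only.
once = re.compile(r'(?![A-Z]{4}).{4,40}')

# Lookahead checked before every single character.
each = re.compile(r'((?![A-Z]{4}).){4,40}')

# Hypothetical test string.
sample = ".ACCT"

# The first form only inspects position 0, sees ".ACC", and lets the rest through.
once_result = once.fullmatch(sample)

# The second form stops at "ACCT" after one character, short of the minimum of 4.
each_result = each.fullmatch(sample)

print(once_result is not None, each_result is not None)
```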

Last, I prefer using non-capture groups (?: ... ) because they make the regex a bit more efficient since they don't store captures:

(?:(?![A-Z]{4}).){4,40}
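
Putting it together, a minimal sketch of the full pattern from the question with this piece swapped in, run on an assumed sample string rather than the real document. It also hints at a second, Python-specific reason to prefer the non-capturing group: when a pattern contains a capturing group, re.findall returns the group's contents instead of the whole match.

```python
import re

# The question's pattern with .{4,40} replaced by the lookahead version.
fixed = re.compile(r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,40} \(\d\)')

# Hypothetical sample containing the problematic "BMGT 310." prefix.
sample = "BMGT 310.ACCT 220 Principles of Accounting I (3)"

# The match can no longer start at BMGT, because "ACCT" would fall
# inside the repeated part.
fixed_matches = fixed.findall(sample)
print(fixed_matches)

# Same pattern with a capturing group: findall now returns what the
# group captured (a single character), not the whole match.
capturing = re.compile(r'[A-Z]{4} \d{3}((?![A-Z]{4}).){4,40} \(\d\)')
captured = capturing.findall(sample)
print(captured)
```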

Jerry answered Oct 12 '22 23:10