Regular expression matching all but a string

Tags: python, regex

I need to find all the strings matching a pattern with the exception of two given strings.

For example, find all groups of letters with the exception of aa and bb. Starting from this string:

-a-bc-aa-def-bb-ghij-

Should return:

('a', 'bc', 'def', 'ghij')

I tried with this regular expression that captures 4 strings. I thought I was getting close, but (1) it doesn't work in Python and (2) I can't figure out how to exclude a few strings from the search. (Yes, I could remove them later, but my real regular expression does everything in one shot and I would like to include this last step in it.)

I said it doesn't work in Python because I tried this, expecting the exact same result, but instead I get only the first group:

>>> import re
>>> re.search('-(\w.*?)(?=-)', '-a-bc-def-ghij-').groups()
('a',)

I tried with negative look ahead, but I couldn't find a working solution for this case.

stenci asked Feb 07 '23 02:02


2 Answers

You can use a negative lookahead.

For example,

>>> import re
>>> string = '-a-bc-aa-def-bb-ghij-'
>>> re.findall(r'-(?!aa|bb)([^-]+)', string)
['a', 'bc', 'def', 'ghij']

  • - Matches a literal -

  • (?!aa|bb) Negative lookahead: checks that the - is not followed by aa or bb

  • ([^-]+) Matches one or more characters other than -
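
As a side note on the attempt in the question: re.search stops at the first match, which is why only ('a',) came back. A minimal sketch, reusing the question's own pattern (written here as a raw string) and test string, suggests that the pattern itself is fine once re.findall is used:

```python
import re

s = "-a-bc-def-ghij-"
pattern = r"-(\w.*?)(?=-)"  # the pattern from the question, as a raw string

# re.search returns only the first match; .groups() gives its capture groups.
first_only = re.search(pattern, s).groups()

# re.findall scans the whole string and returns every group-1 capture.
all_matches = re.findall(pattern, s)

print(first_only)   # ('a',)
print(all_matches)  # ['a', 'bc', 'def', 'ghij']
```

This only addresses point (1) of the question; excluding aa and bb still needs the negative lookahead.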


Edit

The regex above also rejects strings that merely start with aa or bb, such as -aabc-. To exclude only the exact strings aa and bb, add a - to each alternative in the lookahead:

>>> re.findall(r'-(?!aa-|bb-)([^-]+)', string)
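
To make the difference between the two lookaheads concrete, here is a small sketch. The test string with -aabc- and -bbq- is made up for illustration and is not from the question:

```python
import re

# Hypothetical input: contains the exact strings aa and bb, plus two
# longer strings (aabc, bbq) that only start with them.
sample = "-a-aabc-aa-bb-bbq-"

# Prefix check: rejects anything that starts with aa or bb.
prefix_excluded = re.findall(r"-(?!aa|bb)([^-]+)", sample)

# Exact check: rejects only aa and bb when followed directly by a hyphen.
exact_excluded = re.findall(r"-(?!aa-|bb-)([^-]+)", sample)

print(prefix_excluded)  # ['a']
print(exact_excluded)   # ['a', 'aabc', 'bbq']
```

On the original string from the question both patterns give the same result, since it has no longer strings beginning with aa or bb.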

nu11p01n73R answered Feb 08 '23 15:02


You need to use a negative lookahead to restrict a more generic pattern, and a re.findall to find all matches.

Use

res = re.findall(r'-(?!(?:aa|bb)-)(\w+)(?=-)', s)

or, if the values between hyphens can contain any character other than a hyphen, use a negated character class [^-]:

res = re.findall(r'-(?!(?:aa|bb)-)([^-]+)(?=-)', s)

Here is the regex demo.

Details:

  • - - a hyphen
  • (?!(?:aa|bb)-) - if aa- or bb- follows the first hyphen, no match is returned
  • (\w+) - Group 1 (the value returned by the re.findall call), capturing one or more word characters; alternatively [^-]+, one or more characters other than -
  • (?=-) - there must be a - after the captured characters. The lookahead checks for this hyphen without consuming it, so the same hyphen can serve as the starting point of the next match.
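
A quick sketch of why the trailing hyphen sits in a lookahead: if the pattern consumes it instead, that hyphen is no longer available to start the next match, and every value directly after a match is skipped. The variant with a consuming hyphen is shown only for comparison:

```python
import re

s = "-a-bc-aa-def-bb-ghij-"

# Trailing hyphen inside a lookahead: checked but not consumed.
with_lookahead = re.findall(r"-(?!(?:aa|bb)-)(\w+)(?=-)", s)

# Trailing hyphen consumed by the match: 'bc' is lost, because the
# hyphen before it was already used up by the match for 'a'.
consuming = re.findall(r"-(?!(?:aa|bb)-)(\w+)-", s)

print(with_lookahead)  # ['a', 'bc', 'def', 'ghij']
print(consuming)       # ['a', 'def', 'ghij']
```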

Python demo:

import re
p = re.compile(r'-(?!(?:aa|bb)-)([^-]+)(?=-)')
s = "-a-bc-aa-def-bb-ghij-"
print(p.findall(s)) # => ['a', 'bc', 'def', 'ghij']

Wiktor Stribiżew answered Feb 08 '23 15:02