 

Regex that matches a word but only if another word doesn't appear?

Tags: python, regex

I'm usually pretty good with regex, but I'm struggling with this one. I need a regular expression that matches the term cbd, but not if the phrase central business district appears anywhere else in the search string. If that is too difficult, it should at least match cbd only when the phrase central business district does not appear anywhere before it. Only the cbd part should be returned as the result, so I'm using lookaheads/lookbehinds, but I have not been able to meet the requirements.

Input examples:
GOOD Any products containing CBD are to be regulated.
BAD    Properties located within the Central Business District (CBD) are to be regulated

I have tried:

  • (?!central business district)cbd
  • (.*(?!central business district).*)cbd
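To see why both attempts still match the BAD input, here is a minimal check with the standard re module. A negative lookahead is only tested at the position where the engine currently stands, so neither pattern ever inspects the text before cbd. The re.IGNORECASE flag and the helper name matched_text are assumptions added for this sketch, since the sample inputs are capitalised.

```python
import re

good = "Any products containing CBD are to be regulated."
bad = "Properties located within the Central Business District (CBD) are to be regulated"

attempts = [
    r"(?!central business district)cbd",
    r"(.*(?!central business district).*)cbd",
]

def matched_text(pattern, text):
    """Return the full match for pattern in text, or None if there is none."""
    m = re.search(pattern, text, re.IGNORECASE)
    return m.group() if m else None

for pattern in attempts:
    # Both attempts report a match for the BAD input as well as the GOOD one.
    print(pattern, "| GOOD:", matched_text(pattern, good), "| BAD:", matched_text(pattern, bad))
```

The first pattern only checks that the text starting at cbd is not the phrase itself, which is always true. The second lets the two .* parts absorb the phrase, and its full match also includes everything before cbd.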

This is in Python 3.6+ using the re module.

I know this would be easy to accomplish with a couple of lines of code, but we keep a list of regex strings in a database and use them to search a corpus for documents that match any one of them. We want to avoid hard-coding keywords into the scripts, because our other developers would not be able to tell where the matches come from if the keywords are not visible in the database.
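For context, the setup described above looks roughly like the sketch below. The rows, ids, and the function name matching_pattern_ids are hypothetical stand-ins for the real database table and script, and re.IGNORECASE is an assumption.

```python
import re

# Hypothetical rows as they might come back from the database: (id, pattern).
pattern_rows = [
    (1, r"\bcbd\b"),
    (2, r"\bhemp\b"),
]

# Compile once so each document is searched without re-parsing the patterns.
compiled = [(row_id, re.compile(p, re.IGNORECASE)) for row_id, p in pattern_rows]

def matching_pattern_ids(document):
    """Return the ids of every stored pattern found in the document."""
    return [row_id for row_id, rx in compiled if rx.search(document)]

print(matching_pattern_ids("Any products containing CBD are to be regulated."))
```

Because every keyword lives in the pattern rows, the exclusion of central business district has to be expressed inside the regex string itself.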

mevers303 asked Oct 27 '25 04:10


1 Answer

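Use the PyPI regex module, which supports variable-length lookbehind (the built-in re module requires fixed-width lookbehind):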

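import regex  # third-party package: pip install regex

strings = [' I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string.', 'I need cbd here.']
# The lookbehind rejects the phrase anywhere before cbd, the lookahead anywhere after it.
pattern = r'(?<!central business district.*)cbd(?!.*central business district)'
for s in strings:
    # regex.S lets . match newlines, so both checks span the whole string.
    x = regex.search(pattern, s, regex.S)
    if x:
        print(s, x.group(), sep=" => ")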

Result: only the second string matches, and the script prints I need cbd here. => cbd

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    central business         'central business district'
    district
--------------------------------------------------------------------------------
    .*                       any character, including \n because
                             regex.S is set (0 or more times,
                             matching as many as possible)
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  cbd                      'cbd'
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character, including \n because
                             regex.S is set (0 or more times,
                             matching as many as possible)
--------------------------------------------------------------------------------
    central business         'central business district'
    district
--------------------------------------------------------------------------------
  )                        end of look-ahead
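If installing a third-party module is not an option, a different technique works with the standard re module: a single negative lookahead anchored at the start of the string, with cbd returned through a capture group instead of as the whole match. This is a sketch of that alternative, not the pattern explained above. The name find_cbd and the re.IGNORECASE flag are assumptions.

```python
import re

# \A anchors at the start, so the lookahead scans the entire string once.
# re.DOTALL plays the role of regex.S; re.IGNORECASE is an assumption.
PATTERN = re.compile(
    r"\A(?!.*central business district).*?(cbd)",
    re.IGNORECASE | re.DOTALL,
)

def find_cbd(text):
    """Return the matched cbd text, or None if the phrase appears anywhere."""
    m = PATTERN.search(text)
    # The full match includes the leading text, so only group 1 is returned.
    return m.group(1) if m else None

print(find_cbd("Any products containing CBD are to be regulated."))
print(find_cbd("Properties located within the Central Business District (CBD) are to be regulated"))
```

The trade-off is that the calling code must read group 1 (or m.span(1) for the position) rather than the whole match.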
Ryszard Czech answered Oct 28 '25 18:10