
Add exception to complicated regex

Tags: python, regex

I have a fairly complex regular expression.

But I have a problem with it. The # and ++ characters are removed when they are followed by more punctuation and then letters (as in C#/.Net and C++/.Net).

Question: How can I add an exception to the current regex so that the C++ and C# tokens are kept?

I'm using the following code:

import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)
print(re.sub(r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)', ' ', text))

It gives the following result:

'Must-have skills   .Net programming experience   2 years experience in C++  C .Net  C .Net  C .Net '

Desired result:

'Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C .Net '

Current regex details

  • (?i) - case insensitive mode on
  • (?:(?!\.net\b|\b-\b)[^\w\s])+ - any punctuation char ([^\w\s]), 1 or more occurrences, as many as possible, that does not start any of the sequences:
    • \.net\b - .net as whole word
    • | - or
    • \b-\b - a hyphen enclosed with word chars
  • (?=[^\w\s]*\b) - a positive lookahead that requires 0+ punctuation chars followed by a word boundary position immediately to the right of the current location.
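To see where the tokens get lost, it can help to list exactly what the second pattern matches. The following is a minimal sketch that reuses the two patterns from the question; the variable names are only illustrative.

```python
import re

text = ('Must-have skills: -.Net programming experience; '
        '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')

# Step 1 from the question: blank out punctuation followed by a space or the end.
preprocessed = re.sub(r'[!,.:;—](?= |$)', ' ', text)

# Step 2: the pattern has no capturing groups, so findall returns whole matches.
pattern = r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)'
matches = re.findall(pattern, preprocessed)
print(matches)  # ['-', '-', '#/', '++/', './']
```

The group greedily consumes every punctuation character in front of .Net, so # and ++ are swallowed together with the slash. The C++ that is followed by a space survives only because the lookahead finds no word boundary there.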
asked Jan 26 '23 14:01 by lemon


2 Answers

Edit

#1

Same as #2 below but much shorter: the characters that may precede the matched punctuation are all defined in one character class.

>>> import re
>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
>>> re.sub(r'(?:(?<!\S)|(?<=[\s\+\.C#]))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)
'Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C  .Net '


Explanation

  • This answer is effectively the same as #2 below, but instead of declaring one lookbehind per character that may precede the matched punctuation, it lists all of those characters in a single character class inside one lookbehind.
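For readability, the same pattern can be written in verbose mode with comments. This is only a sketch of that idea, not part of the original answer; the names PUNCT_PATTERN and clean are made up for illustration. It also shows a side effect worth knowing about: because C and . are in the lookbehind set, any listed punctuation right after them is replaced, even outside C# and C++ tokens.

```python
import re

# Verbose-mode spelling of the one-line pattern above (same alternatives, same order).
PUNCT_PATTERN = re.compile(r"""
    (?: (?<!\S)            # at the start of the string or right after whitespace
      | (?<=[\s+.C\#])     # or right after whitespace, '+', '.', 'C' or '#'
    )
    [-!,.:;—/]             # one punctuation character to blank out
  |
    [-!,.:;—/] (?=\s|$)    # or that punctuation followed by whitespace or the end
""", re.VERBOSE)


def clean(text):
    """Replace the matched punctuation with single spaces (hypothetical helper)."""
    return PUNCT_PATTERN.sub(' ', text)


sample = ('Must-have skills: -.Net programming experience; '
          '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')
print(clean(sample))
print(clean('PC-based tools'))  # the hyphen after "C" is replaced as well
```

If that side effect matters for your data, the lookbehind set would need to be narrowed; for the sample string in the question it makes no difference.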


#2

A rather dirty solution, but it works (explanation below):

>>> import re
>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
>>> re.sub(r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)
'Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C  .Net '


Edit: Explanation

  • The pattern opens with (?:, a non-capturing group. It groups the alternatives that describe what may come immediately before the punctuation I want to replace.
  • Inside the group are lookbehinds, which start with (?<! (negative) or (?<= (positive). I wrote one lookbehind per character and joined them with |, so each alternative says what the matched punctuation should or should NOT be preceded by.
  • (?<!\S) says [\-!,.:;—/] is matched when it is NOT preceded by a non-whitespace character, that is, when it sits at the start of the string or right after whitespace.
  • |(?<=\s) says OR [\-!,.:;—/] is matched when it is preceded by a single whitespace character. This overlaps with (?<!\S), which already covers that case; I kept both as a safety net.
  • |(?<=\+)|(?<=\.)|(?<=C)|(?<=#) says OR [\-!,.:;—/] is matched when it is preceded by +, ., C, or #. So the . in C./ is matched because it follows C (remember (?<=C)), the ; in C++; is matched because it follows + (remember (?<=\+)), and the / in C#/ is matched because it follows # (remember (?<=#)).
  • The last ) before the first [\-!,.:;—/] closes the (?: group.
  • | is OR, as you know. The second condition looks at what follows the punctuation rather than what precedes it, so I have to repeat [\-!,.:;—/] and add the lookahead (?=\s|$), which says: match [\-!,.:;—/] if it is followed by whitespace or by the end of the string. A lookahead can contain alternatives of any length. In Python's re module a lookbehind can contain | as well, but all of its alternatives must match the same fixed number of characters.
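The separate lookbehinds can also be folded into one, since Python's re module accepts alternation inside a lookbehind as long as every alternative has the same width. Here is a small sketch that checks this on the sample string; the variable names are illustrative assumptions, not part of the answer.

```python
import re

text = ('Must-have skills: -.Net programming experience; '
        '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')

# Pattern from #2: one lookbehind per preceding character.
separate = r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)'
# Same conditions with the positive lookbehinds merged into a single one.
merged = r'(?:(?<!\S)|(?<=\s|\+|\.|C|#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)'

same_result = re.sub(separate, ' ', text) == re.sub(merged, ' ', text)
print(same_result)

# Alternatives of different widths (1 and 2 characters here) are rejected.
try:
    re.compile(r'(?<=C|C\+)/')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

The negative lookbehind (?<!\S) stays separate because it cannot be combined with the positive alternatives inside one (?<=...).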
answered Feb 08 '23 16:02 by FailSafe


The output is not quite identical to yours, but it differs only in whitespace. I got there by reversing the order of the two re.sub calls and adding two negative lookbehinds, (?<!C) and (?<!C\+).

import re

text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'(?i)(?:(?!\.net\b|\b-\b)(?<!C)(?<!C\+)[^\w\s])+(?=[^\w\s]*\b)', ' ', text)
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)

Output:

print(text)
Must-have skills   .Net programming experience   2 years experience in C++  C# .Net  C++ .Net  C  .Net 
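To check that this result differs from the desired string only in whitespace, both can be normalized before comparing. This is a minimal sketch; the normalize helper is a made-up name, and the desired string is copied from the question.

```python
import re


def normalize(s):
    """Collapse every run of whitespace to one space and trim the ends."""
    return ' '.join(s.split())


text = ('Must-have skills: -.Net programming experience; '
        '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')
text = re.sub(r'(?i)(?:(?!\.net\b|\b-\b)(?<!C)(?<!C\+)[^\w\s])+(?=[^\w\s]*\b)', ' ', text)
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)

desired = ('Must-have skills   .Net programming experience   '
           '2 years experience in C++  C# .Net  C++ .Net  C .Net ')

print(text == desired)                        # exact comparison
print(normalize(text) == normalize(desired))  # comparison ignoring whitespace runs
```

The exact comparison fails because of the extra space in "C  .Net", while the normalized comparison succeeds. Note that (?i) applies to the whole pattern, so (?<!C) also blocks a match right after a lowercase c.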
answered Feb 08 '23 16:02 by CT Hall