I have a fairly complex regular expression, but there is a problem with it: the # and ++ characters are removed when letters follow them.
Question: how can I add an exception to the current regex for the C++ and C# tokens?
I've used the following regex:
import re
text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)
print(re.sub(r'(?i)(?:(?!\.net\b|\b-\b)[^\w\s])+(?=[^\w\s]*\b)', ' ', text))
And I got the following result:
'Must-have skills .Net programming experience 2 years experience in C++ C .Net C .Net C .Net '
Desired result:
'Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net '
Current regex details
- (?i) - case-insensitive mode on
- (?:(?!\.net\b|\b-\b)[^\w\s])+ - any punctuation char ([^\w\s]), 1 or more occurrences, as many as possible, that does not start any of the sequences:
  - \.net\b - .net as a whole word
  - | - or
  - \b-\b - a hyphen enclosed with word chars
- (?=[^\w\s]*\b) - a positive lookahead that requires 0+ punctuation chars followed by a word boundary position immediately to the right of the current location.
Edit #1
Same as #2 below but much shorter: all the characters that may precede the matched one are defined in a single set.
>>> import re
>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
>>> re.sub(r'(?:(?<!\S)|(?<=[\s\+\.C#]))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)
#Output
'Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net '
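For reuse, the pattern from #1 can be wrapped in a small function. This is only a sketch: the names PUNCT, CLEANUP and clean are made up for illustration, and the final whitespace collapse is an extra step that is not part of the answer (it also drops the trailing space).

```python
import re

TEXT = ('Must-have skills: -.Net programming experience; '
        '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')

# Punctuation that the answer treats as removable.
PUNCT = r'[\-!,.:;—/]'

# Hypothetical name; the pattern is the one from #1, split for readability.
CLEANUP = re.compile(
    r'(?:(?<!\S)|(?<=[\s\+\.C#]))' + PUNCT  # at start / after whitespace, +, ., C or #
    + r'|' + PUNCT + r'(?=\s|$)'            # or followed by whitespace / end of string
)

def clean(text):
    """Replace the matched punctuation with spaces, then collapse whitespace runs."""
    replaced = CLEANUP.sub(' ', text)
    return ' '.join(replaced.split())  # assumption: collapsing is wanted

print(clean(TEXT))
```

Without the last line of clean, the result keeps runs of several spaces wherever adjacent punctuation was replaced.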
#2
A rather dirty solution, but it works.
I'll post an explanation later and may refine it for readability.
>>> import re
>>> text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
>>> re.sub(r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))[\-!,.:;—/]|[\-!,.:;—/](?=\s|$)', ' ', text)
#Output
'Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net '
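Patterns #1 and #2 differ only in how the lookbehind is written: one character class versus several single-character lookbehinds. A quick check, assuming Python's built-in re module (the helper name lookbehind_compiles is made up), that both give the same result on the sample text, and of where the real lookbehind restriction lies:

```python
import re

TEXT = ('Must-have skills: -.Net programming experience; '
        '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')

PUNCT = r'[\-!,.:;—/]'
TAIL = r'|' + PUNCT + r'(?=\s|$)'

# #1: one lookbehind with a character class.
one_class = re.compile(r'(?:(?<!\S)|(?<=[\s\+\.C#]))' + PUNCT + TAIL)
# #2: one lookbehind per preceding character.
separate = re.compile(
    r'(?:(?<!\S)|(?<=\s)|(?<=\+)|(?<=\.)|(?<=C)|(?<=#))' + PUNCT + TAIL)

same = one_class.sub(' ', TEXT) == separate.sub(' ', TEXT)
print(same)

def lookbehind_compiles(pattern):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of equal width are accepted inside a lookbehind ...
print(lookbehind_compiles(r'(?<=C|#)/'))
# ... but alternatives of different widths are rejected by `re`.
print(lookbehind_compiles(r'(?<=C|\+\+)/'))
```

So the restriction in re is on the width of the lookbehind, not on using a set or | inside it.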
Edit: Explanation
(?: opens a non-capturing group. It holds the alternatives describing what may come before the punctuation I want to match. I wrote each condition as its own (?<!...) or (?<=...) lookbehind, joined with |, to say what the matched character should or should NOT be preceded by. (A single character class inside one lookbehind also works, which is what #1 does.)
(?<!\S) isn't really needed, but I included it as a safety net. It says a character from [\-!,.:;—/] is matched only when it is NOT preceded by a non-whitespace character, i.e. at the start of the string or right after whitespace.
|(?<=\s) says OR a character from [\-!,.:;—/] is matched if it is preceded by a single whitespace character.
|(?<=\+)|(?<=\.)|(?<=C)|(?<=#) says OR a character from [\-!,.:;—/] is matched if it is preceded by +, ., C or #. So the . in [\-!,.:;—/] is matched when it is preceded by C, as in your string (that's (?<=C)), and the ; is matched when it is preceded by + (that's (?<=\+)).
The ) before the next | closes the (?: group.
| is OR, as you know. Because I can't express everything in one statement, I repeat [\-!,.:;—/] and add a lookahead to say: match a character from [\-!,.:;—/] if it is followed by whitespace or the end of the string. Inside a lookahead you can use | between alternatives of any length, whereas in Python's re a lookbehind must have a fixed width.

It's not quite the same as your output, but I was able to do this, with only a difference in whitespace, by reversing the order of the two re.sub calls and adding negative lookbehinds.
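import re
text = 'Must-have skills: -.Net programming experience; -2 years experience in C++; C#/.Net, C++/.Net, C./.Net.'
# First pass: punctuation runs before a word, skipping chars right after "C" or "C+"
text = re.sub(r'(?i)(?:(?!\.net\b|\b-\b)(?<!C)(?<!C\+)[^\w\s])+(?=[^\w\s]*\b)', ' ', text)
# Second pass: punctuation followed by a space or the end of the string
text = re.sub(r'[!,.:;—](?= |$)', ' ', text)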
Output:
print(text)
Must-have skills .Net programming experience 2 years experience in C++ C# .Net C++ .Net C .Net
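The order of the two substitutions matters here. A sketch comparing both orders on the sample text; the names PROTECTED, TRAILING, squeeze, lookbehind_first and trailing_first are invented for illustration, and whitespace is collapsed only to make the comparison easier to read:

```python
import re

TEXT = ('Must-have skills: -.Net programming experience; '
        '-2 years experience in C++; C#/.Net, C++/.Net, C./.Net.')

# Same two patterns as in the answer above, precompiled.
PROTECTED = re.compile(
    r'(?i)(?:(?!\.net\b|\b-\b)(?<!C)(?<!C\+)[^\w\s])+(?=[^\w\s]*\b)')
TRAILING = re.compile(r'[!,.:;—](?= |$)')

def squeeze(s):
    """Collapse runs of whitespace (an extra step, not part of the answer)."""
    return ' '.join(s.split())

def lookbehind_first(text):
    # Order used in the answer: lookbehind pattern, then trailing punctuation.
    return squeeze(TRAILING.sub(' ', PROTECTED.sub(' ', text)))

def trailing_first(text):
    # Order from the question: trailing punctuation, then lookbehind pattern.
    return squeeze(PROTECTED.sub(' ', TRAILING.sub(' ', text)))

print(lookbehind_first(TEXT))
print(trailing_first(TEXT))
```

With the question's order, the dot in "C./.Net" is protected by (?<!C) in the second pass and never followed by a space in the first, so it survives as "C. .Net". Running the lookbehind pattern first turns the "/" into a space, which lets the trailing-punctuation pass remove that dot.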