
How to modify text that matches a particular regular expression in Python?

I need to mark negative contexts in a sentence. The algorithm goes as follows:

  1. Detect a negator (not, never, ain't, don't, etc.)
  2. Detect a clause-ending punctuation mark (.;:!?)
  3. Add _NEG to all the words between the two.

Now, I have defined a regex to pick out all such occurrences:

import re

def replacenegation(text):
    match=re.search(r"((\b(never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)\b)|\b\w+n't\b)((?![.:;!?]).)*[.:;!?\b]", text)
    if match:
        s=match.group(0)  # the matched clause, from the negator to the punctuation
        print(s)
        wlist=re.split(r"[.:;!? ]" , s)
        print(wlist)
        news=""
        for w in wlist:
            if w:
                news=news+" "+w+"_NEG"
        print(news)

I can detect and replace the matched group. However, I don't know how to recreate the complete sentence after this operation. Also for multiple matches, match.groups() gives me wrong output.

For example, if my input sentence is:

I don't like you at all; I should not let you know my happiest secret.

Output should be:

I don't like_NEG you_NEG at_NEG all_NEG ; I should not let_NEG you_NEG know_NEG my_NEG happiest_NEG secret_NEG .

How do I do this?

Avijit asked Jan 01 '16 08:01


1 Answer

First of all, it is better to replace the negative look-ahead ((?![.:;!?]).)* with a negated character class, [^.:;!?]*.
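
As a rough illustration of that point, the sketch below compares the look-ahead form with a negated character class on a made-up sample string. The sample text and the variable names are assumptions for the example, not part of the question.

```python
import re

# Hypothetical sample text, only for illustration.
sample = "not good at all; fine."

# The question's form: a look-ahead is checked before every single character.
tempered = re.compile(r"not((?![.:;!?]).)*")
# The negated character class expresses the same condition directly.
negated = re.compile(r"not[^.:;!?]*")

tempered_match = tempered.search(sample).group(0)
negated_match = negated.search(sample).group(0)

# Both stop just before the first clause-ending punctuation mark.
print(tempered_match == negated_match)
```

One difference to keep in mind: without re.DOTALL the dot does not match a newline, while [^.:;!?] does, so the two forms can diverge on multi-line input.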


Then you need to use a non-capturing group and remove the extra groups around your negative words: because you have wrapped them in three capture groups, a negative word such as not is returned three times. After that you can use re.findall() to find all the matches:
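
To make the effect of the extra groups visible, here is a minimal sketch with a shortened word list; the sample sentence and the two patterns are simplified assumptions, not the full regex from the question.

```python
import re

# Hypothetical sample sentence, only for illustration.
text = "I should not go."

# Three nested capturing groups: findall reports the same word once per group.
nested = re.findall(r"((\b(never|not)\b))", text)
# A non-capturing group (?:...) groups the alternatives without reporting them.
flat = re.findall(r"\b(?:never|not)\b", text)

print(nested)
print(flat)
```

With no capturing group left in the pattern, re.findall() returns the whole match as a plain string instead of a tuple.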

>>> import re
>>> s = "I don't like you at all; I should not let you know my happiest secret."
>>> regex = re.compile(r"(\b(?:never|no|nothing|nowhere|noone|none|not|havent|hasnt|hadnt|cant|couldnt|shouldnt|wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint)\b|\b\w+n't\b)([^.:;!?]*)([.:;!?])")
>>> regex.findall(s)
[("don't", ' like you at all', ';'), ('not', ' let you know my happiest secret', '.')]

Or, to replace the words, you can use re.sub() with a lambda function as the replacement:

>>> regex.sub(lambda x:x.group(1)+' '+' '.join([i+'_NEG' for i in x.group(2).split()])+x.group(3) ,s)
"I don't like_NEG you_NEG at_NEG all_NEG; I should not let_NEG you_NEG know_NEG my_NEG happiest_NEG secret_NEG."

Note that to keep the punctuation you need to put it in a capture group too. Then you can append it to the end of each edited clause in the replacement function passed to re.sub().
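
Putting the pieces together, the following is a sketch of a reusable Python 3 function rather than a definitive implementation. The name mark_negation, the space_before_punct flag and the shortened negator list are assumptions made for this example; the flag exists only to reproduce the spacing shown in the question's expected output.

```python
import re

# Abbreviated negator list for the sketch; extend it as needed.
NEGATION = re.compile(
    r"(\b(?:never|no|nothing|not|dont|cant)\b|\b\w+n't\b)"  # 1: the negator
    r"([^.:;!?]*)"                                           # 2: words in scope
    r"([.:;!?])"                                             # 3: closing punctuation
)

def mark_negation(text, space_before_punct=False):
    """Append _NEG to every word between a negator and the clause end."""
    def tag(match):
        negator, scope, punct = match.groups()
        tagged = " ".join(word + "_NEG" for word in scope.split())
        # Skip the separating space when nothing follows the negator.
        middle = " " + tagged if tagged else ""
        sep = " " if space_before_punct else ""
        return negator + middle + sep + punct
    return NEGATION.sub(tag, text)

sentence = "I don't like you at all; I should not let you know my happiest secret."
result = mark_negation(sentence)
spaced = mark_negation(sentence, space_before_punct=True)
print(result)
print(spaced)
```

Because the pattern requires a closing punctuation mark, a negated clause that runs to the end of the text without one is left untouched in this sketch.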

Mazdak answered Sep 26 '22 19:09
