
Get regex group with fuzziness

I have a very big list of words (around 200k):

["cat", "the dog", "elephant", "the angry tiger"]

I built this regex from it, with fuzzy matching enabled for each word:

regex = "(cat){e<3}|(the dog){e<3}|(elephant){e<3}|(the angry tiger){e<3}"

I have input sentences:

sentence1 = "The doog is running in the field"
sentence2 = "The elephent and the kat"
...

What I want to get is this:

res1 = ["the dog"]
res2 = ["elephant", "cat"]

I tried this, for example:

re.findall(regex, sentence2, flags=re.IGNORECASE|re.UNICODE)

But this returns the text as it was matched, not the corrected words:

["elephent", "kat"]

Any idea how to get the corrected words instead? What I want is the capturing group (the original list entry) behind each match, but I am struggling to get it.

Maybe regex is not the right approach here, but checking "if item in list" inside a for loop takes far too long to run.

Mohamed AL ANI asked Apr 24 '18 12:04


1 Answer

It can be done by manually constructing the regex and naming the groups:

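import regex as re  # third-party "regex" module; the stdlib "re" has no fuzzy matching

a = ["cat", "the dog", "elephant", "the angry tiger"]
# Map each group name (g0, g1, ...) back to the original word.
a_dict = {'g%d' % i: item for i, item in enumerate(a)}

# One named group per word, each allowing fewer than 3 errors.
regex = "|".join([r"\b(?<g%d>(%s){e<3})\b" % (i, item) for i, item in enumerate(a)])

sentence1 = "The doog is running in the field"
sentence2 = "The elephent and the kat"

for match in re.finditer(regex, sentence2, flags=re.IGNORECASE|re.UNICODE):
    # Only the group of the alternative that actually matched is not None.
    for key, value in match.groupdict().items():
        if value is not None:
            print("%s: %s" % (a_dict.get(key), value))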

Output:

elephant:  elephent
cat:  kat
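
One caveat with the pattern built above: the words are interpolated into the regex unescaped, so an entry containing a character such as "." or "+" would be read as regex syntax rather than literal text. A minimal sketch of a safer builder follows. The helper name build_fuzzy_pattern and its max_errors parameter are made up for illustration and are not part of the regex module. It uses only the stdlib re.escape to build the pattern string, and writes the groups as (?P<name>...), which the regex module accepts alongside (?<name>...).

```python
import re  # stdlib; used here only for re.escape


def build_fuzzy_pattern(words, max_errors=2):
    """Return (pattern, name_to_word) for the third-party `regex` module.

    Hypothetical helper: each word gets its own named group g0, g1, ...
    and may match with at most `max_errors` edits ({e<=2} equals {e<3}).
    """
    name_to_word = {}
    alternatives = []
    for i, word in enumerate(words):
        name = "g%d" % i
        name_to_word[name] = word
        # re.escape keeps characters like '.' or '+' literal inside the group.
        alternatives.append(
            r"\b(?P<%s>(%s){e<=%d})\b" % (name, re.escape(word), max_errors)
        )
    return "|".join(alternatives), name_to_word


pattern, name_to_word = build_fuzzy_pattern(["cat", "mr. fox"])
print(pattern)
print(name_to_word)
```

The returned pattern is meant to be passed to finditer from the regex module exactly as in the loop above, with name_to_word taking the place of a_dict when looking up the keys of match.groupdict().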
wolfrevokcats answered Nov 17 '22 17:11