I have a very big list of words (around 200k):
["cat", "the dog", "elephant", "the angry tiger"]
I built this regex with fuzzy matching (at most two errors per word):
regex = "(cat){e<3}|(the dog){e<3}|(elephant){e<3}|(the angry tiger){e<3}"
I have input sentences:
sentence1 = "The doog is running in the field"
sentence2 = "The elephent and the kat"
...
What I want to get is this:
res1 = ["the dog"]
res2 = ["elephant", "cat"]
I tried this for example:
re.findall(regex, sentence2, flags=re.IGNORECASE|re.UNICODE)
But this returns the misspelled text as it appears in the sentence:
["elephent", "kat"]
Any idea how to get the corrected words instead? What I want is to know which capturing group (that is, which word from my list) produced each match, but I am struggling to do so.
Maybe I'm not doing this right, and maybe regex is not the right approach, but checking "if item in list" inside a for loop takes far too long to execute.
It can be done by manually constructing the regex and naming the groups:
import regex as re
a = ["cat", "the dog", "elephant", "the angry tiger"]
a_dict = { 'g%d' % (i):item for i,item in enumerate(a) }
regex = "|".join([ r"\b(?<g%d>(%s){e<3})\b" % (i,item) for i,item in enumerate(a) ])
sentence1 = "The doog is running in the field"
sentence2 = "The elephent and the kat"
for match in re.finditer(regex, sentence2, flags=re.IGNORECASE|re.UNICODE):
    # Only the group belonging to the alternative that matched is not None.
    for key, value in match.groupdict().items():
        if value is not None:
            print("%s: %s" % (a_dict.get(key), value))
elephant: elephent
cat: kat
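If the third-party regex package is not available, the result can also be produced directly in the shape the question asks for (a list of corrected entries) with a plain edit-distance comparison. This is a different technique from the named-group regex above, shown only as a minimal sketch: the names edit_distance, find_corrected and max_errors are hypothetical, max_errors=2 is assumed to mirror {e<3}, and every window is compared against the whole vocabulary, so a 200k-word list would likely need an index on top of this.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = cur
    return prev[-1]


def find_corrected(words, sentence, max_errors=2):
    """Return entries of `words` that occur in `sentence` within max_errors edits."""
    tokens = re.findall(r"\w+", sentence.lower())

    # Group the vocabulary by word count so that each window of n tokens
    # is only compared against entries that also contain n words.
    by_len = {}
    for w in words:
        by_len.setdefault(len(w.split()), []).append(w)

    found = []
    i = 0
    while i < len(tokens):
        best = None  # (distance, entry, number of tokens consumed)
        for n in sorted(by_len, reverse=True):  # try longer phrases first
            window_tokens = tokens[i:i + n]
            if len(window_tokens) < n:
                continue  # not enough tokens left for an n-word entry
            window = " ".join(window_tokens)
            for w in by_len[n]:
                d = edit_distance(window, w.lower())
                if d <= max_errors and (best is None or d < best[0]):
                    best = (d, w, n)
            if best is not None:
                break
        if best is not None:
            found.append(best[1])
            i += best[2]  # skip the tokens covered by the match
        else:
            i += 1
    return found


words = ["cat", "the dog", "elephant", "the angry tiger"]
res1 = find_corrected(words, "The doog is running in the field")
res2 = find_corrected(words, "The elephent and the kat")
print(res1)  # ['the dog']
print(res2)  # ['elephant', 'cat']
```

Because the function returns the list entry rather than the matched text, no group-to-word lookup is needed afterwards.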