I want to get the most relevant word from enchant's suggest(). Is there a better way to do that? I feel my function is not efficient when it comes to checking a large set of words, in the range of 100k or more.
Problem with enchant suggest():
>>> import enchant
>>> d = enchant.Dict("en_US")  # assumed: the original omits how d is created
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']
My function to get the appropriate word from a set of suggested words:
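import enchant, difflib

word = "prfomnc"
ratios, best = {}, 0  # renamed from dict/max to avoid shadowing the built-ins
# d is the enchant dictionary object used above
for b in set(d.suggest(word)):
    tmp = difflib.SequenceMatcher(None, word, b).ratio()
    ratios[tmp] = b  # a later word with the same ratio overwrites an earlier one
    if tmp > best:
        best = tmp
print(ratios[best])

Result: performance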
Update: if several suggestions share the same difflib ratio() value, I use a dictionary that holds multiple values per key, as explained here: http://code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/
No magic bullet, I'm afraid... a few suggestions however.
I'm guessing that most of the time in this logic is spent in difflib's SequenceMatcher().ratio() call. This wouldn't be surprising, since this method uses a variation on the Ratcliff-Obershelp algorithm, which is relatively expensive CPU-wise (but the metric it produces is rather "on the mark" for locating close matches, and that is probably why you like it).
To be sure, you should profile this logic and confirm that SequenceMatcher() is indeed the hot spot. Maybe Enchant's suggest() is also a bit slow, but there is little we could do code-wise to improve this (configuration-wise, there may be a few options, e.g. doing away with the personal dictionary to save the double lookup and merge).
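As a rough sketch of how that profiling could be done with the standard library's cProfile: the suggestion list below is hard-coded from the question's output so the snippet runs without enchant, and the names best_match and profile_best_match are made up for illustration. To measure suggest() as well, the real d.suggest(word) call would have to go inside the profiled loop.

```python
import cProfile
import difflib
import io
import pstats

# Hard-coded from the question's output, so the sketch runs without enchant.
SUGGESTIONS = ['prominence', 'performance', 'preform',
               'Provence', 'preferment', 'proforma']


def best_match(word, candidates):
    """Return the candidate with the highest SequenceMatcher ratio."""
    best_word, best_ratio = None, 0.0
    for candidate in candidates:
        ratio = difflib.SequenceMatcher(None, word, candidate).ratio()
        if ratio > best_ratio:
            best_word, best_ratio = candidate, ratio
    return best_word


def profile_best_match(repeats=1000):
    """Profile repeated best_match() calls and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        best_match("prfomnc", SUGGESTIONS)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time and show only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_best_match())
```

If the difflib entries dominate the cumulative column, the matcher is the place to optimize; if not, the time is going elsewhere.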
Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following:
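One option in this spirit, offered as my own sketch under stated assumptions rather than as a definitive recipe: reuse a single SequenceMatcher and skip the expensive ratio() call whenever a cheap upper bound already rules a candidate out. The difflib documentation says the matcher caches information about its second sequence, and that quick_ratio() and real_quick_ratio() return upper bounds on ratio(). The function name best_matches and the hard-coded suggestion list are hypothetical, used so the snippet runs without enchant. Note that ratio() can depend on argument order, so scores may differ slightly from the (word, candidate) order used in the question.

```python
import difflib

# Hard-coded from the question's output, so the sketch runs without enchant.
SUGGESTIONS = ['prominence', 'performance', 'preform',
               'Provence', 'preferment', 'proforma']


def best_matches(word, candidates):
    """Return every candidate tied for the highest similarity to word."""
    matcher = difflib.SequenceMatcher(None)
    # The matcher caches its analysis of seq2, so the fixed word goes there.
    matcher.set_seq2(word)
    best_words, best_ratio = [], 0.0
    for candidate in candidates:
        matcher.set_seq1(candidate)
        # Both quick ratios are upper bounds on ratio(): if either is already
        # below the best score seen so far, this candidate cannot win or tie.
        if (matcher.real_quick_ratio() < best_ratio
                or matcher.quick_ratio() < best_ratio):
            continue
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_words, best_ratio = [candidate], ratio
        elif ratio == best_ratio:
            best_words.append(candidate)
    return best_words


if __name__ == "__main__":
    print(best_matches("prfomnc", SUGGESTIONS))
```

The filters pay off most when a strong candidate shows up early, since every later candidate is then checked against a high bar before the full comparison runs.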
HTH, good luck ;-)
You don't actually need to keep a dict if you are only interested in the best matches
>>> word="prfomnc"
>>> best_words = []
>>> best_ratio = 0
>>> a = set(d.suggest(word))
>>> for b in a:
...     tmp = difflib.SequenceMatcher(None, word, b).ratio()
...     if tmp > best_ratio:
...         best_words = [b]
...         best_ratio = tmp
...     elif tmp == best_ratio:
...         best_words.append(b)
...
>>> best_words
['performance']
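If a single best word is enough, the standard library's difflib.get_close_matches() does much the same ranking in one call. This is a minimal sketch, not something the answers above propose: the wrapper name best_suggestion is hypothetical, the suggestion list is hard-coded from the question so the snippet runs without enchant, and cutoff=0.0 is my choice so that the top-ranked suggestion is always returned, however weak the match.

```python
import difflib

# Hard-coded from the question's output, so the sketch runs without enchant.
SUGGESTIONS = ['prominence', 'performance', 'preform',
               'Provence', 'preferment', 'proforma']


def best_suggestion(word, suggestions):
    """Return the single closest suggestion, or None if there are none."""
    # n=1 keeps only the top-scoring entry; cutoff=0.0 disables the default
    # similarity threshold of 0.6 so weak matches are not dropped.
    matches = difflib.get_close_matches(word, suggestions, n=1, cutoff=0.0)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(best_suggestion("prfomnc", SUGGESTIONS))
```

Unlike the loop above, this returns only one word when several share the top ratio, so keep the explicit loop if ties matter.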