Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get the most relevant word (spell check) from 'enchant suggest()' in Python

I want to get the most relevant word from enchant suggest(). Is there any better way to do that. I feel my function is not efficient when it comes to checking large set of words in the range of 100k or more.

Problem with enchant suggest():

>>> import enchant
>>> d.suggest("prfomnc")
['prominence', 'performance', 'preform', 'Provence', 'preferment', 'proforma']

My function to get the appropriate word from a set of suggested words:

import enchant, difflib

word="prfomnc"
dict,max = {},0
a = set(d.suggest(word))
for b in a:
    tmp = difflib.SequenceMatcher(None, word, b).ratio();
    dict[tmp] = b
    if tmp > max:
       max = tmp

print dict[max]

Result: performance

Updated:

if I get multiple keys, meaning same difflib ratio() values, I use multi-key dictionary. As explained here: http://code.activestate.com/recipes/440502-a-dictionary-with-multiple-values-for-each-key/

like image 412
Maggie Avatar asked Jul 15 '11 03:07

Maggie


2 Answers

No magic bullet, I'm afraid... a few suggestions however.

I'm guessing that most of the time in the logic is spent in the difflib's SequenceMatcher().ratio() call. This wouldn't be surprising since this method uses a variation on the Rattcliff-Obershelp algorithm which is relatively expensive, CPU-wise (but the metric it produces is rather "on the mark" to locate close matches, and that is probably why you like it).

To be sure, you should profile this logic and confirm that indeed SequenceMatcher() is the hot spot. Maybe Enchant.suggest() is also a bit slow, but there would be little we could do, code-wise, to improve this (configuration-wise, there may be a few options, for eg. doing away with personal dictionary to save the double look-upup and merge etc.).

Assuming that SequenceMatcher() is indeed the culprit, and assuming that you wish to stick with the Ratcliff-Obershelp similarity metric as the way to select the best match, you could do [some of] the following:

  • only compute the SequenceMatcher ratio value for the top (?) 5 items from Enchant.
    After all, Enchant.suggest() returns its suggestions in an ordered fashion with its best guesses first; therefore while based on different heuristics, there's value in the Enchant order as well, the chances of finding a hi-ranking match probably diminish as we move down the list. Also, even though, we may end up ignoring a few such hi-ranking matches, by testing only the top few Enchant suggestions, we somehow combine the "wisdom" found in Enchant's heuristics with these from the Ratcliff-Obershelp metric.
  • stop computing the SequenceMatcher ratio after a certain threshold has been reached
    The idea is similar to the preceding: avoid calling SequenceMatcher once the odds of finding better are getting smaller (and once we have a decent if not best choice in hand)
  • filter out some of the words from Enchant with your own logic.
    The idea is to have a relatively quick/inexpensive test which may tell us that a given word is unlikely to score well on the SequenceMatcher ratio. For example exclude words which do not have at least, say, length of user string minus two characters in common.
    BTW, you can maybe use some of the SequenceMatcher object's [quicker] functions to get some data for the filtering heuristics.
  • use SequenceMatcher *quick_ratio*() function instead
    at least in some cases.
  • only keep the best match, in a string, rather than using a dictionary
    Apparently only the top choice matters, so except for test purposes you may not need the [relatively small] overhead of the dictionary.
  • you may consider writing your own Ratcliff-Obershelp (or similar) method, introducing therein various early exits when the prospect of meeting the current max ratio is small. BEWARE, it would likely be difficult to produce a method as efficient as the C-language one of difflib, your interest in doing this is with the early exits...

HTH, good luck ;-)

like image 173
mjv Avatar answered Nov 02 '22 23:11

mjv


You don't actuall need to keep a dict if you are only interested in the best matches

>>> word="prfomnc"
>>> best_words = []
>>> best_ratio = 0
>>> a = set(d.suggest(word))
>>> for b in a:
...   tmp = difflib.SequenceMatcher(None, word, b).ratio()
...   if tmp > best_ratio:
...     best_words = [b]
...     best_ratio = tmp
...   elif tmp == best_ratio:
...     best_words.append(b)
... 
>>> best_words
['performance']
like image 43
John La Rooy Avatar answered Nov 02 '22 23:11

John La Rooy