I'm trying to write a spellchecker module.
It loads a text, builds a dictionary from a 16 MB file, and then checks whether each encountered word is similar to a word in the dictionary (similar = differs by at most two characters); if so, it replaces the word with the form from the dictionary.
Right now I'm using the Levenshtein distance algorithm, and processing a set of 50 words takes 3 minutes...
I'm pretty sure there must be a faster solution. The profiler told me that my app spends more than 80% of its time in the Levenshtein distance function.
Are there any better solutions/algorithms?
Here is my implementation of the algorithm I use:
def levenshteinDistance(s1, s2):
    l_s1 = len(s1)
    l_s2 = len(s2)
    # d[i][j] holds the edit distance between s1[:i] and s2[:j];
    # row 0 and column 0 are the distances from/to the empty string
    d = [[i + j if i == 0 or j == 0 else 0 for j in xrange(l_s2 + 1)]
         for i in xrange(l_s1 + 1)]
    for i in xrange(1, l_s1 + 1):
        for j in xrange(1, l_s2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,              # deletion
                          d[i][j - 1] + 1,              # insertion
                          d[i - 1][j - 1] + cost)       # substitution
    return d[l_s1][l_s2]
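For example, the classic pair gives the expected answer:

print levenshteinDistance("kitten", "sitting")   # 3: k->s, e->i, append g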
I have used Norvig's spell corrector, mentioned in the comments, and it is awesome.
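In case it helps, the core trick there is to generate every string within one edit of the misspelled word and intersect that set with the dictionary, instead of running Levenshtein against every dictionary entry. A rough sketch (the shape of the edits1 function follows his essay; known_corrections is my own helper name, and dictionary is assumed to be a set of correct words):

import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known_corrections(word, dictionary):
    # dictionary is the set of correct words loaded from the 16 MB file
    return edits1(word) & dictionary

For your "up to two characters" requirement you would apply edits1 a second time to each candidate; that is still far cheaper than running Levenshtein between each of your 50 words and the whole dictionary.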
Coming to your problem, however: you have written the dynamic-programming edit distance algorithm, which is a data-parallel algorithm, so on a single shared-memory machine with multiple cores you can exploit all of them. Think of map-reduce, but without the distributed part: just one quad-core machine and shared memory. As step 1, partition your dictionary and give a portion to each worker, which runs the edit distance computation against its portion (similar to a map step); afterwards each worker returns the words it found at an edit distance of at most 2 (similar to a reduce step). This way your program benefits from the multi-core architecture; a rough sketch follows.
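A minimal sketch of that partition-and-merge idea on a single machine, assuming the dictionary is a plain Python list and reusing the levenshteinDistance function from the question. Note that in CPython you need processes (the multiprocessing module) rather than threads for CPU-bound work because of the GIL; names like find_close_words and candidates are mine:

from multiprocessing import Pool

def find_close_words(args):
    # "map" step: scan one chunk of the dictionary for one word
    word, dictionary_chunk = args
    return [w for w in dictionary_chunk if levenshteinDistance(word, w) <= 2]

def candidates(word, dictionary, workers=4):
    chunk_size = len(dictionary) // workers + 1
    chunks = [dictionary[i:i + chunk_size]
              for i in range(0, len(dictionary), chunk_size)]
    pool = Pool(workers)
    try:
        per_chunk = pool.map(find_close_words, [(word, c) for c in chunks])
    finally:
        pool.close()
        pool.join()
    # "reduce" step: merge the per-chunk candidate lists
    return [w for chunk in per_chunk for w in chunk]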
Another thing I can think of is to move the CPU-intensive edit distance computation itself into C, i.e. write it as a Python extension and call it from your Python code.
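If you would rather not write the extension yourself, ready-made C-backed implementations exist; for example, the python-Levenshtein package exposes a distance() function implemented in C, so the hot loop leaves pure Python entirely (whether an extra dependency is acceptable is up to you):

import Levenshtein  # from the python-Levenshtein package; distance() is implemented in C

def close_words(word, dictionary):
    # Same filter as before, but each distance computation runs in C
    return [w for w in dictionary if Levenshtein.distance(word, w) <= 2]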