I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file that has 650,000 words in it. The approach below works really well but is very slow, and I was wondering if there is a better way to approach this problem. This is the code:
from difflib import SequenceMatcher

headWordList = []  # This is a list of 650,000 words

# Read the words to check, one per line
sentenceList = []
openFile = open("sentences.txt", "r")
for line in openFile:
    sentenceList.append(line.strip())

count = 0
for y in sentenceList:
    if y not in headWordList:
        # Find the dictionary word with the highest similarity ratio
        percentage = 0
        word = None
        for x in headWordList:
            ratio = SequenceMatcher(None, y.lower(), x).ratio()
            if ratio > percentage:
                percentage = ratio
                word = x
        # Only replace the word if the best match is close enough
        if percentage > 0.86:
            sentenceList[count] = word
    count = count + 1
Thanks for the help; software engineering is not even close to my strong suit. Much appreciated.
Lists can be faster for plain iteration, but for checking whether an element is in a collection, sets win: they are implemented with hash tables, so Python does not have to scan the whole collection, and the average time complexity of a membership test is O(1) rather than the O(n) of a list. (NumPy arrays can likewise be much faster than plain Python lists for bulk numeric work, but for this problem the membership test is the part that matters.)
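A minimal sketch of that point, reusing headWordList and sentenceList from the question (headWordSet is a new name introduced here): build the set once, and only words that fail the fast lookup fall through to the expensive SequenceMatcher loop.

# Build the lookup set once; each membership test is then O(1) on average
# instead of a linear scan over the 650,000-entry list.
headWordSet = set(headWordList)

for y in sentenceList:
    word = y.strip().lower()
    if word in headWordSet:   # fast hash lookup
        continue              # already a valid dictionary word, nothing to fix
    # the original SequenceMatcher loop would go here, but now it only
    # runs for the unknown words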
Two things that might provide some small help:
1) Use the approach in this SO answer to read through your large file efficiently.
2) Change your code from
for x in headWordList:
    m = SequenceMatcher(None, y.lower(), x)
to
yLower = y.lower()
for x in headWordList:
    m = SequenceMatcher(None, yLower, x)
You're converting each sentence to lower 650,000 times. No need for that.
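Putting the set lookup and the hoisted lower() together, here is a minimal sketch of the whole loop under the question's assumptions (sentences.txt holds one word per line, headWordList holds 650,000 lowercase words; headWordSet is a new name). It uses difflib.get_close_matches, which runs the same SequenceMatcher ratio search and keeps only candidates above a cutoff, so the question's 0.86 threshold maps onto the cutoff parameter:

from difflib import get_close_matches

headWordSet = set(headWordList)              # fast membership tests

with open("sentences.txt", "r") as openFile:
    sentenceList = [line.strip() for line in openFile]

for count, y in enumerate(sentenceList):
    yLower = y.lower()                       # lower-case once per word, not once per comparison
    if yLower in headWordSet:
        continue                             # already a known dictionary word
    # get_close_matches returns at most one headword scoring above the cutoff
    matches = get_close_matches(yLower, headWordList, n=1, cutoff=0.86)
    if matches:
        sentenceList[count] = matches[0]

As a bonus, get_close_matches checks the cheaper real_quick_ratio() and quick_ratio() upper bounds before computing the full ratio(), so it can discard obviously poor candidates early, which usually saves more time than the lower() hoisting alone.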