
Python searching a large list speed

I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. The approach below works well but is very, very slow, and I was wondering if there is a better way to approach this problem. This is the code:

from difflib import SequenceMatcher

headwordList = []  # This is a list of 650,000 words

sentenceList = []
openFile = open("sentences.txt", "r")

for line in openFile:
    sentenceList.append(line)

count = 0

for y in sentenceList:
    if y not in headwordList:
        percentage = 0  # best ratio found so far for this sentence

        for x in headwordList:
            m = SequenceMatcher(None, y.lower(), x)

            if m.ratio() > percentage:
                percentage = m.ratio()
                word = x

        if percentage > 0.86:
            sentenceList[count] = word

    count = count + 1

Thanks for the help; software engineering is not even close to my strong suit. Much appreciated.

English Grad asked Dec 23 '13


1 Answer

Two things that might provide some small help:

1) Use the approach in this SO answer to read through your large file efficiently.
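For illustration, here is a minimal sketch of streaming the file one line at a time instead of building the whole list up front (it assumes the same sentences.txt filename used in the question):

# Iterating over the file object reads one line at a time,
# and the with-block closes the file automatically.
with open("sentences.txt", "r") as openFile:
    for line in openFile:
        sentence = line.strip()
        # ... handle each sentence here ...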

2) Change your code from

for x in headwordList:
    m = SequenceMatcher(None, y.lower(), x)

to

yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, x)

You're converting each sentence to lowercase 650,000 times. There's no need for that.
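Put together, the suggestion applied to the inner loop from the question might look like the sketch below. Caching the ratio in a local variable is an extra assumption on my part, since the original code calls m.ratio() twice per comparison:

yLower = y.lower()   # convert once per sentence, not once per headword
bestRatio = 0
bestWord = None

for x in headwordList:
    ratio = SequenceMatcher(None, yLower, x).ratio()  # compute the ratio only once
    if ratio > bestRatio:
        bestRatio = ratio
        bestWord = x

if bestRatio > 0.86:
    sentenceList[count] = bestWord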

Dillon Welch answered Nov 14 '22