 

Most efficient way for a lookup/search in a huge list (python)

I just parsed a big file and created a list containing 42,000 strings/words. I want to query this list to check whether a given word/string belongs to it. So my question is:

What is the most efficient way for such a lookup?

A first approach is to sort the list (list.sort()) and then just use

>>> if word in list: print 'word'

which is really trivial, and I am sure there is a better way to do it. My goal is a fast lookup that finds whether a given string is in the list or not. If you have ideas for another data structure, they are welcome; for now, though, I want to avoid more sophisticated data structures like tries. I am interested in hearing ideas (or tricks) for fast lookups, or any other Python library methods that might do the search faster than a simple in.

I would also like to know the index of the found item.

asked Apr 23 '10 by user229269

People also ask

Which data structure gives the fastest lookup speed when used to store a large volume of data Python?

Space-time tradeoff. The fastest way to repeatedly look up data with millions of entries in Python is to use dictionaries. Dictionaries are Python's built-in mapping type, and are therefore highly optimized.

Are lookups faster with dictionaries or lists in Python?

It is more efficient to use dictionaries for the lookup of elements, as a dictionary lookup is faster than a list lookup and takes less time to traverse. Moreover, lists keep the order of their elements, while dictionaries historically did not (although as of Python 3.7 dictionaries preserve insertion order). So it is wise to use a list when you are concerned with the order of the data elements.

How fast is Python dictionary lookup?

Speed. Lookups in lists are O(n); lookups in dictionaries are amortized O(1), with regard to the number of items in the data structure.
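A small sketch of that difference, using the standard timeit module (the sizes and names here are illustrative assumptions, not from the question):

```python
import timeit

# Membership in a list scans elements one by one (O(n));
# membership in a set/dict hashes the key (amortized O(1)).
words = [str(i) for i in range(100000)]
word_set = set(words)

# '99999' is the last list element, i.e. the worst case for a linear scan.
list_time = timeit.timeit(lambda: '99999' in words, number=100)
set_time = timeit.timeit(lambda: '99999' in word_set, number=100)
# On typical hardware, set_time is orders of magnitude smaller than list_time.
```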

What is a fast way of finding the difference between 2 Python lists with 10 million items in each?

Use set.difference() to find the difference between two lists in Python.
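For example (a minimal sketch with made-up data):

```python
a = [1, 2, 3, 4, 5]
b = [4, 5, 6]

# Elements of a that are not in b; runs in roughly O(len(a) + len(b)).
diff = set(a).difference(b)
print(diff)  # {1, 2, 3}
```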




2 Answers

Don't create a list, create a set. It does lookups in constant time.
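A minimal sketch of this approach (parsed_words is a hypothetical stand-in for the 42,000 parsed strings):

```python
# Build a set while parsing instead of collecting into a list.
parsed_words = ['apple', 'banana', 'cherry']  # stand-in sample data
word_set = set(parsed_words)

print('banana' in word_set)  # True, average O(1) regardless of size
print('durian' in word_set)  # False
```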

If you don't want the memory overhead of a set then keep a sorted list and search through it with the bisect module.

from bisect import bisect_left

def bi_contains(lst, item):
    """ efficient `item in lst` for sorted lists """
    # if item is larger than the last, it's not in the list, but the bisect would
    # find `len(lst)` as the index to insert, so check that first. Else, if the
    # item is in the list then it has to be at index bisect_left(lst, item)
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
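A self-contained usage sketch (the word list here is made up); note that bisect_left also yields the index of a found item, which the question asked about:

```python
from bisect import bisect_left

def bi_contains(lst, item):
    """Membership test on a sorted list via binary search."""
    return bool(lst) and item <= lst[-1] and lst[bisect_left(lst, item)] == item

words = sorted(['banana', 'apple', 'cherry', 'apple'])
# words is now ['apple', 'apple', 'banana', 'cherry']

print(bi_contains(words, 'cherry'))  # True
print(bi_contains(words, 'durian'))  # False

# bisect_left gives the index of the leftmost occurrence of a found item:
print(bisect_left(words, 'banana'))  # 2
```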
answered Sep 20 '22 by Jochen Ritzel


A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.

Obviously, adding new words to a set removes duplicates on the fly, at no additional cost in CPU time or your thinking time. If you try that with a list, it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... to use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.
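The contrast above in a short sketch (sample data invented for illustration):

```python
words = ['the', 'cat', 'the', 'mat', 'cat']

# With a set, duplicates disappear as you add: O(N) total.
seen = set()
for word in words:
    seen.add(word)
print(sorted(seen))  # ['cat', 'mat', 'the']

# The list-based equivalent does a linear scan per insertion: O(N**2) total.
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
print(unique)  # ['the', 'cat', 'mat']
```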

answered Sep 20 '22 by John Machin