 

Most efficient way for a lookup/search in a huge list (python)

I just parsed a big file and created a list containing 42,000 strings/words. I want to query this list to check whether a given word/string belongs to it. So my question is:

What is the most efficient way for such a lookup?

A first approach is to sort the list (list.sort()) and then just use

>>> if word in list: print 'word'

which is really trivial, and I am sure there is a better way to do it. My goal is a fast lookup that finds whether a given string is in the list or not. If you have ideas for another data structure, they are welcome; for now, though, I want to avoid more sophisticated data structures like tries. I am interested in hearing ideas (or tricks) for fast lookups, or any other Python library methods that might do the search faster than a simple in.

I would also like to know the index of the found item.

asked Apr 23 '10 by user229269

People also ask

Which data structure gives the fastest lookup speed when used to store a large volume of data Python?

Space-time tradeoff. The fastest way to repeatedly look up data with millions of entries in Python is to use dictionaries. Dictionaries are Python's built-in mapping type, and are therefore highly optimized.

Are lookups faster with dictionaries or lists in Python?

It is more efficient to use dictionaries for the lookup of elements, as a dictionary lookup is faster than a list lookup and takes less time to traverse. Moreover, lists keep the order of their elements, while dictionaries historically did not (although as of Python 3.7 dictionaries preserve insertion order). So it is wise to use a list when you are concerned with the order of the data elements.

How fast is Python dictionary lookup?

Speed. Lookups in lists are O(n); lookups in dictionaries are amortized O(1), with regard to the number of items in the data structure.
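A small sketch of that difference, using the standard timeit module (the sizes and names here are illustrative assumptions, not from the question):

```python
import timeit

# Membership in a list scans elements one by one (O(n));
# membership in a set/dict hashes the key (amortized O(1)).
words = [str(i) for i in range(100000)]
word_set = set(words)

# '99999' is the last list element, i.e. the worst case for a linear scan.
list_time = timeit.timeit(lambda: '99999' in words, number=100)
set_time = timeit.timeit(lambda: '99999' in word_set, number=100)
# On typical hardware, set_time is orders of magnitude smaller than list_time.
```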

What is a fast way of finding the difference between 2 Python lists with 10 million items in each?

Use set.difference() to find the difference between two lists in Python.
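For example (a minimal sketch with made-up data):

```python
a = [1, 2, 3, 4, 5]
b = [4, 5, 6]

# Elements of a that are not in b; runs in roughly O(len(a) + len(b)).
diff = set(a).difference(b)
print(diff)  # {1, 2, 3}
```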




2 Answers

Don't create a list, create a set. It does lookups in constant time.
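A minimal sketch of this approach (parsed_words is a hypothetical stand-in for the 42,000 parsed strings):

```python
# Build a set while parsing instead of collecting into a list.
parsed_words = ['apple', 'banana', 'cherry']  # stand-in sample data
word_set = set(parsed_words)

print('banana' in word_set)  # True, average O(1) regardless of size
print('durian' in word_set)  # False
```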

If you don't want the memory overhead of a set then keep a sorted list and search through it with the bisect module.

from bisect import bisect_left

def bi_contains(lst, item):
    """ efficient `item in lst` for sorted lists """
    # if item is larger than the last, it's not in the list, but the bisect would
    # find `len(lst)` as the index to insert, so check that first. Else, if the
    # item is in the list then it has to be at index bisect_left(lst, item)
    return (item <= lst[-1]) and (lst[bisect_left(lst, item)] == item)
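A self-contained usage sketch (the word list here is made up); note that bisect_left also yields the index of a found item, which the question asked about:

```python
from bisect import bisect_left

def bi_contains(lst, item):
    """Membership test on a sorted list via binary search."""
    return bool(lst) and item <= lst[-1] and lst[bisect_left(lst, item)] == item

words = sorted(['banana', 'apple', 'cherry', 'apple'])
# words is now ['apple', 'apple', 'banana', 'cherry']

print(bi_contains(words, 'cherry'))  # True
print(bi_contains(words, 'durian'))  # False

# bisect_left gives the index of the leftmost occurrence of a found item:
print(bisect_left(words, 'banana'))  # 2
```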
answered Sep 20 '22 by Jochen Ritzel


A point about sets versus lists that hasn't been considered: in "parsing a big file" one would expect to need to handle duplicate words/strings. You haven't mentioned this at all.

Obviously, adding new words to a set removes duplicates on the fly, at no additional cost in CPU time or your thinking time. If you try that with a list, it ends up O(N**2). If you append everything to a list and remove duplicates at the end, the smartest way of doing that is ... drum roll ... to use a set, and the (small) memory advantage of a list is likely to be overwhelmed by the duplicates.
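The contrast above in a short sketch (sample data invented for illustration):

```python
words = ['the', 'cat', 'the', 'mat', 'cat']

# With a set, duplicates disappear as you add: O(N) total.
seen = set()
for word in words:
    seen.add(word)
print(sorted(seen))  # ['cat', 'mat', 'the']

# The list-based equivalent does a linear scan per insertion: O(N**2) total.
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
print(unique)  # ['the', 'cat', 'mat']
```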

answered Sep 20 '22 by John Machin