Searching for a string in a large text file - profiling various methods in python

Tags: performance, python, search, profiling, large-files

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

I have a 600 MB file with 6 million lines of strings (category paths from the DMOZ project).

The entry on each line is unique.

I want to load the file once and keep searching for matches in the data.

For each method I tried, I list below the time taken to load the file, the search time for a negative match, and the memory usage shown in the task manager.

1) set:
     (i)  data   = set(f.read().splitlines())
     (ii) result = search_str in data

Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB

2) list:
     (i)  data   = f.read().splitlines()
     (ii) result = search_str in data

Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB

3) mmap:
     (i)  data   = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
     (ii) result = data.find(search_str)

Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA

4) Hash lookup (using code from @alienhard below):

Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB

5) File search (using code from @EOL below):
     with open('input.txt') as f:
         print(search_str in f)  # search_str must end with the line ending ('\n' or '\r\n') used in the file

Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA

6) sqlite (with primary index on url):

Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
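Methods 1 to 3 can be reproduced with a small harness along these lines. This is a minimal sketch, not the code that produced the numbers above: `time_call` and `profile_methods` are hypothetical helper names, the file is assumed to be UTF-8, and it targets Python 3, where `mmap.find` takes bytes. It measures load and search time only, not memory usage.

```python
import mmap
import time


def time_call(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def profile_methods(path, search_str):
    """Return {method: (found, load_seconds, search_seconds)} for set, list, mmap."""
    timings = {}

    def load_set():
        with open(path, encoding="utf-8") as f:
            return set(f.read().splitlines())

    def load_list():
        with open(path, encoding="utf-8") as f:
            return f.read().splitlines()

    # Methods 1 and 2: load everything into memory, then use `in`.
    for name, loader in (("set", load_set), ("list", load_list)):
        data, load_time = time_call(loader)
        found, search_time = time_call(lambda: search_str in data)
        timings[name] = (found, load_time, search_time)

    # Method 3: map the file and scan it; find() returns -1 on a miss.
    # Note that find() matches substrings, not whole lines.
    needle = search_str.encode("utf-8")
    with open(path, "rb") as f:
        mm, load_time = time_call(
            lambda: mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        )
        try:
            pos, search_time = time_call(lambda: mm.find(needle))
        finally:
            mm.close()
    timings["mmap"] = (pos != -1, load_time, search_time)
    return timings
```

Calling `profile_methods('input.txt', 'no/such/path')` would give one timing tuple per method for a negative match; on a small file the differences are too small to be meaningful.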

For my use case, it seems like going with the set is the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions:

Is there a better alternative, e.g. sqlite?

Are there ways to improve the search time using mmap? I have a 64-bit setup. [edit] e.g. Bloom filters

As the file size grows to a couple of GB, is there any way I can keep using 'set', e.g. by splitting it into batches?
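To make the Bloom filter idea concrete, here is a minimal sketch. The `BloomFilter` class, its parameters and the SHA-256 based double hashing are illustrative assumptions, not code from this thread. A Bloom filter never gives a false negative, so a "no" can skip the slow scan entirely, while a "yes" still has to be confirmed against the real data.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A plain Bloom filter like this one does not support removing entries, so it only helps as a pre-check in front of another store when values are added and removed.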

[edit 1] P.S. I need to search frequently, add/remove values and cannot use a hash table alone because I need to retrieve the modified values later.

Any comments/suggestions are welcome!

[edit 2] Update with results from methods suggested in answers

[edit 3] Update with sqlite results

Solution: Based on all the profiling and feedback, I think I'll go with sqlite, with method 4 as the second alternative. One downside of sqlite is that the database is more than double the size of the original csv file with urls. This is due to the primary index on url.

992

asked Jun 02 '11 19:06

user

1 Answer

Variant 1 is great if you need to run many searches one after another. Since a set is internally a hash table, lookups are fast. It takes time to build, though, and only works well if your data fit into RAM.

Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches enough data. You do a full scan, though, so it can become rather slow once your data stop fitting into RAM.
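One detail of the mmap scan is that `find` matches substrings, so a search for part of a line also reports a hit. The sketch below shows one way to restrict the scan to whole lines. `line_in_mmap` is a hypothetical helper, and it assumes a UTF-8 file with `\n` or `\r\n` line endings.

```python
import mmap


def line_in_mmap(mm, line):
    """Return True if `line` occurs as a complete line in the mapped file."""
    needle = line.encode("utf-8")
    size = len(mm)
    pos = mm.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # A whole-line match starts at the file start or right after '\n'...
        starts_line = pos == 0 or mm[pos - 1:pos] == b"\n"
        # ...and ends at the file end or right before a line terminator.
        ends_line = end == size or mm[end:end + 1] in (b"\n", b"\r")
        if starts_line and ends_line:
            return True
        pos = mm.find(needle, pos + 1)
    return False
```

This is still a full scan of the file, so it does not change the cost described above; it only avoids false hits on partial lines.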

SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice b-tree for you. The tree can fit into RAM even if the data don't (it's a bit like what @alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database. I doubt that memory-based SQLite will beat Variant 1 significantly.
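A disk-based setup along these lines might look as follows. This is a sketch under assumptions: the table name `paths` and the helpers `build_db` and `contains` are made up for illustration, and only the `url` column with a primary key mirrors what the question describes.

```python
import sqlite3


def build_db(db_path, lines):
    """Load strings into a table; the primary key gives SQLite a b-tree index."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS paths (url TEXT PRIMARY KEY)")
    with conn:  # commit the whole bulk insert as a single transaction
        conn.executemany(
            "INSERT OR IGNORE INTO paths (url) VALUES (?)",
            ((line,) for line in lines),
        )
    return conn


def contains(conn, url):
    """Exact-match lookup answered from the index rather than a full scan."""
    row = conn.execute(
        "SELECT 1 FROM paths WHERE url = ? LIMIT 1", (url,)
    ).fetchone()
    return row is not None
```

Passing a file path as `db_path` creates the on-disk database; passing `':memory:'` gives the memory-based variant mentioned above. Adding or removing values later is an ordinary `INSERT` or `DELETE` on the same table.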

199

answered Oct 07 '22 05:10

9000
