I am doing a text search in a rather big txt file (100k lines, ~7 MB). The text is not that big, but I need to do a lot of searches. I want to look for a target string and return the line where it appears. My text file is formatted so that the target can only appear in one line.
What is the most efficient way? I do a lot of searches, so I want to improve speed. Here is my code right now:
import os

def lookup_line(target):
    # returns the line containing the target, or None if it doesn't exist
    line = None
    dir = os.path.dirname(__file__)
    path = dir + '/file.txt'
    file = open(path, 'r')
    while line == None:
        l = file.readline()
        l = unicode(l, 'utf-8')
        if target in l:
            break
        if l == '':
            break  # happens at end of file, then stop the loop
    line = l
    if line == '':
        line = None  # end of file, nothing has been found
    file.close()
    return line
I use this Python code in a Google App Engine app.
Thanks!
To read large text files in Python, we can use the file object as an iterator and loop over the file line by line. Since the iterator walks the file directly and does not need any additional data structure to hold the data, it keeps memory consumption low.
Method 1: Finding the index of the string in the text file using readline(). In this method, we read the file line by line with readline() and check each line with find(), which returns -1 if the value is not found and otherwise the index of its first occurrence.
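For illustration, here is a minimal sketch of that readline()/find() loop; the function name and the zero-based line counter are my own choices, not part of the original answer:

def find_with_readline(path, target):
    # Read one line at a time; readline() returns '' only at end of file.
    with open(path, 'r', encoding='utf-8') as f:
        lineno = 0
        while True:
            line = f.readline()
            if line == '':
                return None                   # reached EOF, target not found
            if line.find(target) != -1:       # find() returns -1 when absent
                return lineno, line
            lineno += 1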
Read the whole file into a single string, use text.find(target) to get the position pos of the match, then

text.count('\n', 0, pos)

to get the line number. The loop in Python is slow; string searching is very fast. If you need to look for several strings, use regular expressions.
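As a rough sketch of that whole-file approach (assuming the file fits comfortably in memory, which a ~7 MB file does):

def lookup_line_number(path, target):
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()                 # load the whole file as one string
    pos = text.find(target)
    if pos == -1:
        return None                     # target not present
    return text.count('\n', 0, pos)     # newlines before pos = zero-based line number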
If that's not fast enough, use an external program like grep
.
If you are searching the same text file over and over, consider indexing the file. For example, create a dictionary that maps each word to which lines it's on. This will take a while to create, but will then make searches O(1).
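A minimal sketch of such an index, assuming words are separated by whitespace (the split rule and all names here are my assumptions):

from collections import defaultdict

def build_index(path):
    # Map each whitespace-separated word to the zero-based line numbers
    # it appears on.
    index = defaultdict(set)
    with open(path, 'r', encoding='utf-8') as src:
        for lineno, line in enumerate(src):
            for word in line.split():
                index[word].add(lineno)
    return index

# Build once, then each lookup is a single dictionary access:
# index = build_index('file.txt')
# lines = index.get('sometarget')   # set of line numbers, or None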
If you are searching different text files, or can't index the file for some reason, you probably won't get any faster than the KMP algorithm.
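For reference, a minimal pure-Python sketch of KMP; note that in CPython the built-in in operator and str.find() already use a fast substring search implemented in C, so a hand-written KMP like this will usually be slower in practice:

def kmp_find(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    if not pattern:
        return 0
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, falling back through the table on mismatches.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1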
EDIT: The index I described will only work for single-word searches, not multi-word searches. If you want to search for arbitrary multi-word strings, you probably won't be able to index the file this way.
First, don't explicitly decode bytes.
from io import open
Second, consider things like this.
with open(path, 'r', encoding='UTF-8') as src:
    found = None
    for line in src:
        if len(line) == 0:
            break  # happens at end of file, then stop loop
        if target in line:
            found = line
            break
return found
This can be simplified slightly to use return None or return line instead of break. It should run a hair faster, but it's slightly harder to make changes when there are multiple returns.
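That variant would look something like this (a sketch, wrapped in a function so the returns are legal):

def lookup_line(path, target):
    with open(path, 'r', encoding='UTF-8') as src:
        for line in src:
            if target in line:
                return line     # first line containing the target
    return None                 # end of file, nothing found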
How about 10GB/s search speeds? https://www.codeproject.com/Articles/5282980/Fastest-Fulltext-Vector-Scalar-Exact-Searcher
What is the most efficient way?
The most efficient way is to use vector (SIMD) instructions; if those are not available, then the fastest scalar memmem() function you can get. The article above shows both in action. If you need huge text files traversed, the memmem() variant Railgun_NyoTengu(), which is open source and in the public domain, is the way to go.
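Python can't call that vectorized searcher directly, but as a rough illustration of the general idea of searching a large file as raw bytes, here is a sketch using mmap together with the C-implemented find() (a scalar search, not the SIMD one the article describes; all names are my own):

import mmap

def find_line_mmap(path, target):
    # Search the file as raw bytes without reading it all into memory.
    needle = target.encode('utf-8')
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            if pos == -1:
                return None
            # Expand the match to the enclosing line.
            start = mm.rfind(b'\n', 0, pos) + 1
            end = mm.find(b'\n', pos)
            if end == -1:
                end = len(mm)
            return mm[start:end].decode('utf-8')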