I am doing a text search in a rather big txt file (100k lines, ~7 MB). The text is not that big, but I need to do a lot of searches. I want to look for a target string and return the line where it appears. My text file is formatted so that the target can only appear in one line.
What is the most efficient way? I do a lot of searches, so I want to improve speed. Here is my code right now:
import os

def lookup_line(target):
    # returns the line containing the target, or None if it doesn't exist
    line = None
    dir = os.path.dirname(__file__)
    path = dir + '/file.txt'
    file = open(path, 'r')
    while line == None:
        l = file.readline()
        l = unicode(l, 'utf-8')
        if target in l:
            break
        if l == '':
            break  # happens at end of file, then stop the loop
    line = l
    if line == '':
        line = None  # end of file, nothing has been found
    file.close()
    return line
I use this Python code in a Google App Engine app.
Thanks!
To read large text files in Python, we can use the file object as an iterator and loop over the file line by line. Since the iterator walks the file directly and does not need any additional data structure to hold the data, it keeps memory consumption low.
Method 1: Finding the index of the string in the text file using readline(). In this method, we read the file line by line with readline() and check each line with find(), which returns -1 if the value is not found and otherwise the index of its first occurrence.
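For illustration, here is a minimal sketch of that readline()/find() loop; the function name and the zero-based line counter are my own choices, not part of the original answer:

def find_with_readline(path, target):
    # Read one line at a time; readline() returns '' only at end of file.
    with open(path, 'r', encoding='utf-8') as f:
        lineno = 0
        while True:
            line = f.readline()
            if line == '':
                return None                   # reached EOF, target not found
            if line.find(target) != -1:       # find() returns -1 when absent
                return lineno, line
            lineno += 1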
Read the whole file into a single string, use text.find(target) to get the position pos of the match, then

text.count('\n', 0, pos)

to get the line number. The loop in Python is slow; string searching is very fast. If you need to look for several strings, use regular expressions.
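As a rough sketch of that whole-file approach (assuming the file fits comfortably in memory, which a ~7 MB file does):

def lookup_line_number(path, target):
    with open(path, 'r', encoding='utf-8') as f:
        text = f.read()                 # load the whole file as one string
    pos = text.find(target)
    if pos == -1:
        return None                     # target not present
    return text.count('\n', 0, pos)     # newlines before pos = zero-based line number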
If that's not fast enough, use an external program like grep
.
If you are searching the same text file over and over, consider indexing the file. For example, create a dictionary that maps each word to which lines it's on. This will take a while to create, but will then make searches O(1).
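A minimal sketch of such an index, assuming words are separated by whitespace (the split rule and all names here are my assumptions):

from collections import defaultdict

def build_index(path):
    # Map each whitespace-separated word to the zero-based line numbers
    # it appears on.
    index = defaultdict(set)
    with open(path, 'r', encoding='utf-8') as src:
        for lineno, line in enumerate(src):
            for word in line.split():
                index[word].add(lineno)
    return index

# Build once, then each lookup is a single dictionary access:
# index = build_index('file.txt')
# lines = index.get('sometarget')   # set of line numbers, or None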
If you are searching different text files, or can't index the file for some reason, you probably won't get any faster than the KMP algorithm.
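For reference, a minimal pure-Python sketch of KMP; note that in CPython the built-in in operator and str.find() already use a fast substring search implemented in C, so a hand-written KMP like this will usually be slower in practice:

def kmp_find(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    if not pattern:
        return 0
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, falling back through the table on mismatches.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1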
EDIT: The index I described will only work for single-word searches, not multi-word searches. If you want to search for arbitrary multi-word strings, you probably won't be able to index the file this way.
First, don't explicitly decode bytes.
from io import open
Second, consider things like this.
with open(path, 'r', encoding='UTF-8') as src:
    found = None
    for line in src:
        if len(line) == 0:
            break  # happens at end of file, then stop loop
        if target in line:
            found = line
            break
return found
This can be simplified slightly to use return None or return line instead of break. It should run a hair faster, but it's slightly harder to make changes when there are multiple returns.
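That variant would look something like this (a sketch, wrapped in a function so the returns are legal):

def lookup_line(path, target):
    with open(path, 'r', encoding='UTF-8') as src:
        for line in src:
            if target in line:
                return line     # first line containing the target
    return None                 # end of file, nothing found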
How about 10GB/s search speeds? https://www.codeproject.com/Articles/5282980/Fastest-Fulltext-Vector-Scalar-Exact-Searcher
What is the most efficient way?
The most efficient way is to use vector (SIMD) instructions; if those are not available, then the fastest scalar memmem() function you can get. The article above shows both in action. If you need huge text files traversed, the memmem() variant Railgun_NyoTengu(), which is open source and in the public domain, is the way to go.
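Python can't call that vectorized searcher directly, but as a rough illustration of the general idea of searching a large file as raw bytes, here is a sketch using mmap together with the C-implemented find() (a scalar search, not the SIMD one the article describes; all names are my own):

import mmap

def find_line_mmap(path, target):
    # Search the file as raw bytes without reading it all into memory.
    needle = target.encode('utf-8')
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            if pos == -1:
                return None
            # Expand the match to the enclosing line.
            start = mm.rfind(b'\n', 0, pos) + 1
            end = mm.find(b'\n', pos)
            if end == -1:
                end = len(mm)
            return mm[start:end].decode('utf-8')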