The text file contains two columns- index number(5 spaces) and characters(30 spaces). It is arranged in lexicographic order. I want to perform binary search to search for the keyword.
I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)
Installation: pip install binary_file_search
Example file:
1,one
2,two_a
2,two_b
3,three
Usage:
from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
# assert bfs.is_file_sorted() # test if the file is sorted.
print(bfs.search(2))
Result: [[2, 'two_a'], [2, 'two_b']]
Here's an interesting way to do it with Python's built-in bisect module.
import bisect
import os
class Query(object):
def __init__(self, query, index=5):
self.query = query
self.index = index
def __lt__(self, comparable):
return self.query < comparable[self.index:]
class FileSearcher(object):
def __init__(self, file_pointer, record_size=35):
self.file_pointer = file_pointer
self.file_pointer.seek(0, os.SEEK_END)
self.record_size = record_size + len(os.linesep)
self.num_bytes = self.file_pointer.tell()
self.file_size = (self.num_bytes // self.record_size)
def __len__(self):
return self.file_size
def __getitem__(self, item):
self.file_pointer.seek(item * self.record_size)
return self.file_pointer.read(self.record_size)
if __name__ == '__main__':
with open('data.dat') as file_to_search:
query = raw_input('Query: ')
wrapped_query = Query(query)
searchable_file = FileSearcher(file_to_search)
print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)
Do you need do do a binary search? If not, try converting your flatfile into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:
import cdb
# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
index, word = line.split()
db.add(word, index)
db.finish()
In a separate script, run queries against it:
import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345
If you need to find a single keyword in a file:
line_with_keyword = next((line for line in open('file') if keyword in line),None)
if line_with_keyword is not None:
print line_with_keyword # found
To find multiple keywords you could use set()
as @kriegar suggested:
def extract_keyword(line):
return line[5:35] # assuming keyword starts on 6 position and has length 30
with open('file') as f:
keywords = set(extract_keyword(line) for line in f) # O(n) creation
if keyword in keywords: # O(1) search
print(keyword)
You could use dict()
above instead of set()
to preserve index
information.
Here's how you could do a binary search on a text file:
import bisect
lines = open('file').readlines() # O(n) list creation
keywords = map(extract_keyword, lines)
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
print(lines[i]) # found
There is no advantage compared to the set()
variant.
Note: all variants except the first one load the whole file in memory. FileSearcher()
suggested by @Mahmoud Abdelkader don't require to load the whole file in memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With