 

Binary search over a huge file with unknown line length

I'm working with huge CSV data files. Each file contains millions of records, and each record has a key. The records are sorted by their key. I don't want to scan the whole file when searching for certain data. I've seen this solution: Reading Huge File in Python

But it assumes that every line in the file has the same length, which is not the case for my data.

I thought about padding each line so that every line has a fixed length (a rough sketch of that idea is below), but I'd like to know if there is a better way to do it.

I'm working in Python.
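For reference, here is a minimal sketch of that padding idea, assuming every line has been padded to the same byte length (RECORD_WIDTH, newline included) and starts with an integer key as the first CSV field; the constant and the function name are illustrative, not from the question:

import os

RECORD_WIDTH = 64  # assumed padded length of every line, in bytes, newline included

def find_fixed_width(path, search_key):
    # Hypothetical helper: every record starts at a multiple of RECORD_WIDTH,
    # so we can seek straight to the middle record of the current range.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        n_records = f.tell() // RECORD_WIDTH
        left, right = 0, n_records - 1
        while left <= right:
            mid = (left + right) // 2
            f.seek(mid * RECORD_WIDTH)          # jump directly to record `mid`
            record = f.read(RECORD_WIDTH)
            key = int(record.split(b",")[0])    # key is the first CSV field
            if key == search_key:
                return record.rstrip().decode()
            elif key < search_key:
                left = mid + 1
            else:
                right = mid - 1
    return None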

asked Dec 22 '22 by RanZilber

1 Answer

You don't need fixed-width records, because you don't have to do a record-oriented search. Instead you can do a byte-oriented search: binary-search over byte offsets and realign to the start of the next record after every seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:

file_size = 24935502  # size of the file in bytes (not the number of records)
for search in search_keys:  # search_keys holds the integer keys to look up
  left, right = 0, file_size - 1
  key = None
  while key != search and left <= right:
    mid = (left + right) // 2  # integer division over byte offsets
    fin.seek(mid)              # fin is the already-open data file
    # now realign to the start of the next full record
    if mid:
      fin.readline()           # discard the partial line we landed in
    line = fin.readline()
    if not line:               # seeked into the last line; move left
      right = mid - 1
      continue
    key, value = map(int, line.split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None               # search key is not in the file
  results[search] = value      # results is a dict collecting the answers
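To make that concrete, here is a hedged, self-contained usage sketch of the same byte-oriented idea, wrapped in a function. The function name, the sample file name, and the whitespace-separated "key value" line format are assumptions for illustration, not part of the original answer:

import os

def binary_search_file(path, search_key):
    # Hypothetical helper: binary search over byte offsets in a file of
    # "key value" lines sorted by integer key.
    size = os.path.getsize(path)
    with open(path, "rb") as fin:
        left, right = 0, size - 1
        while left <= right:
            mid = (left + right) // 2
            fin.seek(mid)
            if mid:
                fin.readline()        # discard the partial line we landed in
            line = fin.readline()
            if not line:              # seeked past the last full line
                right = mid - 1
                continue
            key, value = map(int, line.split())
            if key == search_key:
                return value
            elif key < search_key:
                left = mid + 1
            else:
                right = mid - 1
    return None

# Example: data.txt holds lines like "17 42\n", sorted by the first column.
print(binary_search_file("data.txt", 17))   # -> 42, or None if 17 is absent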
answered Dec 26 '22 by Gabe