I'm working with huge data CSV file. Each file contains milions of record , each record has a key. The records are sorted by thier key. I dont want to go over the whole file when searching for certian data. I've seen this solution : Reading Huge File in Python
But it suggests that you use the same length of lines on the file - which is not supported in my case.
I thought about adding a padding to each line and then keeping a fixed line length , but I'd like to know if there is a better way to do it.
I'm working with python
You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
left, right = 0, bytes - 1
key = None
while key != search and left <= right:
mid = (left + right) / 2
fin.seek(mid)
# now realign to a record
if mid:
fin.readline()
key, value = map(int, fin.readline().split())
if search > key:
left = mid + 1
else:
right = mid - 1
if key != search:
value = None # for when search key is not found
search.result = value # store the result of the search
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With