I have a file with ~10000 lines, one entry per line. I need to process the file in batches (small chunks).
file = open("data.txt", "r")
data = file.readlines()
file.close()

total_count = len(data)  # roughly 10000 entries, possibly fewer
max_batch = 50  # loop through 'data' with at most 50 entries per iteration

for i in range(total_count):
    batch = data[i:i+50]  # first 50 entries
    result = process_data(batch)  # some time-consuming processing on 50 entries
    if result == True:
        # add to DB that these 50 entries were processed successfully!
        pass
    else:
        return 0  # quit the operation
        # later, start again from the point where it failed,
        # say the 51st or 2560th or 9950th entry
What should I do here so that the next iteration picks entries 51 to 100, and so on?

If the operation fails and breaks partway through, I need to restart the loop from the batch where it failed (based on the DB entry).

I'm not able to work out the proper logic. Should I keep two lists, or something else?
l = [1,2,3,4,5,6,7,8,9,10]
batch_size = 3
for i in range(0, len(l), batch_size):
    print(l[i:i+batch_size])
    # more logic here
>>> [1,2,3]
>>> [4,5,6]
>>> [7,8,9]
>>> [10]
I think this is the most straightforward, readable approach. If you need to retry a certain batch, you can retry inside the loop (serially), or you can open a thread per batch, depending on the application...
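To also cover the resume-after-failure part of the question, here is a minimal sketch built on the same stepped `range` loop. It assumes a hypothetical `process_data(batch)` function (here a placeholder that always succeeds) and uses a checkpoint file as a stand-in for the DB record of the last successfully processed batch:

```python
import os

CHECKPOINT = "checkpoint.txt"  # hypothetical stand-in for the DB record

def load_checkpoint():
    # Return the index of the first unprocessed entry (0 if starting fresh).
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or 0)
    return 0

def save_checkpoint(index):
    # Record where the next run should resume.
    with open(CHECKPOINT, "w") as f:
        f.write(str(index))

def process_data(batch):
    # Placeholder: pretend every batch succeeds.
    return True

def run(data, batch_size=50):
    start = load_checkpoint()
    for i in range(start, len(data), batch_size):
        batch = data[i:i + batch_size]
        if process_data(batch):
            save_checkpoint(i + batch_size)  # next run resumes after this batch
        else:
            return False  # quit; the next run restarts at the saved index
    return True
```

On failure the function simply returns; the next call to `run` reads the checkpoint and continues from the failed batch, so no second list is needed. In the real application you would replace the file read/write with a DB query and update.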