[newbie question]
Hi,
I'm working on a huge text file which is well over 30GB.
I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for" my computer crashes and displays blue screen after about 10% of processing data.
Im currently using this:
f = open(file_path,'r')
for one_line in f.readlines():
do_some_processing(one_line)
f.close()
Also how can I show overall progress of how much data has been crunched so far ?
Thank you all very much.
File handles are iterable, and you should probably use a context manager. Try this:
with open(file_path, 'r') as fh:
for line in fh:
process(line)
That might be enough.
I use a function like this for a similiar problem. You can wrap up any iterable with it.
Change this
for one_line in f.readlines():
You just need to change your code to
# don't use readlines, it creates a big list of all data in memory rather than
# iterating one line at a time.
for one_line in in progress_meter(f, 10000):
You might want to pick a smaller or larger value depending on how much time you want to waste printing status messages.
def progress_meter(iterable, chunksize):
""" Prints progress through iterable at chunksize intervals."""
scan_start = time.time()
since_last = time.time()
for idx, val in enumerate(iterable):
if idx % chunksize == 0 and idx > 0:
print idx
print 'avg rate', idx / (time.time() - scan_start)
print 'inst rate', chunksize / (time.time() - since_last)
since_last = time.time()
print
yield val
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With