
What is the optimal way to process a very large (over 30GB) text file and also show progress

[newbie question]

Hi,

I'm working with a huge text file that is well over 30 GB.

I have to do some processing on each line and then write it to a database in JSON format. When I read the file and loop over it with a "for" loop, my computer crashes to a blue screen after processing about 10% of the data.

I'm currently using this:

f = open(file_path,'r')
for one_line in f.readlines():
    do_some_processing(one_line)
f.close()

Also, how can I show the overall progress of how much data has been processed so far?

Thank you all very much.

asked Sep 06 '25 by Raj K

2 Answers

File handles are iterable, and you should probably use a context manager. Try this:

with open(file_path, 'r') as fh:
    # the file object yields one line at a time, so the whole
    # file never has to fit in memory
    for line in fh:
        process(line)

That might be enough.
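This doesn't cover the progress part of the question. One way to add it (just a sketch; file_path and process() stand in for your own path and per-line work) is to read in binary mode, count the bytes consumed, and compare against the file size:

import os

def process(line):
    pass  # placeholder: your per-line work goes here

file_path = 'huge_file.txt'         # placeholder path
total = os.path.getsize(file_path)  # total file size in bytes
done = 0

# Binary mode makes len(line) an exact byte count; decode inside process()
# if you need str rather than bytes.
with open(file_path, 'rb') as fh:
    for i, line in enumerate(fh, 1):
        done += len(line)
        process(line)
        if i % 100000 == 0:         # report every 100k lines
            print('%.1f%% complete' % (100.0 * done / total))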

answered Sep 09 '25 by g.d.d.c


I use a function like this for a similar problem. You can wrap any iterable with it.

Change this:

for one_line in f.readlines():

to this:

# Don't use readlines(): it builds a big list of all the data in memory
# rather than iterating one line at a time.
for one_line in progress_meter(f, 10000):

You might want to pick a smaller or larger chunksize depending on how much time you want to spend printing status messages.

import time

def progress_meter(iterable, chunksize):
    """Prints progress through iterable at chunksize intervals."""
    scan_start = time.time()
    since_last = time.time()
    for idx, val in enumerate(iterable):
        if idx % chunksize == 0 and idx > 0:
            print(idx)
            print('avg rate', idx / (time.time() - scan_start))
            print('inst rate', chunksize / (time.time() - since_last))
            since_last = time.time()
            print()
        yield val
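
For completeness, a sketch of how this might be wired up with the question's code (using the context-manager form from the other answer; file_path and do_some_processing come from the question):

with open(file_path, 'r') as f:
    for one_line in progress_meter(f, 10000):
        do_some_processing(one_line)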
answered Sep 09 '25 by Rob Neuhaus