
Opening a 25GB text file for processing

I have a 25GB file I need to process. Here is what I'm currently doing, but it takes an extremely long time to open:

import os

collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'r') as f:
    collection_contents = f.readlines()

length_of_file = len(collection_contents)

for num, line in enumerate(collection_contents):
    print '%s / %s' % (num+1, length_of_file)
    cursor.execute(...)

How could I improve this?

asked Dec 15 '22 by David542

1 Answer

  1. Unless the lines in your file are really, really long, do not print the progress on every line. Printing to a terminal is very slow. Print progress only every 100 or every 1000 lines instead.

  2. Use the operating system to get the size of the file with os.path.getsize(); see Getting file size in Python?

  3. Get rid of readlines() so you don't read all 25GB into memory. Instead, read and process the file line by line; see e.g. How to read large file, line by line in python
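Putting the three points together, a minimal sketch might look like the following. The file contents, the progress interval, and the byte-based percentage are assumptions for illustration; the per-line database work from the question is left as a comment, and a small stand-in file is generated so the sketch runs end to end.

```python
import os
import tempfile

# Create a small stand-in file so the sketch is self-contained;
# in the question this would be the existing 25GB 'collection_price' file.
pricing_directory = tempfile.mkdtemp()
collection_pricing = os.path.join(pricing_directory, 'collection_price')
with open(collection_pricing, 'w') as f:
    for i in range(2500):
        f.write('sku-%d,%d\n' % (i, i * 100))

# 2. Ask the OS for the file size instead of loading the file to count lines.
total_bytes = os.path.getsize(collection_pricing)

processed_bytes = 0
# 3. Iterate over the file object directly: one line in memory at a time.
with open(collection_pricing, 'r') as f:
    for num, line in enumerate(f, start=1):
        processed_bytes += len(line)
        # cursor.execute(...)  # the per-line work from the question goes here
        # 1. Report progress only every 1000 lines, not on every line.
        if num % 1000 == 0:
            pct = 100.0 * processed_bytes / total_bytes
            print('%d lines, %.1f%% of %d bytes' % (num, pct, total_bytes))
```

Since the line count of a 25GB file isn't known up front without reading it, the sketch reports progress as bytes consumed relative to os.path.getsize(), which is cheap and close enough for a progress display.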

answered Jan 01 '23 by nos