 

Speed up writing to files


I've profiled some legacy code I inherited with cProfile. I've already made a number of changes that helped (like using simplejson's C extensions!).

Basically this script is exporting data from one system to an ASCII fixed-width file. Each row is a record with many values. Each line is 7158 characters and contains a ton of spaces. There are 1.5 million records in total. Rows are generated one at a time, and generation is slow (5-10 rows a second).

As each row is generated, it's written to disk as simply as possible. The profiling indicates that about 19-20% of the total time is spent in file.write(). For a test case of 1,500 rows, that's 20 seconds. I'd like to reduce that number.
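For reference, the profile was produced with cProfile, roughly along these lines (export_main below is just a placeholder for the script's real entry point):

import cProfile
import pstats

def export_main():
    # Placeholder for the real export loop; substitute the actual entry point.
    pass

# Dump per-function timings to a file, then print the 20 biggest contributors.
cProfile.run('export_main()', 'export.prof')
pstats.Stats('export.prof').sort_stats('cumulative').print_stats(20)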

Now it seems the next win will be reducing the amount of time spent writing to disk. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once.

fd = open(data_file, 'w')
for c, (recordid, values) in enumerate(generatevalues()):
    row = prep_row(recordid, values)
    fd.write(row)
    if c % 117 == 0:
        if limit > 0 and c >= limit:
            break
        sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
        sys.stdout.flush()

My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:

rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    rows.append(prep_row(recordid, values))
    if c % 117 == 0:
        fd.write('\n'.join(rows))
        rows = []
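Fleshed out a bit, that idea might look like the sketch below. It assumes prep_row() already appends the trailing newline (as the unbatched version implies, since it writes rows back to back), and it remembers to flush whatever is left over from the final partial batch:

# Batched-write sketch: buffer ~117 rows, then write them in one call.
# Assumes prep_row() returns a line ending in '\n', and that generatevalues(),
# data_file and limit exist as in the original script.
rows = []
with open(data_file, 'w') as fd:
    for c, (recordid, values) in enumerate(generatevalues()):
        rows.append(prep_row(recordid, values))
        if c % 117 == 0:
            fd.write(''.join(rows))
            rows = []
            if limit > 0 and c >= limit:
                break
    fd.write(''.join(rows))  # flush any rows left in the buffer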

My second thought would be to use another thread, but that makes me want to die inside.

asked Feb 10 '11 by chmullig



1 Answer

Actually, your problem is not that file.write() takes 20% of your time. It's that you aren't in file.write() for the other 80% of the time!

Writing to the disk is slow. It simply takes a very large amount of time to push data out to disk, and there is almost nothing you can do to speed that part up.

What you want is for that I/O time to be the biggest part of the program, so that your speed is limited by the speed of the hard disk rather than by your processing time. The ideal is for file.write() to be at 100% usage!
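A quick way to see that split for yourself is to time the two halves of the loop separately, something like the sketch below (generatevalues, prep_row and data_file are the names from the question and stand in for the real code):

import time

# Accumulate time spent generating rows vs. time spent in file.write().
gen_time = write_time = 0.0
with open(data_file, 'w') as fd:
    for recordid, values in generatevalues():
        t0 = time.perf_counter()
        row = prep_row(recordid, values)
        t1 = time.perf_counter()
        fd.write(row)
        t2 = time.perf_counter()
        gen_time += t1 - t0
        write_time += t2 - t1

print('generating rows: %.1fs, writing rows: %.1fs' % (gen_time, write_time))

If the first number dwarfs the second, speeding up file.write() won't buy you much.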

answered Sep 18 '22 by Winston Ewert