I've profiled some legacy code I've inherited with cProfile. I've already made a number of changes that helped (like switching to simplejson's C extensions!).
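For readers who haven't used it, the kind of profiling mentioned above can be reproduced with cProfile's `Profile` object; `work` here is a hypothetical stand-in for the export loop, not the question's actual code:

```python
import cProfile
import io
import pstats

def work():
    # Hypothetical stand-in for the slow export loop being profiled.
    chunks = []
    for i in range(1000):
        chunks.append(str(i).rjust(10))
    return "".join(chunks)

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the stats to a string, sorted by cumulative time,
# showing only the top 5 entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

The `print_stats` report is what tells you the percentage of time spent in each call, including `file.write()`.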
Basically this script exports data from one system to an ASCII fixed-width file. Each row is a record with many values; each line is 7158 characters and contains a ton of spaces. There are 1.5 million records in total. Rows are generated one at a time, and generation is slow (5-10 rows per second).
As each row is generated it's written to disk as simply as possible. Profiling indicates that about 19-20% of the total time is spent in file.write(); for a test case of 1,500 rows, that's 20 seconds. I'd like to reduce that number.
So it seems the next win will be reducing the time spent writing to disk. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once.
    fd = open(data_file, 'w')
    for c, (recordid, values) in enumerate(generatevalues()):
        row = prep_row(recordid, values)
        fd.write(row)
        if c % 117 == 0:
            if limit > 0 and c >= limit:
                break
            sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
            sys.stdout.flush()
My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:
    rows = []
    for c, (recordid, values) in enumerate(generatevalues()):
        rows.append(prep_row(recordid, values))
        if c % 117 == 0:
            fd.write('\n'.join(rows))
            rows = []
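A self-contained sketch of that batching idea (generatevalues and prep_row are stubbed out here, since the question's real implementations aren't shown):

```python
import os
import tempfile

def generatevalues():
    # Stub standing in for the question's slow record generator.
    for recordid in range(10):
        yield recordid, {"value": recordid * 2}

def prep_row(recordid, values):
    # Stub: a short fixed-width row instead of the real 7158-char record.
    return ("%05d%05d" % (recordid, values["value"])).ljust(20)

BATCH = 4  # flush every 4 rows here; the question flushes every 117

path = os.path.join(tempfile.mkdtemp(), "out.txt")
rows = []
with open(path, "w") as fd:
    for c, (recordid, values) in enumerate(generatevalues(), start=1):
        rows.append(prep_row(recordid, values))
        if c % BATCH == 0:
            fd.writelines(rows)  # one syscall-sized chunk per batch
            rows = []
    if rows:
        fd.writelines(rows)  # don't drop the final partial batch

with open(path) as fh:
    data = fh.read()
```

Two details worth noting: '\n'.join(rows) in the question's version puts no newline after the last row of each batch, so consecutive batches would run together, and any rows left over after the final full batch are never written. The sketch uses writelines with rows that carry their own padding and adds a final flush.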
My second thought would be to use another thread, but that makes me want to die inside.
Actually, your problem is not that file.write() takes 20% of your time. It's that 80% of the time you aren't in file.write()!
Writing to the disk is slow. There is really nothing you can do about it: it simply takes a long time to push that much data out to disk, and there is almost nothing you can do to speed that part up.
What you want is for that I/O time to be the biggest part of the program, so that your speed is limited by the speed of the hard disk, not by your processing time. The ideal is for file.write() to have 100% usage!
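As an aside (my suggestion, not part of the answer above): Python's open() already buffers writes in userspace, and you can enlarge that buffer instead of batching rows by hand. A minimal sketch:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

# buffering is in bytes; with ~1 MiB most fd.write() calls just fill an
# in-memory buffer, and the OS sees a few large writes instead of many
# small ones. newline="" disables platform newline translation so the
# byte count below is exact.
with open(path, "w", buffering=1024 * 1024, newline="") as fd:
    for i in range(1000):
        fd.write(("row %d" % i).ljust(80) + "\n")

size = os.path.getsize(path)
```

Whether this helps depends on where the time actually goes; as the answer notes, if 80% of the runtime is in row generation, tuning the write path can only recover the remaining slice.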