
Write data to disk in Python as a background process

I have a program in Python that basically does the following:

for j in xrange(200):
    # 1) Compute a bunch of data
    # 2) Write data to disk

1) takes about 2-5 minutes
2) takes about 1 minute

Note that there is too much data to keep in memory.

Ideally, I would like to write the data to disk without leaving the CPU idle, i.e. start computing the next chunk while the previous one is still being written. Is this possible in Python? Thanks!

asked Apr 25 '13 by Joel Vroom


1 Answer

You could try using multiple processes like this:

import multiprocessing as mp

def compute(j):
    # compute a bunch of data
    return data

def write(data):
    # write data to disk
    pass

if __name__ == '__main__':
    pool = mp.Pool()
    for j in xrange(200):
        pool.apply_async(compute, args=(j, ), callback=write)
    pool.close()
    pool.join()

pool = mp.Pool() will create a pool of worker processes. By default, the number of workers equals the number of CPU cores your machine has.
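For instance (a minimal sketch in Python 3 syntax; the worker count shown is just an example), you can check the default and pass an explicit number instead:

import multiprocessing as mp

if __name__ == '__main__':
    print(mp.cpu_count())        # the default number of workers Pool() starts
    pool = mp.Pool(processes=2)  # or request an explicit number of workers
    pool.close()
    pool.join()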

Each pool.apply_async call queues a task to be run by a worker in the pool of worker processes. When a worker is available, it runs compute(j). When the worker returns a value, data, a thread in the main process runs the callback function write(data), with data being the data returned by the worker.
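To see where each piece runs, here is a minimal, self-contained sketch (Python 3 syntax; the compute and report functions are made up for illustration) that prints the process ID for the worker and for the callback:

import multiprocessing as mp
import os
import threading

def compute(j):
    # runs in a worker process; return the worker's PID along with the result
    return j, os.getpid()

def report(result):
    # runs in a helper thread of the main process
    j, worker_pid = result
    print("task %d: computed in PID %d, callback in PID %d (thread %s)"
          % (j, worker_pid, os.getpid(), threading.current_thread().name))

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    for j in range(4):
        pool.apply_async(compute, args=(j,), callback=report)
    pool.close()
    pool.join()

Because the callback runs in that single helper thread of the main process, it should return quickly; otherwise it will hold up the handling of later results.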

Some caveats:

  • The data has to be picklable, since it is being communicated from the worker process back to the main process via a Queue.
  • There is no guarantee that the order in which the workers complete tasks is the same as the order in which the tasks were sent to the pool. So the order in which the data is written to disk may not correspond to j ranging from 0 to 199. One way around this problem is to write the data to an SQLite (or other) database with j as one of the fields; when you later want to read the data in order, you can SELECT * FROM table ORDER BY j. (A sketch of this approach appears after this list.)
  • Using multiple processes will increase the amount of memory required, since data generated by the worker processes accumulates in the Queue while it waits to be written to disk. You might be able to reduce the amount of memory required by using NumPy arrays. If that is not possible, then you might have to reduce the number of processes:

    pool = mp.Pool(processes=1) 
    

    That will create one worker process (to run compute), leaving the main process to run write. Since compute takes longer than write, the Queue won't get backed up with more than one chunk of data to be written to disk. However, you would still need enough memory to compute on one chunk of data while writing a different chunk of data to disk.

    If you do not have enough memory to do both simultaneously, then you have no choice -- your original code, which runs compute and write sequentially, is the only way.
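To make the SQLite suggestion from the list above concrete, here is a minimal sketch (Python 3 syntax; the file name results.db, the chunks table, and the placeholder computation are all made up for illustration) that keys each chunk by j so it can be read back in order:

import multiprocessing as mp
import pickle
import sqlite3

def compute(j):
    # placeholder for the real 2-5 minute computation
    return j, list(range(j, j + 5))

def write(result):
    # called from a helper thread in the main process, hence check_same_thread=False
    j, data = result
    conn.execute("INSERT OR REPLACE INTO chunks (j, data) VALUES (?, ?)",
                 (j, pickle.dumps(data)))
    conn.commit()

if __name__ == '__main__':
    conn = sqlite3.connect('results.db', check_same_thread=False)
    conn.execute("CREATE TABLE IF NOT EXISTS chunks (j INTEGER PRIMARY KEY, data BLOB)")
    pool = mp.Pool()
    for j in range(200):
        pool.apply_async(compute, args=(j,), callback=write)
    pool.close()
    pool.join()
    conn.close()

    # later, read the chunks back in the original order:
    conn = sqlite3.connect('results.db')
    for j, blob in conn.execute("SELECT j, data FROM chunks ORDER BY j"):
        data = pickle.loads(blob)   # process each chunk in order here
    conn.close()

Since only the pool's single result-handling thread ever calls write, the database writes stay serialized even though the computations run in parallel.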

answered by unutbu