I'm trying to parallelize an application using multiprocessing. It takes in a very large CSV file (64 MB to 500 MB), does some work line by line, and then outputs a small, fixed-size file.
Currently I do a list(file_obj), which unfortunately loads the whole file into memory (I think), and then I break that list up into n parts, n being the number of processes I want to run. I then do a pool.map() on the broken-up lists.
This seems to have a really, really bad runtime in comparison to a single-threaded, just-open-the-file-and-iterate-over-it approach. Can someone suggest a better solution?
Additionally, I need to process the rows of the file in groups which preserve the value of a certain column. These groups of rows can themselves be split up, but no group should contain more than one value for this column.
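For reference, here is a simplified sketch of what I'm doing now (the file name, chunk math, and per-line work are placeholders, not my real code):

import multiprocessing as mp

def work(lines):
    # stand-in for the real per-line computation
    return sum(len(line) for line in lines)

def main():
    n = 4                                     # number of worker processes
    with open('input.csv') as f:
        lines = list(f)                       # the whole file ends up in memory here
    # break the list into n roughly equal parts
    size = max(1, (len(lines) + n - 1) // n)
    parts = [lines[i:i + size] for i in range(0, len(lines), size)]
    with mp.Pool(n) as pool:
        print(pool.map(work, parts))

if __name__ == '__main__':
    main()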
We can use the file object itself as an iterator: it returns each line one by one, which can be processed as it is read. This does not load the whole file into memory, which makes it suitable for reading large files in Python.
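For example, a minimal sketch of that pattern (the file name and the per-line work are placeholders):

total = 0
with open('large_file.csv') as f:
    for line in f:              # the file object yields lines lazily, one at a time
        total += len(line)      # stand-in for the real per-line work
print(total)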
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.
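As a minimal illustration (the function and inputs are just examples, not tied to the question):

import multiprocessing as mp

def square(x):
    # runs in a separate worker process, so the GIL is not a bottleneck
    return x * x

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:        # four worker processes
        print(pool.map(square, range(10)))    # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]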
A multiprocessing version can also be slower if each map call has to rebuild expensive state, because the mapped functions are assumed to be stateless. In some cases this can be avoided by using the initializer argument of multiprocessing.Pool to set up that state once per worker process.
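A minimal sketch of that pattern (the setup function and the state it builds are illustrative):

import multiprocessing as mp

_state = None

def init_worker():
    # runs once per worker process; build expensive state here
    global _state
    _state = {'offset': 42}           # stand-in for an expensive-to-build object

def work(x):
    # each call reuses the state built once in init_worker
    return x + _state['offset']

if __name__ == '__main__':
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(work, range(5)))   # [42, 43, 44, 45, 46]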
list(file_obj) can require a lot of memory when file_obj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.
In particular, we can use

reader = csv.reader(f)
chunks = itertools.groupby(reader, keyfunc)

to split the file into processable chunks, and

groups = [list(chunk) for key, chunk in
          itertools.islice(chunks, num_chunks)]
result = pool.map(worker, groups)

to have the multiprocessing pool work on num_chunks chunks at a time.
By doing so, we need roughly only enough memory to hold a few (num_chunks) chunks in memory, instead of the whole file.
import multiprocessing as mp
import itertools
import time
import csv

def worker(chunk):
    # `chunk` will be a list of CSV rows all with the same name column
    # replace this with your real computation
    # print(chunk)
    return len(chunk)

def keyfunc(row):
    # `row` is one row of the CSV file.
    # replace this with the name column.
    return row[0]

def main():
    pool = mp.Pool()
    largefile = 'test.dat'
    num_chunks = 10
    results = []
    with open(largefile) as f:
        reader = csv.reader(f)
        chunks = itertools.groupby(reader, keyfunc)
        while True:
            # make a list of num_chunks chunks
            groups = [list(chunk) for key, chunk in
                      itertools.islice(chunks, num_chunks)]
            if groups:
                result = pool.map(worker, groups)
                results.extend(result)
            else:
                break
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()
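Note that itertools.groupby only groups consecutive rows that share the same key, so every chunk contains rows with a single value of the key column even if the file is not sorted; rows with the same value that are not adjacent simply land in separate chunks, which is fine since the question allows groups to be split up. If each value had to end up in exactly one chunk, the file would first have to be sorted on that column.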
I would keep it simple. Have a single program open the file and read it line by line. You can choose how many files to split it into, open that many output files, and write each line to the next file in turn. This will split the file into n roughly equal parts. You can then run a Python program against each of the files in parallel.
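A rough sketch of that splitting step (the file names and the number of parts are placeholders):

n = 4                                                      # number of parts
parts = [open('part_%d.csv' % i, 'w') for i in range(n)]
with open('input.csv') as f:
    for i, line in enumerate(f):
        parts[i % n].write(line)       # round-robin: each line goes to the next file
for p in parts:
    p.close()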