 

Python: Process file using multiple cores

I am currently trying to read a large file (80 million lines), where I need to make a computationally intensive matrix multiplication for each entry. After calculating this, I want to insert the result into a database. Because of the time-intensive nature of this process, I want to split the work across multiple cores to speed it up.

After researching, I found this promising approach, which splits a file into n blocks:

def file_block(fp, number_of_blocks, block):
    '''
    A generator that splits a file into blocks and iterates
    over the lines of one of the blocks.
    '''
    assert 0 <= block < number_of_blocks

    # Find the byte offsets that bound this block.
    fp.seek(0, 2)
    file_size = fp.tell()

    # Floor division: seek() requires integer offsets in Python 3.
    ini = file_size * block // number_of_blocks
    end = file_size * (1 + block) // number_of_blocks

    if ini <= 0:
        fp.seek(0)
    else:
        # Skip the partial line that belongs to the previous block.
        fp.seek(ini - 1)
        fp.readline()

    while fp.tell() < end:
        yield fp.readline()

You can call the function iteratively like this:

if __name__ == '__main__':
    number_of_chunks = 4
    with open(filename) as fp:
        for chunk_number in range(number_of_chunks):
            print(chunk_number, 100 * '=')
            for line in file_block(fp, number_of_chunks, chunk_number):
                process(line)

While this works, I run into problems when parallelizing it with multiprocessing:

from multiprocessing import Pool, cpu_count

fp = open(filename)
number_of_chunks = 4
li = [file_block(fp, number_of_chunks, chunk_number)
      for chunk_number in range(number_of_chunks)]

p = Pool(cpu_count() - 1)
p.map(processChunk, li)

The error says that generators cannot be pickled.

While I understand this error, it would be too expensive to first iterate over the whole file just to collect all lines in a list.
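The pickling limitation is easy to reproduce in isolation. A minimal sketch (names are illustrative, not from the question):

```python
import pickle

def line_gen():
    # Stand-in for one of the file_block generators.
    yield "a line"

try:
    pickle.dumps(line_gen())
except TypeError as exc:
    # Generators hold live interpreter state (frame, file handle)
    # that cannot be serialized and sent to another process.
    print("not picklable:", exc)
```

This is exactly why `Pool.map` rejects the list of `file_block` generators: every argument must be serialized before it can cross a process boundary.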

Moreover, I want each core to process a block of lines per iteration, because it is more efficient to insert multiple rows into the database at once (instead of one by one, as the typical map approach would do).
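Grouping results before writing can be done independently of the database driver. A generic batching helper (a sketch; `batched` is a hypothetical name, not part of the question's code) could look like:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each worker could then feed its computed rows through `batched(...)` and hand every batch to a single bulk insert call instead of issuing one statement per line.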

Thanks for your help.

asked Nov 22 '16 by bublitz

People also ask

Can Python use multiple CPU cores?

Key Takeaways. Python is NOT a single-threaded language. Python processes typically use a single thread because of the GIL. Despite the GIL, libraries that perform computationally heavy tasks like numpy, scipy and pytorch utilise C-based implementations under the hood, allowing the use of multiple cores.

Can a process run on multiple cores?

Yes, a single process can run multiple threads on different cores. Caching is specific to the hardware. Many modern Intel processors have three layers of caching, where the last level cache is shared across cores.

How many CPU cores will Python threading take advantage of simultaneously?

Python threading is restricted to a single CPU at a time because of the GIL. The multiprocessing library will allow you to run code on different processors.
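To make the threads-vs-processes distinction concrete, a minimal sketch: a CPU-bound function dispatched through `multiprocessing.Pool` runs on multiple cores, whereas the same work in threads would be serialized by the GIL (function names here are illustrative).

```python
from multiprocessing import Pool

def cpu_bound(n):
    # Pure-Python busy work: holds the GIL if run in threads,
    # but scales across cores when run in separate processes.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(4) as pool:
        # Eight independent tasks spread over four worker processes.
        results = pool.map(cpu_bound, [100_000] * 8)
    print(len(results))  # 8
```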


1 Answer

Instead of creating the generators up front and passing them into each worker, leave that to the worker process: send only picklable parameters (the filename and chunk indices), and let each process open the file and build its own generator.

from multiprocessing import Pool, cpu_count

def processChunk(params):
    # Each worker opens the file itself and builds its own
    # generator, so nothing unpicklable crosses the process boundary.
    filename, chunk_number, number_of_chunks = params
    with open(filename, 'r') as fp:
        for line in file_block(fp, number_of_chunks, chunk_number):
            process(line)

if __name__ == '__main__':
    li = [(filename, i, number_of_chunks) for i in range(number_of_chunks)]
    with Pool(cpu_count() - 1) as p:
        p.map(processChunk, li)
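Building on that idea, each worker can also batch its database writes. Below is a sketch, not the answer's code: it assumes a sqlite3 table `results(value)`, uses a placeholder `compute` for the expensive per-line work, and splits lines round-robin for brevity (the byte-offset `file_block` from the question would slot in at the same place).

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 1000

def compute(line):
    # Placeholder for the expensive per-line matrix computation;
    # returns a tuple matching the INSERT statement below.
    return (len(line.rstrip("\n")),)

def insert_chunk(params):
    # Worker: processes its share of the file and writes results
    # to the database in batches via executemany.
    filename, chunk_number, number_of_chunks, db_path = params
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    with open(filename) as fp:
        # Round-robin line split for brevity.
        lines = (line for i, line in enumerate(fp)
                 if i % number_of_chunks == chunk_number)
        while True:
            batch = [compute(line) for line in islice(lines, BATCH_SIZE)]
            if not batch:
                break
            # One round-trip per batch instead of one per row.
            cur.executemany("INSERT INTO results(value) VALUES (?)", batch)
            conn.commit()
    conn.close()
```

Each process keeps its own connection (connections, like generators, should not be shared across process boundaries), and the commit-per-batch pattern is what delivers the bulk-insert speedup the question asks for.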
answered Oct 27 '22 by Mark Ransom