Parallel Python iteration

I want to create a number of instances of a class based on values in a pandas.DataFrame. This I've got down.

import itertools
import multiprocessing as mp
import pandas as pd

class Toy:
    id_iter = itertools.count(1)  # class-level counter handing out sequential ids

    def __init__(self, row):
        self.id = next(self.id_iter)
        self.type = row['type']

if __name__ == "__main__":

    table = pd.DataFrame({
        'type': ['a', 'b', 'c'],
        'number': [5000, 4000, 30000]
        })

    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]

Multiprocessing Attempts

I've been able to parallelize this (sort of) by adding the following:

pool = mp.Pool(processes=mp.cpu_count())
m = mp.Manager()
q = m.Queue()

for index, row in table.iterrows():
    pool.apply_async([Toy(row) for _ in range(row['number'])])

It seems that this would be faster if the numbers in row['number'] are substantially larger than the number of rows in table. But in my actual case, table is thousands of rows long, and each row['number'] is relatively small.
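Note that pool.apply_async expects a callable as its first argument; as written, the list comprehension runs eagerly in the parent process and it's the finished list that gets submitted, so the workers do no real work. A minimal corrected sketch (make_toys is a hypothetical helper, not from the original code):

def make_toys(row):
    # build all the Toy instances for a single row
    return [Toy(row) for _ in range(row['number'])]

async_results = [pool.apply_async(make_toys, (row,))
                 for index, row in table.iterrows()]
toys = [toy for res in async_results for toy in res.get()]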

It seems smarter to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.
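A minimal sketch of that chunking idea, assuming numpy is available (it isn't imported in the original code):

import numpy as np

chunks = np.array_split(table, mp.cpu_count())  # one sub-DataFrame per core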

I've tried things that the Python interpreter screams at me for, like:

pool.apply_async(
        for index, row in table.iterrows(): 
        [Toy(row) for _ in range(row['number'])]
        )

I've also tried things that "can't be pickled":

Parallel(n_jobs=4)(
    delayed(Toy)([row for _ in range(row['number'])]) \
            for index, row in table.iterrows()
)

Edit

This may have gotten me a little bit closer, but I'm still not there. I create the class instances in a separate function:

def create_toys(row):
    [Toy(row) for _ in range(row['number'])]

....

Parallel(n_jobs=4, backend="threading")(
    (create_toys)(row) for i, row in table.iterrows()
)

but I'm told 'NoneType' object is not iterable.
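The 'NoneType' object is not iterable error comes from calling create_toys directly instead of wrapping it with delayed(), and from the function returning nothing. A sketch of a working variant, assuming joblib is installed:

from joblib import Parallel, delayed

def create_toys(row):
    # return the instances so Parallel can collect them
    return [Toy(row) for _ in range(row['number'])]

nested = Parallel(n_jobs=4, backend="threading")(
    delayed(create_toys)(row) for index, row in table.iterrows()
)
toys = [toy for sub in nested for toy in sub]  # flatten the per-row lists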

asked Jun 09 '15 by gregmacfarlane

1 Answer

It's a little unclear to me what output you are expecting. Do you just want a big list of the form

[Toy(row_1) ... Toy(row_n)]

where each Toy(row_i) appears with multiplicity row_i.number?

Based on the answer mentioned by @JD Long I think you could do something like this:

import numpy as np

def process(df):
    # build the Toy instances for one chunk of the table
    L = []
    for index, row in df.iterrows():
        L += [Toy(row) for _ in range(row['number'])]
    return L

table = pd.DataFrame({
    'type': ['a', 'b', 'c']*10,
    'number': [5000, 4000, 30000]*10
    })

p = mp.Pool(processes=8)
split_dfs = np.array_split(table, 8)
pool_results = p.map(process, split_dfs)
p.close()
p.join()

# merging parts processed by different processes
result = [a for L in pool_results for a in L]
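One caveat the original answer doesn't mention: each worker process gets its own copy of Toy.id_iter, so the id values repeat across chunks. If globally unique ids matter, a simple fix (an assumption about the desired behavior) is to renumber after the merge:

# each worker started its own itertools.count at 1, so renumber sequentially
for new_id, toy in enumerate(result, start=1):
    toy.id = new_id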
answered Sep 26 '22 by maxymoo