I want to create a number of instances of a class based on the values in a pandas.DataFrame. This I've got down.
import itertools
import multiprocessing as mp
import pandas as pd

class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(Toy.id_iter)
        self.type = row['type']

if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c'],
        'number': [5000, 4000, 30000]
    })

    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
I've been able to parallelize this (sort of) by adding the following:
pool = mp.Pool(processes=mp.cpu_count())
m = mp.Manager()
q = m.Queue()

for index, row in table.iterrows():
    pool.apply_async([Toy(row) for _ in range(row['number'])])
It seems that this would be faster if the numbers in row['number'] were substantially larger than the length of table. But in my actual case, table is thousands of lines long, and each row['number'] is relatively small.
It seems smarter to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.
I've tried things that the python interpreter screams at me for, like:
pool.apply_async(
    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
)
Also things that "can't be pickled":

Parallel(n_jobs=4)(
    delayed(Toy)([row for _ in range(row['number'])])
    for index, row in table.iterrows()
)
This may have gotten me a little bit closer, but still not there. I create the class instances in a separate function:
def create_toys(row):
    [Toy(row) for _ in range(row['number'])]

....

Parallel(n_jobs=4, backend="threading")(
    (create_toys)(row) for i, row in table.iterrows()
)
but I'm told 'NoneType' object is not iterable.
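Two things seem to be going on there: create_toys has no return statement, so it returns None, and without wrapping it in delayed the calls run eagerly in the generator, handing Parallel a stream of Nones. A sketch of the fixed version, assuming the same Toy class and small row counts for brevity:

```python
import itertools

import pandas as pd
from joblib import Parallel, delayed

class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(Toy.id_iter)
        self.type = row['type']

def create_toys(row):
    # return the list so Parallel can collect it
    return [Toy(row) for _ in range(row['number'])]

table = pd.DataFrame({'type': ['a', 'b'], 'number': [2, 3]})

# the threading backend sidesteps pickling Toy instances entirely
nested = Parallel(n_jobs=2, backend="threading")(
    delayed(create_toys)(row) for _, row in table.iterrows()
)
toys = [t for chunk in nested for t in chunk]
print(len(toys))  # 5
```

Parallel returns one list per row, so the final comprehension flattens the nested result.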
It's a little bit unclear to me what output you are expecting. Do you just want a big list of the form [Toy(row_1), ..., Toy(row_n)], where each Toy(row_i) appears with multiplicity row_i.number?
Based on the answer mentioned by @JD Long I think you could do something like this:
import multiprocessing as mp

import numpy as np
import pandas as pd

def process(df):
    # iterate over the chunk passed in, not the global table
    L = []
    for index, row in df.iterrows():
        L += [Toy(row) for _ in range(row['number'])]
    return L

table = pd.DataFrame({
    'type': ['a', 'b', 'c'] * 10,
    'number': [5000, 4000, 30000] * 10
})

p = mp.Pool(processes=8)
split_dfs = np.array_split(table, 8)
pool_results = p.map(process, split_dfs)
p.close()
p.join()

# merging parts processed by different processes
result = [a for L in pool_results for a in L]
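One detail worth knowing about the chunking step: np.array_split, unlike np.split, tolerates divisions that don't come out even. Splitting 30 items into 8 chunks gives six chunks of 4 and two of 3:

```python
import numpy as np

# 30 items into 8 chunks: chunk sizes differ by at most one
chunks = np.array_split(np.arange(30), 8)
print([len(c) for c in chunks])  # [4, 4, 4, 4, 4, 4, 3, 3]
```

So with a table of thousands of rows, each of the 8 worker processes gets a roughly equal slice.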