I want to create a number of instances of a class based on the values in a pandas.DataFrame. This I've got down.
import itertools
import multiprocessing as mp
import pandas as pd

class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(Toy.id_iter)
        self.type = row['type']

if __name__ == "__main__":
    table = pd.DataFrame({
        'type': ['a', 'b', 'c'],
        'number': [5000, 4000, 30000]
    })

    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
I've been able to parallelize this (sort of) by adding the following:
pool = mp.Pool(processes=mp.cpu_count())
m = mp.Manager()
q = m.Queue()

for index, row in table.iterrows():
    pool.apply_async([Toy(row) for _ in range(row['number'])])
It seems that this would be faster if the numbers in row['number'] were substantially larger than the length of table. But in my actual case, table is thousands of lines long, and each row['number'] is relatively small.
It seems smarter to break table up into cpu_count() chunks and iterate within each chunk. But now we're at the edge of my Python skills.
I've tried things that the python interpreter screams at me for, like:
pool.apply_async(
    for index, row in table.iterrows():
        [Toy(row) for _ in range(row['number'])]
)
Also things that "can't be pickled":

Parallel(n_jobs=4)(
    delayed(Toy)([row for _ in range(row['number'])])
    for index, row in table.iterrows()
)
This may have gotten me a little bit closer, but still not there. I create the class instances in a separate function:
def create_toys(row):
    [Toy(row) for _ in range(row['number'])]

....

Parallel(n_jobs=4, backend="threading")(
    (create_toys)(row) for i, row in table.iterrows()
)
but I'm told 'NoneType' object is not iterable.
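Two things seem to be going on there: create_toys has no return statement, so it returns None, and without wrapping it in delayed the calls run eagerly in the generator, handing Parallel a stream of Nones. A sketch of the fixed version, assuming the same Toy class and small row counts for brevity:

```python
import itertools

import pandas as pd
from joblib import Parallel, delayed

class Toy:
    id_iter = itertools.count(1)

    def __init__(self, row):
        self.id = next(Toy.id_iter)
        self.type = row['type']

def create_toys(row):
    # return the list so Parallel can collect it
    return [Toy(row) for _ in range(row['number'])]

table = pd.DataFrame({'type': ['a', 'b'], 'number': [2, 3]})

# the threading backend sidesteps pickling Toy instances entirely
nested = Parallel(n_jobs=2, backend="threading")(
    delayed(create_toys)(row) for _, row in table.iterrows()
)
toys = [t for chunk in nested for t in chunk]
print(len(toys))  # 5
```

Parallel returns one list per row, so the final comprehension flattens the nested result.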
It's a little bit unclear to me what output you are expecting. Do you just want a big list of the form [Toy(row_1), ..., Toy(row_n)], where each Toy(row_i) appears with multiplicity row_i.number?
Based on the answer mentioned by @JD Long I think you could do something like this:
import multiprocessing as mp

import numpy as np
import pandas as pd

def process(df):
    # iterate over the chunk passed in, not the global table
    L = []
    for index, row in df.iterrows():
        L += [Toy(row) for _ in range(row['number'])]
    return L

table = pd.DataFrame({
    'type': ['a', 'b', 'c'] * 10,
    'number': [5000, 4000, 30000] * 10
})

p = mp.Pool(processes=8)
split_dfs = np.array_split(table, 8)
pool_results = p.map(process, split_dfs)
p.close()
p.join()

# merging parts processed by different processes
result = [a for L in pool_results for a in L]
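One detail worth knowing about the chunking step: np.array_split, unlike np.split, tolerates divisions that don't come out even. Splitting 30 items into 8 chunks gives six chunks of 4 and two of 3:

```python
import numpy as np

# 30 items into 8 chunks: chunk sizes differ by at most one
chunks = np.array_split(np.arange(30), 8)
print([len(c) for c in chunks])  # [4, 4, 4, 4, 4, 4, 3, 3]
```

So with a table of thousands of rows, each of the 8 worker processes gets a roughly equal slice.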