I have a pandas Series with more than 35,000 rows. I want to use Dask to make it more efficient. However, both the Dask code and the pandas code are taking the same time. Initially, ser is a pandas Series, and fun1 and fun2 are basic functions performing pattern matching on individual rows of the series.
Pandas:
ser = ser.apply(fun1).apply(fun2)
Dask:
ser = dd.from_pandas(ser, npartitions = 16)
ser = ser.apply(fun1).apply(fun2)
On checking CPU usage, I found that not all cores were being used; only one core was at 100%.
Is there any way to make the series code faster with Dask, or to utilize all CPU cores while performing Dask operations on a series?
Processes don't exchange data, memory, or resources. As per the Dask documentation, when parallelizing tasks using processes, every task and all of its dependencies are shipped to a local process, executed, and then the result is shipped back to the main process. Evidently, processes have higher task overheads than threads.
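As a rough illustration of that trade-off (this is only a sketch using a toy collection, not code from the question), the scheduler can be chosen per compute() call or for a whole block:

import dask
import dask.bag as db

if __name__ == "__main__":  # guard needed on platforms that spawn worker processes
    # A toy lazy collection; any Dask collection behaves the same way
    numbers = db.from_sequence(range(1000), npartitions=8).map(lambda x: x ** 2)

    numbers.compute(scheduler="threads")    # low overhead, but GIL-bound for pure-Python functions
    numbers.compute(scheduler="processes")  # sidesteps the GIL at the cost of shipping data between processes

    # The scheduler can also be set for a whole block of code:
    with dask.config.set(scheduler="processes"):
        numbers.compute()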
See http://dask.pydata.org/en/latest/scheduler-overview.html
It is likely that the functions you are calling are pure Python, and so claim the GIL, the lock which ensures that only one Python instruction is carried out at a time within a process. In this case, you will need to run your functions in separate processes to see any parallelism. You could do this by using the multiprocessing scheduler:
ser = ser.apply(fun1).apply(fun2).compute(scheduler='processes')
or by using the distributed scheduler (which works fine on a single machine, and actually comes with some next-generation benefits, such as the status dashboard); in the simplest, default case, creating a client is enough:
import dask.distributed
client = dask.distributed.Client()
but you should read the docs.
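Putting it together for the case in the question, a minimal sketch might look like the following. Here fun1, fun2, and the example data are hypothetical stand-ins for the originals, and the meta argument just tells Dask the output name and dtype up front:

import re
import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-ins for fun1/fun2: simple per-row pattern matching
def fun1(text):
    return re.sub(r"\d+", "", str(text))

def fun2(text):
    return text.lower().strip()

if __name__ == "__main__":  # guard needed on platforms that spawn worker processes
    ser = pd.Series(["Example 123 ", "Another 456 row "] * 20000)

    dser = dd.from_pandas(ser, npartitions=16)
    dser = dser.apply(fun1, meta=("ser", "object")).apply(fun2, meta=("ser", "object"))

    # Multiprocessing scheduler: partitions run in separate processes,
    # so the GIL no longer serializes the pure-Python work
    result = dser.compute(scheduler="processes")

    # Alternatively, start a local distributed cluster (dashboard at http://localhost:8787/status):
    # from dask.distributed import Client
    # client = Client()
    # result = dser.compute()  # compute() now routes through the client automatically

Whether this is actually faster depends on how expensive fun1 and fun2 are; for very cheap functions, the per-process overhead mentioned above can cancel out the gain from parallelism.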