I have a pandas Series with more than 35,000 rows. I want to use Dask to make it more efficient. However, both the Dask code and the pandas code are taking the same time. Initially, ser is a pandas Series, and fun1 and fun2 are basic functions performing pattern matching on individual rows of the series.
Pandas:
ser = ser.apply(fun1).apply(fun2)
Dask:
ser = dd.from_pandas(ser, npartitions = 16)
ser = ser.apply(fun1).apply(fun2)
On checking CPU usage, I found that not all cores were being used; only one core was at 100%.
Is there any way to make the series code faster with Dask, or to utilize all CPU cores while performing Dask operations on a series?
Processes don't exchange data, memory, or resources. As per the Dask documentation, when parallelizing tasks using processes, every task and all of its dependencies are shipped to a local process, executed, and then the result is shipped back to the main process. Evidently, processes have higher task overheads than threads.
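As a rough illustration of that trade-off (this is only a sketch using a toy collection, not code from the question), the scheduler can be chosen per compute() call or for a whole block:

import dask
import dask.bag as db

if __name__ == "__main__":  # guard needed on platforms that spawn worker processes
    # A toy lazy collection; any Dask collection behaves the same way
    numbers = db.from_sequence(range(1000), npartitions=8).map(lambda x: x ** 2)

    numbers.compute(scheduler="threads")    # low overhead, but GIL-bound for pure-Python functions
    numbers.compute(scheduler="processes")  # sidesteps the GIL at the cost of shipping data between processes

    # The scheduler can also be set for a whole block of code:
    with dask.config.set(scheduler="processes"):
        numbers.compute()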
See http://dask.pydata.org/en/latest/scheduler-overview.html
It is likely that the functions you are calling are pure Python, and so claim the GIL, the lock which ensures that only one Python instruction is carried out at a time within a process. In this case, you will need to run your functions in separate processes to see any parallelism. You could do this by using the multiprocessing scheduler:
ser = ser.apply(fun1).apply(fun2).compute(scheduler='processes')
or by using the distributed scheduler (which works fine on a single machine, and actually comes with some next-generation benefits, such as the status dashboard); in the simplest, default case, creating a client is enough:
import dask.distributed
client = dask.distributed.Client()
but you should read the docs.
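Putting it together for the case in the question, a minimal sketch might look like the following. Here fun1, fun2, and the example data are hypothetical stand-ins for the originals, and the meta argument just tells Dask the output name and dtype up front:

import re
import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-ins for fun1/fun2: simple per-row pattern matching
def fun1(text):
    return re.sub(r"\d+", "", str(text))

def fun2(text):
    return text.lower().strip()

if __name__ == "__main__":  # guard needed on platforms that spawn worker processes
    ser = pd.Series(["Example 123 ", "Another 456 row "] * 20000)

    dser = dd.from_pandas(ser, npartitions=16)
    dser = dser.apply(fun1, meta=("ser", "object")).apply(fun2, meta=("ser", "object"))

    # Multiprocessing scheduler: partitions run in separate processes,
    # so the GIL no longer serializes the pure-Python work
    result = dser.compute(scheduler="processes")

    # Alternatively, start a local distributed cluster (dashboard at http://localhost:8787/status):
    # from dask.distributed import Client
    # client = Client()
    # result = dser.compute()  # compute() now routes through the client automatically

Whether this is actually faster depends on how expensive fun1 and fun2 are; for very cheap functions, the per-process overhead mentioned above can cancel out the gain from parallelism.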