python dask DataFrame, support for (trivially parallelizable) row apply?

Tags:

I recently found dask module that aims to be an easy-to-use python parallel processing module. Big selling point for me is that it works with pandas.

After reading a bit on its manual page, I can't find a way to do this trivially parallelizable task:

ts.apply(func) # for pandas series df.apply(func, axis = 1) # for pandas DF row apply

At the moment, to achieve this in dask, AFAIK,

ddf.assign(A=lambda df: df.apply(func, axis=1)).compute() # dask DataFrame

which is ugly syntax and is actually slower than outright

df.apply(func, axis = 1) # for pandas DF row apply

Any suggestion?

Edit: Thanks @MRocklin for the map function. It seems to be slower than plain pandas apply. Is this related to pandas GIL releasing issue or am I doing it wrong?

import dask.dataframe as dd s = pd.Series([10000]*120) ds = dd.from_pandas(s, npartitions = 3)  def slow_func(k):     A = np.random.normal(size = k) # k = 10000     s = 0     for a in A:         if a > 0:             s += 1         else:             s -= 1     return s  s.apply(slow_func) # 0.43 sec ds.map(slow_func).compute() # 2.04 sec

291

asked Jul 11 '15 20:07

jf328

1 Answers

`map_partitions`

You can apply your function to all of the partitions of your dataframe with the map_partitions function.

df.map_partitions(func, columns=...)

Note that func will be given only part of the dataset at a time, not the entire dataset like with pandas apply (which presumably you wouldn't want if you want to do parallelism.)

`map` / `apply`

You can map a function row-wise across a series with map

df.mycolumn.map(func)

You can map a function row-wise across a dataframe with apply

df.apply(func, axis=1)

Threads vs Processes

As of version 0.6.0 dask.dataframes parallelizes with threads. Custom Python functions will not receive much benefit from thread-based parallelism. You could try processes instead

df = dd.read_csv(...)  df.map_partitions(func, columns=...).compute(scheduler='processes')

But avoid `apply`

However, you should really avoid apply with custom Python functions, both in Pandas and in Dask. This is often a source of poor performance. It could be that if you find a way to do your operation in a vectorized manner then it could be that your Pandas code will be 100x faster and you won't need dask.dataframe at all.

Consider `numba`

For your particular problem you might consider numba. This significantly improves your performance.

In [1]: import numpy as np In [2]: import pandas as pd In [3]: s = pd.Series([10000]*120)  In [4]: %paste def slow_func(k):     A = np.random.normal(size = k) # k = 10000     s = 0     for a in A:         if a > 0:             s += 1         else:             s -= 1     return s ## -- End pasted text --  In [5]: %time _ = s.apply(slow_func) CPU times: user 345 ms, sys: 3.28 ms, total: 348 ms Wall time: 347 ms  In [6]: import numba In [7]: fast_func = numba.jit(slow_func)  In [8]: %time _ = s.apply(fast_func)  # First time incurs compilation overhead CPU times: user 179 ms, sys: 0 ns, total: 179 ms Wall time: 175 ms  In [9]: %time _ = s.apply(fast_func)  # Subsequent times are all gain CPU times: user 68.8 ms, sys: 27 µs, total: 68.8 ms Wall time: 68.7 ms

Disclaimer, I work for the company that makes both numba and dask and employs many of the pandas developers.

199

answered Sep 25 '22 10:09

MRocklin

Related questions
                            
                                Difference between asfreq and resample
                            
                                Convert Pandas dataframe to Sparse Numpy Matrix directly
                            
                                How to set cookie in Python Flask?
                            
                                How to close a socket left open by a killed program?
                            
                                import vs __import__( ) vs importlib.import_module( )?
                            
                                How can I graph the Lines of Code history for git repo?
                            
                                why does scikitlearn says F1 score is ill-defined with FN bigger than 0?
                            
                                Could not translate host name "db" to address using Postgres, Docker Compose and Psycopg2
                            
                                Sorting a dictionary with lists as values, according to an element from the list
                            
                                Using .format() to format a list with field width arguments
                            
                                Why do integers in database row tuple have an 'L' suffix?
                            
                                How to cast tuple into namedtuple?
                            
                                How to combine Celery with asyncio?
                            
                                How to stop Tkinter Frame from shrinking to fit its contents?
                            
                                How to parse timezone with colon
                            
                                Break on exception in pydev
                            
                                __init__ and arguments in Python
                            
                                Django ImportError: cannot import name 'render_to_response' from 'django.shortcuts'
                            
                                tempfile.TemporaryDirectory context manager in Python 2.7
                            
                                What is the difference between Image.resize and Image.thumbnail in Pillow-Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

python dask DataFrame, support for (trivially parallelizable) row apply?

Tags:

python

pandas

parallel-processing

dask

jf328

People also ask

1 Answers

`map_partitions`

`map` / `apply`

Threads vs Processes

But avoid `apply`

Consider `numba`

MRocklin

Recent Activity

Donate For Us

python dask DataFrame, support for (trivially parallelizable) row apply?

Tags:

python

pandas

parallel-processing

dask

jf328

People also ask

1 Answers

map_partitions

map / apply

Threads vs Processes

But avoid apply

Consider numba

MRocklin

Related questions

Recent Activity

Donate For Us

`map_partitions`

`map` / `apply`

But avoid `apply`

Consider `numba`