I read the following SO thead and now am trying to understand it. Here is my example: <pre class="prettyprint"><code>import dask.dataframe as dd import pandas as pd from dask.multiprocessing import get import random df = pd.DataFrame({'col_1':random.sample(range(10000), 10000), 'col_2': random.sample(range(10000), 10000) }) def test_f(col_1, col_2): return col_1*col_2 ddf = dd.from_pandas(df, npartitions=8) ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get) </code></pre> It generates the following error below. What am I doing wrong? Also I am not clear how to pass additional parameters to function in <code>map_partitions</code>? <pre class="prettyprint"><code>--------------------------------------------------------------------------- TypeError Traceback (most recent call last) ~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname) 136 try: --> 137 yield 138 except Exception as e: ~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs) 3130 with raise_on_meta_error(funcname(func)): -> 3131 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True)) 3132 TypeError: test_f() got an unexpected keyword argument 'columns' During handling of the above exception, another exception occurred: ValueError Traceback (most recent call last) <ipython-input-9-913789c7326c> in <module>() ----> 1 ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get) ~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(self, func, *args, **kwargs) 469 >>> ddf.map_partitions(func).clear_divisions() # doctest: +SKIP 470 """ --> 471 return map_partitions(func, self, *args, **kwargs) 472 473 @insert_meta_param_description(pad=12) ~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(func, *args, **kwargs) 3163 3164 if meta is no_default: -> 3165 meta = _emulate(func, *args, **kwargs) 3166 3167 if all(isinstance(arg, Scalar) for arg in args): ~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs) 3129 """ 3130 with raise_on_meta_error(funcname(func)): -> 3131 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True)) 3132 3133 ~\AppData\Local\conda\conda\envs\tensorflow\lib\contextlib.py in __exit__(self, type, value, traceback) 75 value = type() 76 try: ---> 77 self.gen.throw(type, value, traceback) 78 except StopIteration as exc: 79 # Suppress StopIteration *unless* it's the same exception that ~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname) 148 ).format(" in `{0}`".format(funcname) if funcname else "", 149 repr(e), tb) --> 150 raise ValueError(msg) 151 152 ValueError: Metadata inference failed in `test_f`. Original error is below: ------------------------ TypeError("test_f() got an unexpected keyword argument 'columns'",) Traceback: --------- File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py", line 137, in raise_on_meta_error yield File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py", line 3131, in _emulate return func(*_extract_meta(args, True), **_extract_meta(kwargs, True)) </code></pre>

Your <code>test_f</code> takes two arguments: <code>col_1</code> and <code>col_2</code>. You pass a single argument, <code>ddf</code>. Try something like <pre class="prettyprint"><code>In [5]: dd.map_partitions(test_f, ddf['col_1'], ddf['col_2']) Out[5]: Dask Series Structure: npartitions=8 0 int64 1250 ... ... 8750 ... 9999 ... dtype: int64 Dask Name: test_f, 32 tasks </code></pre>

simple dask map_partitions example

Tags:

python

parallel-processing

dask

I read the following SO thead and now am trying to understand it. Here is my example:

Click to copy

import dask.dataframe as dd
import pandas as pd
from dask.multiprocessing import get
import random

df = pd.DataFrame({'col_1':random.sample(range(10000), 10000), 'col_2': random.sample(range(10000), 10000) })

def test_f(col_1, col_2):
    return col_1*col_2

ddf = dd.from_pandas(df, npartitions=8)

ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)

It generates the following error below. What am I doing wrong? Also I am not clear how to pass additional parameters to function in map_partitions?

Click to copy

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname)
    136     try:
--> 137         yield
    138     except Exception as e:

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs)
   3130     with raise_on_meta_error(funcname(func)):
-> 3131         return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
   3132 

TypeError: test_f() got an unexpected keyword argument 'columns'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-9-913789c7326c> in <module>()
----> 1 ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(self, func, *args, **kwargs)
    469         >>> ddf.map_partitions(func).clear_divisions()  # doctest: +SKIP
    470         """
--> 471         return map_partitions(func, self, *args, **kwargs)
    472 
    473     @insert_meta_param_description(pad=12)

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(func, *args, **kwargs)
   3163 
   3164     if meta is no_default:
-> 3165         meta = _emulate(func, *args, **kwargs)
   3166 
   3167     if all(isinstance(arg, Scalar) for arg in args):

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs)
   3129     """
   3130     with raise_on_meta_error(funcname(func)):
-> 3131         return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
   3132 
   3133 

~\AppData\Local\conda\conda\envs\tensorflow\lib\contextlib.py in __exit__(self, type, value, traceback)
     75                 value = type()
     76             try:
---> 77                 self.gen.throw(type, value, traceback)
     78             except StopIteration as exc:
     79                 # Suppress StopIteration *unless* it's the same exception that

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname)
    148                ).format(" in `{0}`".format(funcname) if funcname else "",
    149                         repr(e), tb)
--> 150         raise ValueError(msg)
    151 
    152 

ValueError: Metadata inference failed in `test_f`.

Original error is below:
------------------------
TypeError("test_f() got an unexpected keyword argument 'columns'",)

Traceback:
---------
  File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py", line 137, in raise_on_meta_error
    yield
  File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py", line 3131, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))

551

asked Nov 05 '17 19:11

user1700890

2 Answers

There is an example in map_partitions docs to achieve exactly what are trying to do:

Click to copy

ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))

When you call map_partitions (just like when you call .apply() on pandas.DataFrame), the function that you try to map (or apply) will be given dataframe as a first argument.

In case of dask.dataframe.map_partitions this first argument will be a partition and in case of pandas.DataFrame.apply - a whole dataframe.

Which means that your function has to accept dataframe(partition) as a first argument and and in your case could look like this:

Click to copy

def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])

Note that assignment of a new column in this case happens (i.e. gets scheduled to happen) BEFORE you call .compute().

In your example you assign column AFTER you call .compute(), which kind of defeats the purpose of using dask. I.e. after you call .compute() the results of that operation are loaded into memory if there is enough space for those results (if not you just get MemoryError).

So for you example to work you could:

1) Use function (with column names as arguments):

Click to copy

def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])


ddf_out = ddf.map_partitions(test_f, 'col_1', 'col_2')

# Here is good place to do something with BIG ddf_out dataframe before calling .compute()

result = ddf_out.compute(get=get)  # Will load the whole dataframe into memory

2) Use lambda (with column names hardcoded in the function):

Click to copy

ddf_out = ddf.map_partitions(lambda df: df.assign(result=df.col_1 * df.col_2))

# Here is good place to do something with BIG ddf_out dataframe before calling .compute()

result = ddf_out.compute(get=get)  # Will load the whole dataframe into memory

Update:

To apply function on a row-by-row basis, here is a quote from the post you linked:

map / apply

You can map a function row-wise across a series with map

Click to copy
df.mycolumn.map(func)
You can map a function row-wise across a dataframe with apply

Click to copy
df.apply(func, axis=1)

I.e. for the example function in your question, it might look like this:

Click to copy

def test_f(dds, col_1, col_2):
    return dds[col_1] * dds[col_2]

Since you will be applying it on a row-by-row basis the function's first argument will be a series (i.e. each row of a dataframe is a series).

To apply this function then you might call it like this:

Click to copy

dds_out = ddf.apply(
    test_f, 
    args=('col_1', 'col_2'), 
    axis=1, 
    meta=('result', int)
).compute(get=get)

This will return a series named 'result'.

I guess you could also call .apply on each partition with a function but it does not look to be any more efficient then calling .apply on dataframe directly. But may be your tests will prove otherwise.

134

answered Oct 03 '22 17:10

Primer

Your test_f takes two arguments: col_1 and col_2. You pass a single argument, ddf.

Try something like

Click to copy

In [5]: dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
Out[5]:
Dask Series Structure:
npartitions=8
0       int64
1250      ...
        ...
8750      ...
9999      ...
dtype: int64
Dask Name: test_f, 32 tasks

answered Oct 03 '22 17:10

TomAugspurger

Related questions
                            
                                How to count one specific word in Python?
                            
                                Select xarray/pandas index based on specific months
                            
                                Meaning of '\0\0' in Python?
                            
                                How to extract digits from a number from left to right?
                            
                                When to use one or two underscore in Python [duplicate]
                            
                                How to solve 'module' object has no attribute '_base' issue?
                            
                                ModuleNotFoundError: No module named 'bs4'
                            
                                How to check if a list contains a boolean value
                            
                                Python: convert matrix to positive semi-definite
                            
                                Drop Duplicates in a DataFrame Keeping the Row with the Least Nulls
                            
                                How to read/write a matrix from a persistent XML/YAML file in OpenCV 3 with python?
                            
                                Anaconda Python: Delete .tar.gz in pkgs
                            
                                Python TA-lib install error, how solve it?
                            
                                django.db.utils.ProgrammingError: cannot cast type text[] to jsonb
                            
                                Evaluating the LightFM Recommendation Model
                            
                                Best parameters solved by Hyperopt is unsuitable
                            
                                How to unpack a list of tuples with enumerate? [duplicate]
                            
                                Python yield (migrating from Ruby): How can I write a function without arguments and only with yield to do prints?
                            
                                with python, select repeated elements longer than N
                            
                                Python 3 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

simple dask map_partitions example

Tags:

python

parallel-processing

dask

user1700890

People also ask

2 Answers

Primer

TomAugspurger

Recent Activity

Donate For Us