
How to map a column with dask

I want to apply a mapping on a DataFrame column. With Pandas this is straightforward:

df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))

This writes the infos column based on the custom_map function, using the values in numbers as input to the lambda.
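For concreteness, custom_map and hashmap are not defined in the question; a minimal sketch of what they might look like (both the function body and the dict are assumptions):

    # Hypothetical stand-ins for the question's custom_map and hashmap;
    # the real implementations are not shown in the question.
    hashmap = {1: "one", 2: "two", 3: "three"}

    def custom_map(nr, mapping):
        # Look up nr in the mapping, falling back to a default label.
        return mapping.get(nr, "unknown")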

With dask this isn't as simple. ddf is a dask DataFrame; map_partitions applies a function to each partition of the DataFrame in parallel.

The following does not work, because you don't define columns like that in dask:

ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))

Does anyone know how I can use columns here? I don't understand their API documentation at all.

asked Oct 13 '16 by wishi



1 Answer

You can use the .map method, exactly as in Pandas:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [1, 2, 3]})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: df.x.map(lambda x: x + 1)
Out[5]: 
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]: 
0    2
1    3
2    4
Name: x, dtype: int64
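
Applied to the question's setup, the same pattern produces the desired column. Here custom_map and hashmap are the asker's own objects, and assigning a Series from a second DataFrame assumes ddf and ddf2 have matching divisions:

    # Column assignment works in dask as long as the Series aligns with ddf;
    # custom_map and hashmap come from the question.
    ddf["infos"] = ddf2["numbers"].map(lambda nr: custom_map(nr, hashmap))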

Metadata

You may be asked to provide a meta= keyword. This lets dask.dataframe know the name and type of your function's output. Copying the docstring from map_partitions here:

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and 
column names of the output. This metadata is necessary for many 
algorithms in dask dataframe to work. For ease of use, some 
alternative inputs are also available. Instead of a DataFrame, 
a dict of {name: dtype} or iterable of (name, dtype) can be 
provided. Instead of a series, a tuple of (name, dtype) can be 
used. If not provided, dask will try to infer the metadata. 
This may lead to unexpected results, so providing meta is  
recommended. 

For more information, see dask.dataframe.utils.make_meta.
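
As a rough illustration of that utility, make_meta normalizes the alternative meta inputs described above into empty pandas objects. This is a sketch; the exact behavior and import location may vary across dask versions:

    from dask.dataframe.utils import make_meta

    # Sketch: make_meta turns the shorthand meta forms into empty pandas objects.
    make_meta({'x': 'i8'})    # empty DataFrame with one int64 column 'x'
    make_meta(('x', 'i8'))    # empty Series named 'x' with dtype int64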

So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:

>>> ddf.x.map(lambda x: x + 1, meta=('x', int))

or

>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))

This tells dask.dataframe what to expect from our function. If no meta is given, dask.dataframe will try running your function on a small piece of data; if that fails, it will raise an error asking you to provide meta explicitly.
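
Coming back to the original map_partitions attempt: map_partitions passes each partition to the function as a whole pandas Series, so the per-element work belongs inside the function you pass. A minimal sketch, assuming the question's custom_map and hashmap and an object dtype for the result (adjust to whatever custom_map actually returns):

    # map_partitions receives one pandas Series per partition, so call .map
    # inside the function; meta describes the output Series.
    ddf["infos"] = ddf2["numbers"].map_partitions(
        lambda part: part.map(lambda nr: custom_map(nr, hashmap)),
        meta=("numbers", object),  # assumed dtype; adjust to custom_map's return type
    )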

answered by MRocklin