
How to map a column with dask

I want to apply a mapping on a DataFrame column. With Pandas this is straightforward:

df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))

This writes the infos column based on the custom_map function, using the values in numbers as input to the lambda.
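For concreteness, custom_map and hashmap are not defined in the question; a minimal sketch of what they might look like (both the function body and the dict are assumptions):

    # Hypothetical stand-ins for the question's custom_map and hashmap;
    # the real implementations are not shown in the question.
    hashmap = {1: "one", 2: "two", 3: "three"}

    def custom_map(nr, mapping):
        # Look up nr in the mapping, falling back to a default label.
        return mapping.get(nr, "unknown")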

With dask this isn't as simple. ddf is a dask DataFrame; map_partitions applies a function to each partition of the DataFrame in parallel.

The following does not work, because you don't define columns like that in dask:

ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))

Does anyone know how I can use columns here? I don't understand their API documentation at all.

asked Oct 13 '16 by wishi



1 Answer

You can use the .map method, exactly as in Pandas:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'x': [1, 2, 3]})

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: df.x.map(lambda x: x + 1)
Out[5]: 
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]: 
0    2
1    3
2    4
Name: x, dtype: int64
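
Applied to the question's setup, the same pattern produces the desired column. Here custom_map and hashmap are the asker's own objects, and assigning a Series from a second DataFrame assumes ddf and ddf2 have matching divisions:

    # Column assignment works in dask as long as the Series aligns with ddf;
    # custom_map and hashmap come from the question.
    ddf["infos"] = ddf2["numbers"].map(lambda nr: custom_map(nr, hashmap))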

Metadata

You may be asked to provide a meta= keyword. This lets dask.dataframe know the name and type of your function's output. Copying the docstring from map_partitions here:

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and 
column names of the output. This metadata is necessary for many 
algorithms in dask dataframe to work. For ease of use, some 
alternative inputs are also available. Instead of a DataFrame, 
a dict of {name: dtype} or iterable of (name, dtype) can be 
provided. Instead of a series, a tuple of (name, dtype) can be 
used. If not provided, dask will try to infer the metadata. 
This may lead to unexpected results, so providing meta is  
recommended. 

For more information, see dask.dataframe.utils.make_meta.
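
As a rough illustration of that utility, make_meta normalizes the alternative meta inputs described above into empty pandas objects. This is a sketch; the exact behavior and import location may vary across dask versions:

    from dask.dataframe.utils import make_meta

    # Sketch: make_meta turns the shorthand meta forms into empty pandas objects.
    make_meta({'x': 'i8'})    # empty DataFrame with one int64 column 'x'
    make_meta(('x', 'i8'))    # empty Series named 'x' with dtype int64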

So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:

>>> ddf.x.map(lambda x: x + 1, meta=('x', int))

or

>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))

This tells dask.dataframe what to expect from our function. If no meta is given, dask.dataframe will try running your function on a small piece of data; if that fails, it will raise an error asking you to provide meta explicitly.
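
Coming back to the original map_partitions attempt: map_partitions passes each partition to the function as a whole pandas Series, so the per-element work belongs inside the function you pass. A minimal sketch, assuming the question's custom_map and hashmap and an object dtype for the result (adjust to whatever custom_map actually returns):

    # map_partitions receives one pandas Series per partition, so call .map
    # inside the function; meta describes the output Series.
    ddf["infos"] = ddf2["numbers"].map_partitions(
        lambda part: part.map(lambda nr: custom_map(nr, hashmap)),
        meta=("numbers", object),  # assumed dtype; adjust to custom_map's return type
    )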

answered by MRocklin