
Python Dask - dataframe.map_partitions() return value

So dask.dataframe.map_partitions() takes a func argument and the meta kwarg. How exactly does it decide its return type? As an example:

Lots of csv's in ...\some_folder.

import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_csv(r"...\some_folder\*", usecols=['ColA', 'ColB'],
                  blocksize=None,
                  dtype={'ColA': np.float32, 'ColB': np.float32})
example_func = lambda x: x.iloc[-1] / len(x)
metaResult = pd.Series({'ColA': .1234, 'ColB': .1234})
result = ddf.map_partitions(example_func, meta=metaResult).compute()

I'm pretty new to "distributed" computing, but I would intuitively expect this to return a collection of Series objects (a list or dict, most likely), yet the result is a single Series that could be considered a concatenation of the results of example_func on each partition. This in itself would also suffice, if the Series had a MultiIndex indicating the partition label.

From what I can tell from this question, the docs, and the source code itself, this is because ddf.divisions returns (None, None, ..., None) as a result of reading csv's? Is there a dask-native way to do this, or do I need to manually break the returned Series (a concatenation of the Series returned by example_func on each partition) myself?
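For reference, the manual break I have in mind would look roughly like this, assuming the concatenated result holds one (ColA, ColB) pair per partition (the values and partition count below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical concatenated result: one (ColA, ColB) pair per partition.
flat = pd.Series([0.1, 0.2, 0.3, 0.4],
                 index=['ColA', 'ColB', 'ColA', 'ColB'])

# Rebuild a MultiIndex with a partition level, so each partition's
# result becomes addressable as split.loc[partition_number].
n_partitions = len(flat) // 2
labels = np.repeat(np.arange(n_partitions), 2)
split = flat.set_axis(
    pd.MultiIndex.from_arrays([labels, flat.index],
                              names=['partition', None]))
```
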

Also, feel free to correct my assumptions/practices here, as I'm new to dask.

StarFox asked Nov 17 '16
People also ask

What does map_partitions do in the context of a Dask DataFrame?

map_partitions applies a Python function to each DataFrame partition.

How do I select a row in a Dask DataFrame?

Just like Pandas, Dask DataFrame supports label-based indexing with the .loc accessor for selecting rows or columns, and __getitem__ (square brackets) for selecting just columns. To select rows, the DataFrame's divisions must be known (see Internal Design and Dask DataFrames Best Practices for more information).

Is Dask faster than pandas?

Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.


1 Answer

So dask.dataframe.map_partitions() takes a func argument and the meta kwarg. How exactly does it decide its return type?

map_partitions tries to concatenate the results returned by func into either a dask DataFrame or a dask Series object in an 'intelligent' way. The decision is based on the return value of func:

  • If func returns a scalar, map_partitions returns a dask Series object.
  • If func returns a pd.Series object, map_partitions returns a dask Series object, in which all pd.Series objects returned by func are concatenated.
  • If func returns a pd.DataFrame, map_partitions returns a dask DataFrame object, in which these pd.DataFrame objects are concatenated along the first axis.

If you are interested in the result of a specific partition, you can use get_partition(). If the partition label is important information for you in general, I would consider assigning a separate column to your ddf directly after reading in the data from csv, containing all the information you need. Afterwards, you can construct func so that it returns a pd.DataFrame with the result of your calculation in one column and the information you need to identify the result in another.

Arco Bast answered Oct 13 '22