On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

In the code snippet below, I would expect the logs to print the numbers 0-4. I understand that the numbers may not appear in that order, since the task would be broken up into a number of parallel operations.

Code snippet:

from dask import dataframe as dd
import numpy as np
import pandas as pd

# Every column holds the values 0-4.
df = pd.DataFrame({'A': np.arange(5),
                   'B': np.arange(5),
                   'C': np.arange(5)})

ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    # Log the B value of each row this function receives.
    print('B val received: ' + str(x.B))
    return x

ddf.apply(aggregate, axis=1).compute()

But when the above code is run, I see this instead:

B val received: 1
B val received: 1
B val received: 1
B val received: 0
B val received: 0
B val received: 1
B val received: 2
B val received: 3
B val received: 4

Instead of 0-4, I see a series of 1s printed first, plus an extra 0. I have noticed these "extra" rows of value 1 every time I have set up a Dask DataFrame and run an apply operation on it.

Printing the dataframe shows no additional rows with value 1 throughout:

   A  B  C
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4

My question is: where are these rows of value 1 coming from? Why do they consistently appear before the "actual" rows of the dataframe? The 1s seem unrelated to the values in the real rows (that is, it is not as though Dask is for some reason grabbing the second row an extra few times).

asked Apr 14 '17 by kuanb


2 Answers

@Grr's answer is correct. Dask.dataframe doesn't know what your function will produce, but it still has to hand you back a lazy dask.dataframe with the correct columns and dtypes, so it tries your function on a small sample of data.

You can avoid these checks by providing metadata about your intended output using the meta= keyword (more details in the DataFrame.apply docstring). If you provide this information then Dask.dataframe will not need to try your function to determine types.

Copying this section here:

Docstring

meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
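As a quick sketch of those alternative spellings (assuming the same three int64 columns as the question; the dict and tuple forms come straight from the docstring above):

import numpy as np
import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
ddf = dd.from_pandas(df, npartitions=1)

# DataFrame output: meta given as a dict of {column name: dtype}
ddf.apply(lambda row: row, axis=1,
          meta={'A': 'int64', 'B': 'int64', 'C': 'int64'}).compute()

# Series output: meta given as a (name, dtype) tuple
ddf.apply(lambda row: row.B, axis=1, meta=('B', 'int64')).compute()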

Solution

So if you create an example output as an empty dataframe with matching column names and dtypes then you'll be fine:

meta = pd.DataFrame({'A': pd.Series(dtype='int64'),
                     'B': pd.Series(dtype='int64'),
                     'C': pd.Series(dtype='int64')})
ddf.apply(aggregate, axis=1, meta=meta)

Or, in this case, because your function doesn't change the columns or dtypes of the input, you can reuse the input's own metadata by passing the collection itself; Dask will extract the empty meta frame from it:

ddf.apply(aggregate, axis=1, meta=ddf)
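
A quick sanity check (a sketch; the exact print order can vary with the scheduler): once meta is supplied, Dask no longer needs to trial-run your function, so only the five real rows should be logged.

ddf.apply(aggregate, axis=1, meta=ddf).compute()
# Expected log lines (order may vary):
# B val received: 0
# B val received: 1
# B val received: 2
# B val received: 3
# B val received: 4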

answered Nov 01 '22 by MRocklin

Dask does some checking on what you have told it to do before it runs it on the entire collection of partitions. That is where the first few print statements come from. It's part of the built-in error checking that prevents Dask from going down some long-winded series of operations only to fail at the end.
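
To see where the specific value 1 comes from, you can peek at the sample data Dask uses for this trial run. This is an internals peek, not a public API: _meta_nonempty is a private Dask attribute and its placeholder values could change between versions, but for integer columns the fake partition is filled with 1s, which matches the stray log lines in the question.

import numpy as np
import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5), 'C': np.arange(5)})
ddf = dd.from_pandas(df, npartitions=1)

# Private attribute: the tiny fake partition Dask uses when it trial-runs
# your function. Its int64 columns hold the placeholder value 1.
print(ddf._meta_nonempty)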

answered Nov 01 '22 by Grr