Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the return value of map_partitions?

The dask API says, that map_partition can be used to "apply a Python function on each DataFrame partition." From this description and according to the usual behaviour of "map", I would expect the return value of map_partitions to be (something like) a list whose length equals the number of partitions. Each element of the list should be one of the return values of the function calls.

However, with respect to the following code, I am not sure, what the return value depends on:

#generate example dataframe
pdf = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
ddf = dd.from_pandas(pdf, npartitions=3)

#define helper function for map. VAL is the return value
VAL = pd.Series({'A': 1})
#VAL = pd.DataFrame({'A': [1]}) #other return values used in this example
#VAL = None
#VAL = 1
def helper(x):
    print('function called\n')
    return VAL

#check result
out = ddf.map_partitions(helper).compute()
print(len(out))
  • VAL = pd.Series({'A': 1}) causes 4 function calls (probably one to infer the dtype and 3 for the partitions) and an output with len == 3 and the type pd.Series.
  • pd.DataFrame({'A': [1]}) results in the same numbers, however the resulting type is pd.DataFrame.
  • VAL = None causes an TypeError ... why? Couldn't a possible use of map_partitions be to do something rather than to return something?
  • VAL = 1 results in only 2 function calls. The result of map_partitions is the integer 1.

Therefore, I want to ask some questions:

  1. how is the return value of map_partitions determined?
  2. What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?
  3. What should be the return value of a function, that only "does" something, i.e. a procedure?
  4. How should a function be designed, that returns arbitrary objects?
like image 417
Arco Bast Avatar asked Aug 29 '16 21:08

Arco Bast


1 Answers

The Dask DataFrame.map_partitions function returns a new Dask Dataframe or Series, based on the output type of the mapped function. See the API documentation for a thorough explanation.

  1. How is the return value of map_partitions determined?

    See the API docs referred to above.

  2. What influences the number of function calls besides the number of partitions / What criteria has a function to fulfil to be called once with each partition?

    You're correct that we're calling it once immediately to guess the dtypes/columns of the output. You can avoid this by specifying a meta= keyword directly. Other than that the function is called once per partition.

  3. What should be the return value of a function, that only "does" something, i.e. a procedure?

    You could always return an empty dataframe. You might also want to consider transforming your dataframe into a sequence of dask.delayed objects, which are typically more often used for ad-hoc computations.

  4. How should a function be designed, that returns arbitrary objects?

    If your function doesn't return series/dataframes then I recommend converting your dataframe to a sequence of dask.delayed objects with the DataFrame.to_delayed method.

like image 126
MRocklin Avatar answered Oct 12 '22 21:10

MRocklin