
Safe & performant way to modify dask dataframe

Tags:

As part of a data workflow I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in two cases: mapping columns and mapping partitions. What is the recommended safe and performant way to act on the data? I'm running in a distributed setup on a cluster with multiple worker processes on each host.

Case 1

I want to run:

res = dataframe.column.map(func, ...)

This returns a data series, so I assume the original dataframe is not modified. Is it safe to assign the column back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I make a copy with .copy() and then assign the result to it, like:

dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)

Any other recommended way to do it?

Case 2

I need to map partitions of the dataframe:

df.map_partitions(mapping_func, meta=df)

Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply with a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?

The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but when modifying data in place I'm seeing crashes (with no exception or error messages, though). The same goes for calling partition.copy(deep=False); it doesn't work. Should the partition be deep-copied and then modified in place? Or should I always construct a new dataframe out of the new/mapped column data and the original/unmodified series/columns?

evilkonrex asked Sep 05 '17



1 Answer

You can safely modify a dask.dataframe

Operations like the following are supported and safe

df['col'] = df['col'].map(func)

This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).

You cannot safely modify a partition

Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, and so on. If you have such a function then you should create a copy of the pandas dataframe first within that function.

MRocklin answered Sep 30 '22