I'm processing a data set with Dask (it doesn't fit in memory) and I want to group the instances, applying a different aggregating function depending on the column and its type.
Dask has a set of default aggregation functions for numerical data types, but not for strings/objects. Is there a way to implement a user-defined aggregation function for strings, somewhat similar to the example below?
atts_to_group = ['A', 'B']
agg_fn = {
    'C': 'mean',             # int
    'D': 'concatenate_fn1',  # string - no default fn for strings - doesn't work
    'E': 'concatenate_fn2',  # string
}
ddf = ddf.groupby(atts_to_group).agg(agg_fn).compute().reset_index()
At this point I can read the whole data set into memory after dropping irrelevant columns/rows, but I'd prefer to keep the processing in Dask, since it performs the required operations faster.
Edit: I tried passing a custom function directly in the dictionary:
def custom_concat(df):
    ...
    return df_concatd

agg_fn = {
    'C': 'mean',         # int
    'D': custom_concat,  # plain functions aren't accepted - raises the error below
}
-------------------------------------------------------
ValueError: unknown aggregate Dask DataFrame Structure:
I realised Dask provides an Aggregation data structure for exactly this. The custom aggregation can be defined as follows:
import dask.dataframe as dd

# Concatenates the strings within each group, separating them with ","
custom_concat_D = dd.Aggregation(
    'custom_concat_D',
    chunk=lambda s: s.apply(','.join),  # applied to the groups of each partition
    agg=lambda s: s.apply(','.join),    # combines the per-partition results
)
custom_concat_E = ...
atts_to_group = ['A', 'B']
agg_fn = {
    'C': 'mean',  # int
    'D': custom_concat_D,
    'E': custom_concat_E,
}
ddf = ddf.groupby(atts_to_group).agg(agg_fn).compute().reset_index()
This can also be done with groupby().apply() for a less verbose solution:
def agg_fn(x):
    return pd.Series(
        dict(
            C = x['C'].mean(),               # int
            D = "{%s}" % ', '.join(x['D']),  # string (concat strings)
            E = ...
        )
    )

ddf = ddf.groupby(atts_to_group).apply(agg_fn).compute().reset_index()