
Applying UDFs on GroupedData in PySpark (with functioning python example)

I have this Python code that runs locally on a pandas DataFrame:

df_result = pd.DataFrame(df
                          .groupby('A')
                          .apply(lambda x: myFunction(zip(x.B, x.C), x.name)))

I would like to run this in PySpark, but I'm having trouble dealing with the pyspark.sql.group.GroupedData object.

I've tried the following:

sparkDF
 .groupby('A')
 .agg(myFunction(zip('B', 'C'), 'A')) 

which returns

KeyError: 'A'

I presume this is because 'A' is no longer a column, and I can't find an equivalent for x.name.

And then

sparkDF
 .groupby('A')
 .map(lambda row: Row(myFunction(zip('B', 'C'), 'A'))) 
 .toDF()

but get the following error:

AttributeError: 'GroupedData' object has no attribute 'map'

Any suggestions would be really appreciated!

asked Oct 12 '16 by arosner09


4 Answers

Since Spark 2.3 you can use pandas_udf. GROUPED_MAP takes Callable[[pandas.DataFrame], pandas.DataFrame], in other words a function which maps from a Pandas DataFrame of the same shape as the input to the output DataFrame.

For example if data looks like this:

df = spark.createDataFrame(
    [("a", 1, 0), ("a", -1, 42), ("b", 3, -1), ("b", 10, -2)],
    ("key", "value1", "value2")
)

and you want to compute the average of the pairwise minimum of value1 and value2, you have to define the output schema:

from pyspark.sql.types import *

schema = StructType([
    StructField("key", StringType()),
    StructField("avg_min", DoubleType())
])

pandas_udf:

import pandas as pd

from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def g(df):
    result = pd.DataFrame(df.groupby(df.key).apply(
        lambda x: x.loc[:, ["value1", "value2"]].min(axis=1).mean()
    ))
    result.reset_index(inplace=True, drop=False)
    return result

and apply it:

df.groupby("key").apply(g).show()
+---+-------+
|key|avg_min|
+---+-------+
|  b|   -1.5|
|  a|   -0.5|
+---+-------+

Apart from the schema definition and the decorator, your current Pandas code can be applied as-is.
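
For example, here is a minimal sketch (not part of the original answer) of how the myFunction / zip pattern from the question could be wrapped, assuming myFunction returns a single scalar per group and column A contains strings; the function body below is a hypothetical stand-in:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical stand-in for the asker's myFunction: it receives the zipped
# (B, C) pairs plus the group name and returns one scalar per group.
def myFunction(pairs, name):
    return float(sum(b * c for b, c in pairs))

out_schema = StructType([
    StructField("A", StringType()),
    StructField("result", DoubleType())
])

@pandas_udf(out_schema, functionType=PandasUDFType.GROUPED_MAP)
def apply_my_function(pdf):
    # pdf is the full pandas DataFrame for one value of column 'A'
    key = pdf["A"].iloc[0]
    value = myFunction(zip(pdf["B"], pdf["C"]), key)
    return pd.DataFrame([[key, value]], columns=["A", "result"])

# sparkDF.groupby("A").apply(apply_my_function).show()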

Since Spark 2.4.0 there is also a GROUPED_AGG variant, which takes Callable[[pandas.Series, ...], T], where T is a primitive scalar:

import numpy as np

@pandas_udf(DoubleType(), functionType=PandasUDFType.GROUPED_AGG)
def f(x, y):
    return np.minimum(x, y).mean()

which can be used with the standard groupBy / agg construct:

df.groupBy("key").agg(f("value1", "value2").alias("avg_min")).show()
+---+-------+
|key|avg_min|
+---+-------+
|  b|   -1.5|
|  a|   -0.5|
+---+-------+

Please note that neither GROUPED_MAP nor GROUPED_AGG pandas_udf behaves the same way as UserDefinedAggregateFunction or Aggregator; they are closer to groupByKey or window functions with an unbounded frame. Data is shuffled first, and only after that is the UDF applied.

For optimized execution you should implement a Scala UserDefinedAggregateFunction and add a Python wrapper.

See also User defined function to be applied to Window in PySpark?

answered by zero323

What you are trying to do is write a UDAF (User Defined Aggregate Function), as opposed to a UDF (User Defined Function). UDAFs are functions that work on data grouped by a key. Specifically, they need to define how to merge multiple values in the group within a single partition, and then how to merge the results across partitions for each key. There is currently no way in Python to implement a UDAF; they can only be implemented in Scala.

But you can work around it in Python. You can use collect_list to gather your grouped values, and then use a regular UDF to do what you want with them. The only caveat is that collect_list only works on primitive values, so you will need to encode them down to a string.

from pyspark.sql.types import StringType
from pyspark.sql.functions import col, collect_list, concat_ws, udf

def myFunc(data_list):
    for val in data_list:
        b, c = val.split(',')
        # do something

    return <whatever>

myUdf = udf(myFunc, StringType())

df.withColumn('data', concat_ws(',', col('B'), col('C'))) \
  .groupBy('A').agg(collect_list('data').alias('data')) \
  .withColumn('data', myUdf('data'))

Use collect_set instead if you want deduping. Also, if you have lots of values for some of your keys, this will be slow, because all the values for a key need to be collected into a single partition somewhere on your cluster. If your end result is a value you build by combining the values per key in some way (for example summing them), it might be faster to implement it using the RDD aggregateByKey method, which lets you build an intermediate value for each key within a partition before shuffling data around (a rough sketch follows below).
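
As a rough illustration of that last point (not from the original answer), here is a minimal aggregateByKey sketch that sums B * C per key; the aggregation logic is hypothetical and df is assumed to have columns A, B and C:

# Partial sums are built inside each partition before any data is shuffled.
rdd = df.select('A', 'B', 'C').rdd.map(lambda r: (r['A'], (r['B'], r['C'])))

sums = rdd.aggregateByKey(
    0.0,                                  # zero value for each key
    lambda acc, bc: acc + bc[0] * bc[1],  # fold one (B, C) pair into the partition-local accumulator
    lambda a, b: a + b                    # merge accumulators from different partitions
)
# sums.collect() -> list of (key, aggregated value) pairs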

EDIT: 11/21/2018

Since this answer was written, PySpark added support for UDAFs using Pandas. There are some nice performance improvements when using Pandas UDFs and UDAFs over straight Python functions with RDDs. Under the hood it vectorizes the columns (batching the values from multiple rows together to optimize processing and compression). Take a look here for a better explanation, or look at the first answer above for an example.

answered by Ryan Widmaier

I am going to extend the above answer.

You can implement the same logic as pandas.groupby().apply in PySpark using @pandas_udf, which is a vectorized method and faster than a plain udf.

from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

df3 = spark.createDataFrame([('a', 1, 0), ('a', -1, 42), ('b', 3, -1),
                            ('b', 10, -2)], ('key', 'value1', 'value2'))

from pyspark.sql.types import *

schema = StructType([StructField('key', StringType()),
                    StructField('avg_value1', DoubleType()),
                    StructField('avg_value2', DoubleType()),
                    StructField('sum_avg', DoubleType()),
                    StructField('sub_avg', DoubleType())])


@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def g(df):
    gr = df['key'].iloc[0]
    x = df.value1.mean()
    y = df.value2.mean()
    w = df.value1.mean() + df.value2.mean()
    z = df.value1.mean() - df.value2.mean()
    return pd.DataFrame([[gr] + [x] + [y] + [w] + [z]])

df3.groupby('key').apply(g).show()

You will get the result below:

+---+----------+----------+-------+-------+
|key|avg_value1|avg_value2|sum_avg|sub_avg|
+---+----------+----------+-------+-------+
|  b|       6.5|      -1.5|    5.0|    8.0|
|  a|       0.0|      21.0|   21.0|  -21.0|
+---+----------+----------+-------+-------+

So you can do more calculations between other fields of the grouped data and add them to the output DataFrame as a list of values.

answered by Mayur Dangar


Another option, new in PySpark version 3.0.0: applyInPandas

import pandas as pd

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
                           ("id", "v"))

def mean_func(key, pdf):
    # key is a tuple of one numpy.int64, which is the value
    # of 'id' for the current group
    return pd.DataFrame([key + (pdf.v.mean(),)])

df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()

results in:

+---+---+
| id|  v|
+---+---+
|  1|1.5|
|  2|6.0|
+---+---+

For further details, see https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
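
For continuity with the first answer, here is a hedged sketch of the same avg_min computation rewritten with applyInPandas, assuming df is the key / value1 / value2 DataFrame defined there (single-argument form, without the group key tuple):

import pandas as pd

def avg_min(pdf):
    # pdf is the pandas DataFrame for one group; the grouping column is still present in it
    return pd.DataFrame({
        "key": [pdf["key"].iloc[0]],
        "avg_min": [pdf[["value1", "value2"]].min(axis=1).mean()]
    })

df.groupby("key").applyInPandas(avg_min, schema="key string, avg_min double").show()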

answered by Jan_ewazz