I'm writing a process that needs to generate a UUID for groups of records that match on certain criteria. My code works, but I'm worried about potential issues from creating the UUID inside my UDF (which makes it non-deterministic). Here's a simplified example to illustrate:
from uuid import uuid1
from pyspark.sql import SparkSession
from pyspark.sql.functions import PandasUDFType, pandas_udf
spark = (
    SparkSession.builder.master("local")
    .appName("Word Count")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

df = spark.createDataFrame([["j", 3], ["h", 3], ["a", 2]], ["name", "age"])

@pandas_udf("name string, age integer, uuid string", PandasUDFType.GROUPED_MAP)
def create_uuid(df):
    df["uuid"] = str(uuid1())
    return df
>>> df.groupby("age").apply(create_uuid).show()
+----+---+--------------------+
|name|age| uuid|
+----+---+--------------------+
| j| 3|1f8f48ac-0da8-430...|
| h| 3|1f8f48ac-0da8-430...|
| a| 2|d5206d03-bcce-445...|
+----+---+--------------------+
This currently works in a data processing job over 200k records on AWS Glue, and I haven't found any bugs yet. I use uuid1 since it incorporates the node information when generating the UUID, ensuring that no two nodes generate the same id.
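For reference, a quick standard-library check (outside Spark) of what uuid1 encodes; the field accessors are from Python's uuid module, and this snippet is only an illustration, not part of the Glue job:

from uuid import uuid1

u = uuid1()
# uuid1 packs a 60-bit timestamp, a clock sequence and a 48-bit node id
# (normally derived from a MAC address) into the UUID.
print(u.time)   # 100-nanosecond intervals since 1582-10-15
print(u.node)   # 48-bit node identifier
# The node id is the last field of the textual form:
assert "%012x" % u.node == str(u).split("-")[-1]

Note that if Python can't determine a hardware address, uuid.getnode() falls back to a random 48-bit number, so cross-node collisions remain extremely unlikely but are not strictly impossible.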
One thought I had was to register the UDF as non-deterministic with:
udf = pandas_udf(
    create_uuid, "name string, age integer, uuid string", PandasUDFType.GROUPED_MAP
).asNondeterministic()
But that gave me the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o60.flatMapGroupsInPandas.
: org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found:
`age`,create_uuid(name, age),`name`,`age`,`uuid`
in operator FlatMapGroupsInPandas [age#1L], create_uuid(name#0, age#1L), [name#7, age#8, uuid#9]
;;
FlatMapGroupsInPandas [age#1L], create_uuid(name#0, age#1L), [name#7, age#8, uuid#9]
+- Project [age#1L, name#0, age#1L]
+- LogicalRDD [name#0, age#1L], false
My questions are:

1. Does leaving the UDF registered as deterministic risk any data quality issues, e.g. duplicate invocations producing different UUIDs for the same group?
2. Is there a way to register a GROUPED_MAP pandas_udf as non-deterministic, or a reason why GroupedData.apply doesn't accept one?
Your function is non-deterministic, but Spark is treating it as deterministic, i.e. "Due to optimization, duplicate invocations may be eliminated". However, each call to the pandas_udf receives a unique input (the rows grouped by key), so the optimisation for duplicate calls to the pandas_udf is never triggered. Hence the asNondeterministic method for suppressing such optimisations is redundant for a pandas_udf of GROUPED_MAP type. In my opinion, this explains why the GroupedData.apply function has not been coded to accept a pandas_udf marked as non-deterministic: it would make no sense to, as there are no optimisation opportunities to suppress.
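To illustrate, here is a small sketch (reusing the spark session and df from the question) of where asNondeterministic is accepted: a SCALAR pandas_udf ends up inside a Project, where the analyzer allows non-deterministic expressions, whereas the GROUPED_MAP version goes through FlatMapGroupsInPandas, which is why you get the AnalysisException above. The helper name add_uuid and the per-row UUIDs are purely illustrative, not something from your job.

import pandas as pd
from uuid import uuid1
from pyspark.sql.functions import PandasUDFType, pandas_udf

@pandas_udf("string", PandasUDFType.SCALAR)
def add_uuid(names):
    # one UUID per row, just to demonstrate the mechanics
    return pd.Series([str(uuid1()) for _ in range(len(names))])

# Accepted: the non-deterministic SCALAR UDF sits in a Project.
add_uuid_nd = add_uuid.asNondeterministic()
df.withColumn("uuid", add_uuid_nd("name")).show()

For your use case, though, the GROUPED_MAP UDF as you wrote it is fine without the flag, for the reasons above.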