I have a DataFrame to which I want to add a column of distinct uuid4() values. My code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType
from uuid import uuid4
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([
    [1, 1, 'teste'],
    [2, 2, 'teste'],
    [3, 0, 'teste'],
    [4, 5, 'teste'],
], list('abc'))
df = df.withColumn("_tmp", f.lit(1))
uuids = [str(uuid4()) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, StringType())
df1 = df1.withColumn("_tmp", f.lit(1))
df2 = df.join(df1, "_tmp", "inner").drop("_tmp")
df2.show()
But I got this error:
Py4JJavaError: An error occurred while calling o1571.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
I already tried using an alias and monotonically_increasing_id as the join column, but I've read here that I cannot trust monotonically_increasing_id as a merge column. I'm expecting:
+---+---+-----+------+
| a| b| c| value|
+---+---+-----+------+
| 1| 1|teste| uuid4|
| 2| 2|teste| uuid4|
| 3| 0|teste| uuid4|
| 4| 5|teste| uuid4|
+---+---+-----+------+
What's the correct approach in this case?
You can get duplicate UUIDs from uuid1 if you create more than 16384 of them within one 100 ns clock tick, and you shouldn't use uuid1 when you don't want to make your machine's MAC address visible. uuid4() instead generates a random UUID using a cryptographically secure random number generator.
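For illustration, a minimal plain-Python sketch of the difference (the printed values will differ on every run):
from uuid import uuid1, uuid4
print(uuid1())  # time-based; the last field is derived from the MAC address
print(uuid4())  # fully random, drawn from a secure random source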
Use monotonically_increasing_id() for unique, but not consecutive, numbers. It generates monotonically increasing 64-bit integers: the generated ids are guaranteed to be increasing and unique, but they are not guaranteed to be consecutive.
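A minimal sketch of how it is typically used (assuming df is any existing DataFrame; the column name "row_id" is just an example):
from pyspark.sql import functions as f
df_with_id = df.withColumn("row_id", f.monotonically_increasing_id())
df_with_id.show()  # ids are unique and increasing, but can jump between partitions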
I used two approaches. The first uses Spark SQL's built-in uuid() function; the second uses row_number as @Tetlanesh suggested, which requires creating an ID column to ensure that row_number counts every row of the Window. The first approach:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from uuid import uuid4
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
from pyspark.sql.functions import row_number
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([
    [1, 1, 'teste'],
    [2, 2, 'teste'],
    [3, 0, 'teste'],
    [4, 5, 'teste'],
], list('abc'))
df = df.alias("_tmp")
df.registerTempTable("_tmp")
df2 = self.spark_session.sql("select *, uuid() as uuid from _tmp")
df2.show()
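If you'd rather skip the temp view, the same built-in SQL uuid() function can be reached through expr() (this assumes Spark 2.3+, where uuid() is available):
df2 = df.withColumn("uuid", f.expr("uuid()"))
df2.show()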
The second approach uses a window, but it's not as efficient as the first one, because Window.orderBy without a partitionBy moves all rows to a single partition:
df = df.withColumn("_id", f.lit(1))
df = df.withColumn("_tmp", row_number().over(Window.orderBy('_id')))
uuids = [(str(uuid4()), 1) for _ in range(df.count())]
df1 = spark_session.createDataFrame(uuids, ['uuid', '_id'])
df1 = df1.withColumn("_tmp", row_number().over(Window.orderBy('_id')))
df2 = df.join(df1, "_tmp", "inner").drop("_id", "_tmp")
df2.show()
Both approaches output:
+---+---+-----+------+
| a| b| c| uuid|
+---+---+-----+------+
| 1| 1|teste| uuid4|
| 2| 2|teste| uuid4|
| 3| 0|teste| uuid4|
| 4| 5|teste| uuid4|
+---+---+-----+------+
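Either way, a quick sanity check that the generated UUIDs are in fact distinct:
assert df2.select("uuid").distinct().count() == df2.count()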