Spark Streaming with Python: how to add a UUID column?

Tags: python, uuid, apache-spark, pyspark

I would like to add a column with a generated id to my data frame. I have tried:

uuidUdf = udf(lambda x: str(uuid.uuid4()), StringType())
df = df.withColumn("id", uuidUdf())

However, when I do this, nothing is written to my output directory. When I remove these lines, everything works fine, so there must be an error somewhere, but I don't see anything in the console.

I have also tried using monotonically_increasing_id() instead of generating a UUID, but in my testing this produces many duplicates. I need a unique identifier (it does not have to be a UUID specifically).

How can I do this?


asked Apr 11 '18 22:04

bea

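1 Answer

Please try this. The lambda in your UDF takes a parameter x, but uuidUdf() is called with no arguments, so define the lambda without parameters: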

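import uuid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType  # needed for the UDF return type

# Zero-argument UDF: each row gets a freshly generated UUID string
uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
# withColumn returns a new DataFrame, so keep the result
Df1 = Df.withColumn("id", uuidUdf())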

Note: withColumn does not modify the existing DataFrame; it returns a new one, so assign the result to a variable (Df1 = Df.withColumn(....)).
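
The difference between the two lambdas can be checked in plain Python, without Spark. The sketch below is only an illustration: the names with_param, no_param and call_without_args are hypothetical helpers, not part of PySpark. It assumes that Spark invokes the wrapped function with one argument per column passed to the UDF, so uuidUdf() with no columns amounts to calling the function with no arguments.

```python
import uuid

# UDF body from the question: declares a parameter x.
with_param = lambda x: str(uuid.uuid4())

# UDF body from this answer: declares no parameters.
no_param = lambda: str(uuid.uuid4())


def call_without_args(func):
    """Call func with no arguments and return its result or the TypeError text.

    Hypothetical helper that mimics invoking a UDF with no columns.
    """
    try:
        return func()
    except TypeError as exc:
        return "TypeError: {}".format(exc)


question_result = call_without_args(with_param)  # fails: x is missing
answer_result = call_without_args(no_param)      # a random UUID string

print(question_result)
print(answer_result)
```

In a streaming job this error would be raised on the workers while the query runs, which may be why nothing shows up in the console. That is a guess, not something confirmed from the question.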

answered Oct 11 '22 18:10

Atanu chatterjee
