Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark streaming with python: how to add a UUID column?

I would like to add a column with a generated id to my data frame. I have tried:

uuidUdf = udf(lambda x: str(uuid.uuid4()), StringType())
df = df.withColumn("id", uuidUdf())

however, when I do this, nothing is written to my output directory. When I remove these lines, everything works fine so there must be some error but I don't see anything in the console.

I have tried using monotonically_increasing_id() instead of generating a UUID but in my testing, this produces many duplicates. I need a unique identifier (does not have to be a UUID specifically).

How can I do this?

like image 573
bea Avatar asked Apr 11 '18 22:04

bea


People also ask

How do I add a column to a spark in Python?

In PySpark, to add a new column to DataFrame use lit() function by importing from pyspark. sql. functions import lit , lit() function takes a constant value you wanted to add and returns a Column type, if you wanted to add a NULL / None use lit(None) .

How do I add columns to spark?

You can add multiple columns to Spark DataFrame in several ways if you wanted to add a known set of columns you can easily do by chaining withColumn() or on select(). However, sometimes you may need to add multiple columns after applying some transformations n that case you can use either map() or foldLeft().

How do you add a prefix to all columns in PySpark?

If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and . withColumnRenamed(). You can amend sdf. columns as you see fit.

How do I cast a column type in PySpark?

In PySpark, you can cast or change the DataFrame column data type using cast() function of Column class, in this article, I will be using withColumn(), selectExpr() , and SQL expression to cast the from String to Int (Integer Type), String to Boolean e.t.c using PySpark examples.


1 Answers

Please Try this:

import uuid
from pyspark.sql.functions import udf

uuidUdf= udf(lambda : str(uuid.uuid4()),StringType())
Df1 = Df.withColumn("id",uuidUdf())

Note: You should assign to new DF after adding new column. (Df1 = Df.withColumn(....)

like image 52
Atanu chatterjee Avatar answered Oct 11 '22 18:10

Atanu chatterjee