I would like to add a column with a generated id to my data frame. I have tried:
uuidUdf = udf(lambda x: str(uuid.uuid4()), StringType())
df = df.withColumn("id", uuidUdf())
however, when I do this, nothing is written to my output directory. When I remove these lines, everything works fine so there must be some error but I don't see anything in the console.
I have tried using monotonically_increasing_id() instead of generating a UUID but in my testing, this produces many duplicates. I need a unique identifier (does not have to be a UUID specifically).
How can I do this?
In PySpark, to add a new column to DataFrame use lit() function by importing from pyspark. sql. functions import lit , lit() function takes a constant value you wanted to add and returns a Column type, if you wanted to add a NULL / None use lit(None) .
You can add multiple columns to Spark DataFrame in several ways if you wanted to add a known set of columns you can easily do by chaining withColumn() or on select(). However, sometimes you may need to add multiple columns after applying some transformations n that case you can use either map() or foldLeft().
If you would like to add a prefix or suffix to multiple columns in a pyspark dataframe, you could use a for loop and . withColumnRenamed(). You can amend sdf. columns as you see fit.
In PySpark, you can cast or change the DataFrame column data type using cast() function of Column class, in this article, I will be using withColumn(), selectExpr() , and SQL expression to cast the from String to Int (Integer Type), String to Boolean e.t.c using PySpark examples.
Please Try this:
import uuid
from pyspark.sql.functions import udf
uuidUdf= udf(lambda : str(uuid.uuid4()),StringType())
Df1 = Df.withColumn("id",uuidUdf())
Note: You should assign to new DF after adding new column. (Df1 = Df.withColumn(....)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With