I tried to initialize a new column with random values in pandas. I did it this way:
import numpy as np

df['business_vertical'] = np.random.choice(['Retail', 'SME', 'Cor'], df.shape[0])
How do I do it in pyspark?
In PySpark, to add a new column with a constant value to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes the constant value you want to add and returns a Column type; if you want to add a NULL/None, use lit(None).
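For example, a minimal sketch (the DataFrame and column names here are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Any existing DataFrame works the same way.
df = spark.createDataFrame([("a",), ("b",)], ["id"])

# Add a constant column, and a NULL column.
# Casting lit(None) gives the column a usable type instead of NullType.
df = df.withColumn("country", lit("US")) \
       .withColumn("notes", lit(None).cast("string"))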
You can add multiple columns to a Spark DataFrame in several ways; if you want to add a known set of columns, the easiest is to chain withColumn() calls or to do it in a single select().
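A quick sketch of both styles (the column names are made up for illustration):

# Chaining withColumn() calls:
df2 = df.withColumn("source", lit("batch")).withColumn("version", lit(1))

# Equivalent with a single select():
df2 = df.select("*", lit("batch").alias("source"), lit(1).alias("version"))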
Just build an array of the candidate values and pick one at random for each row:
from pyspark.sql import functions as F

# Build an array of candidates, then index into it with a random int in [0, 3).
df = df.withColumn(
    "business_vertical",
    F.array(
        F.lit("Retail"),
        F.lit("SME"),
        F.lit("Cor"),
    ).getItem(
        (F.rand() * 3).cast("int")  # rand() is in [0, 1), so the index is 0, 1, or 2
    ),
)
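If you need reproducible output, rand() accepts an optional seed, so a seeded variant of the same expression looks like this:

df = df.withColumn(
    "business_vertical",
    F.array(F.lit("Retail"), F.lit("SME"), F.lit("Cor"))
     .getItem((F.rand(seed=42) * 3).cast("int")),
)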