
How to create a new column with random values in PySpark?

In pandas, I initialize a new column with random values like this:

import numpy as np

df['business_vertical'] = np.random.choice(['Retail', 'SME', 'Cor'], df.shape[0])

How do I do it in pyspark?

asked Nov 28 '18 10:11 by subash poudel

People also ask

How do you create a new column in PySpark?

In PySpark, to add a new column with a constant value to a DataFrame, use the lit() function, imported with from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column type; to add a NULL/None column, use lit(None).

How do you add multiple columns in PySpark?

You can add multiple columns to a Spark DataFrame in several ways. If you want to add a known set of columns, the easiest approach is to chain withColumn() calls or to do it in a single select().


1 Answer

Just build an array of the possible values and pick one at random per row:

from pyspark.sql import functions as F

df = df.withColumn(
    "business_vertical",
    F.array(
        F.lit("Retail"),
        F.lit("SME"),
        F.lit("Cor"),
    ).getItem(
        (F.rand() * 3).cast("int")  # random index 0, 1 or 2
    ),
)
answered Sep 27 '22 23:09 by Steven