 

Spark dataframe add new column with random data

I want to add a new column to the DataFrame whose values are either 0 or 1. I used the randint function, imported as:

from random import randint

df1 = df.withColumn('isVal',randint(0,1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

How can I use a custom function, or randint, to generate random values for the column?

Dilma asked Jan 04 '17



People also ask

How do I create a new column in Spark data frame?

In PySpark, to add a new constant column to a DataFrame, use the lit() function: from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column type; to add a NULL / None column, use lit(None).

How do you generate 10 random numbers in Pyspark?

The randint() method generates a whole number (integer): randint(0, 50) generates a random number between 0 and 50, with both endpoints included. To generate random integers between 0 and 9, you can use randrange(0, 10), whose upper bound is exclusive. Change the parameters of randint() to generate a number in any range, such as 1 to 10.
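The difference between the two inclusive/exclusive conventions can be checked directly with Python's standard random module:

```python
from random import randint, randrange

# randint(a, b) includes both endpoints, so 0 and 50 can both appear
values = [randint(0, 50) for _ in range(100)]

# randrange(start, stop) excludes stop, so digits 0-9 come from randrange(0, 10)
digits = [randrange(0, 10) for _ in range(100)]
```

Note that these run on the driver and produce plain Python ints, which is exactly why they cannot be passed to withColumn() directly.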

How do I add multiple columns in Spark?

You can add multiple columns to a Spark DataFrame in several ways. If you want to add a known set of columns, you can easily do so by chaining withColumn() calls or by using a single select(). However, if you need to add multiple columns after applying some transformations, you can use either map() or foldLeft() instead.


2 Answers

You are using Python's built-in random module. randint(0, 1) is evaluated once on the driver and returns a single constant integer, not a Column.

As the error message shows, withColumn expects a Column, i.e. an expression that is evaluated for every row.

To do this, use Spark's own random function:

from pyspark.sql.functions import rand,when
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))

This gives 0 or 1 with equal probability, since rand() draws uniformly from [0.0, 1.0). See the functions documentation for more options: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

Assaf Mendelson answered Oct 16 '22



I had a similar problem, needing integer values from 5 to 10. I used the rand() function from pyspark.sql.functions:

from pyspark.sql.functions import rand, round
df1 = df.withColumn("random", round(rand() * (10 - 5) + 5, 0))
gogogod answered Oct 16 '22
