 

Passing Array to Spark Lit function

Let's say I have a numpy array a that contains the numbers 1-10:
[1 2 3 4 5 6 7 8 9 10]

I also have a Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job. This doesn't work:

df = df.withColumn("NewColumn", F.lit(a)) 

Unsupported literal type class java.util.ArrayList

But this works:

df = df.withColumn("NewColumn", F.lit(a[0])) 

How can I do this?

Example DF before:

col1
a b c d e f g h i j

Expected result:

col1                  NewColumn
a b c d e f g h i j   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
asked Apr 06 '18 by A. R.

People also ask

What is lit() in PySpark?

The PySpark SQL function lit() is used to add a new column to a DataFrame by assigning a literal or constant value.
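For example, a minimal sketch (the column name and constant value here are purely illustrative):

from pyspark.sql import functions as F

# Every row receives the same constant value
df = df.withColumn("source", F.lit("batch_2018_04"))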

How do you define an array in PySpark?

Create PySpark ArrayType: You can create an instance of ArrayType using the ArrayType() class. It takes an elementType argument and an optional containsNull argument that specifies whether elements can be null (True by default). elementType should be a PySpark type that extends the DataType class.
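A minimal sketch of declaring an array column in a schema (the field names here are hypothetical):

from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("name", T.StringType(), True),
    T.StructField("scores", T.ArrayType(T.IntegerType(), containsNull=False), True),
])

# Each row carries a string plus an array of integers
df = spark.createDataFrame([("a", [1, 2, 3])], schema)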


1 Answer

Use a list comprehension inside Spark's array() function

from pyspark.sql import functions as F

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = spark.createDataFrame([['a b c d e f g h i j '],], ['col1'])

# Wrap each value in F.lit and combine them into a single array column
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a]))

df.show(truncate=False)
df.printSchema()
#  +--------------------+-------------------------------+
#  |col1                |NewColumn                      |
#  +--------------------+-------------------------------+
#  |a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
#  +--------------------+-------------------------------+
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = false)
#   |    |-- element: integer (containsNull = false)

@pault commented (Python 2.7):

You can hide the loop using map:
df.withColumn("NewColumn", F.array(map(F.lit, a)))

@abegehr added the Python 3 version:

df.withColumn("NewColumn", F.array(*map(F.lit, a)))

(In Python 3, map returns an iterator, so the * is needed to unpack it into separate column arguments for F.array.)

Spark's udf

from pyspark.sql import functions as F, types as T

# Defining the UDF: it simply returns the Python list a
def arrayUdf():
    return a

callArrayUdf = F.udf(arrayUdf, T.ArrayType(T.IntegerType()))

# Calling the UDF
df = df.withColumn("NewColumn", callArrayUdf())

Output is the same.
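Note that the F.array of literals is resolved as a plain column expression on the JVM side, whereas the UDF route calls back into Python for every row; for a constant array the first approach is generally the lighter option.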

answered Sep 16 '22 by Ramesh Maharjan