 

Passing Array to Spark Lit function

Let's say I have a numpy array a that contains the numbers 1-10:
[1 2 3 4 5 6 7 8 9 10]

I also have a Spark dataframe to which I want to add my numpy array a. I figure that a column of literals will do the job. This doesn't work:

df = df.withColumn("NewColumn", F.lit(a)) 

Unsupported literal type class java.util.ArrayList

But this works:

df = df.withColumn("NewColumn", F.lit(a[0])) 

How can I do this?

Example DF before:

col1
a b c d e f g h i j

Expected result:

col1                  NewColumn
a b c d e f g h i j   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
asked Apr 06 '18 by A. R.

People also ask

What is lit() in PySpark?

The PySpark SQL function lit() is used to add a new column to a DataFrame by assigning a literal or constant value.
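For example, a minimal sketch (the column name and constant value here are purely illustrative):

from pyspark.sql import functions as F

# Every row receives the same constant value
df = df.withColumn("source", F.lit("batch_2018_04"))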

How do you define an array in PySpark?

Create PySpark ArrayType: You can create an instance of ArrayType using the ArrayType() class. It takes an elementType argument and an optional containsNull argument that specifies whether elements can be null (True by default). elementType should be a PySpark type that extends the DataType class.
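A minimal sketch of declaring an array column in a schema (the field names here are hypothetical):

from pyspark.sql import types as T

schema = T.StructType([
    T.StructField("name", T.StringType(), True),
    T.StructField("scores", T.ArrayType(T.IntegerType(), containsNull=False), True),
])

# Each row carries a string plus an array of integers
df = spark.createDataFrame([("a", [1, 2, 3])], schema)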


1 Answer

Use a list comprehension inside Spark's array() function

from pyspark.sql import functions as F

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df = spark.createDataFrame([['a b c d e f g h i j '],], ['col1'])

# Wrap each value in F.lit and combine them into a single array column
df = df.withColumn("NewColumn", F.array([F.lit(x) for x in a]))

df.show(truncate=False)
df.printSchema()
#  +--------------------+-------------------------------+
#  |col1                |NewColumn                      |
#  +--------------------+-------------------------------+
#  |a b c d e f g h i j |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|
#  +--------------------+-------------------------------+
#  root
#   |-- col1: string (nullable = true)
#   |-- NewColumn: array (nullable = false)
#   |    |-- element: integer (containsNull = false)

@pault commented (Python 2.7):

You can hide the loop using map:
df.withColumn("NewColumn", F.array(map(F.lit, a)))

@abegehr added the Python 3 version:

df.withColumn("NewColumn", F.array(*map(F.lit, a)))

(In Python 3, map returns an iterator, so the * is needed to unpack it into separate column arguments for F.array.)

Spark's udf

from pyspark.sql import functions as F, types as T

# Defining the UDF: it simply returns the Python list a
def arrayUdf():
    return a

callArrayUdf = F.udf(arrayUdf, T.ArrayType(T.IntegerType()))

# Calling the UDF
df = df.withColumn("NewColumn", callArrayUdf())

Output is the same.
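Note that the F.array of literals is resolved as a plain column expression on the JVM side, whereas the UDF route calls back into Python for every row; for a constant array the first approach is generally the lighter option.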

answered Sep 16 '22 by Ramesh Maharjan