I would like to sum (or perform other aggregate functions) over an array column using Spark SQL.
I have a table like this:
+-------+-------+-------------------------+
|dept_id|dept_nm|              emp_details|
+-------+-------+-------------------------+
|     10|Finance|[100, 200, 300, 400, 500]|
|     20|     IT|        [10, 20, 50, 100]|
+-------+-------+-------------------------+
I would like to sum the values of the emp_details column.
Expected query:
sqlContext.sql("select sum(emp_details) from mytable").show
Expected result:
1500
180
I should also be able to sum over a range of elements, e.g.:
sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show
Expected result:
600
80
When I run sum on the array column, it fails (as expected) with an error saying that sum expects its argument to be of numeric type, not array type.
I think we need to create a UDF for this, but how?
Will I face any performance hit with UDFs? And is there any other solution apart from the UDF one?
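For reference, a UDF-based version might look like the minimal sketch below, assuming a Spark 2.x+ SparkSession named spark and the table registered as a temp view called mytable. Keep in mind that UDFs are a black box to the Catalyst optimizer, so the built-in approaches in the answers below are generally preferable.
// Hypothetical UDF sketch: register a function that sums an array<int> column
spark.udf.register("sum_array", (xs: Seq[Int]) => xs.sum)
spark.sql("select dept_id, dept_nm, sum_array(emp_details) as sum from mytable").show
// should yield 1500 for Finance and 180 for IT, as in the expected result above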
As of Spark 2.4, Spark SQL supports higher-order functions that are designed to manipulate complex data structures, including arrays.
The "modern" solution would be as follows:
scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details |
+-------+-------+-------------------------+
|10 |Finance|[100, 200, 300, 400, 500]|
|20 |IT |[10, 20, 50, 100] |
+-------+-------+-------------------------+
input.createOrReplaceTempView("mytable")
val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
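The same higher-order function can also be used from the DataFrame API. Here is a minimal sketch, assuming Spark 2.4+ where aggregate is available through expr (a native aggregate Column function arrived later, in Spark 3.0):
import org.apache.spark.sql.functions.expr
// aggregate(array, initial, merge) folds the array elements into a single value
input.withColumn("sum", expr("aggregate(emp_details, 0, (acc, x) -> acc + x)")).
  select("dept_id", "dept_nm", "sum").
  show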
DISCLAIMER: I would not recommend the following approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it into the JVM (from memory regions that are managed by Spark outside the JVM). That will inevitably lead to more frequent GCs and hence worse performance.
One solution would be the Dataset-based approach below, where the combination of Spark SQL and Scala can show its power.
scala> val inventory = Seq(
| (10, "Finance", Seq(100, 200, 300, 400, 500)),
| (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]
// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
     |   map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
     |   toDF("dept_id", "dept_nm", "sum").
     |   show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
I'm leaving the slice part as an exercise, since it's equally simple (a sketch follows below).
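For completeness, here is a minimal sketch of the slice variant from the question, again in SQL against the same temp view. Note that slice in Spark SQL is 1-based, so the first three elements are slice(emp_details, 1, 3):
val sliceSql = "select dept_id, dept_nm, aggregate(slice(emp_details, 1, 3), 0, (acc, x) -> acc + x) as sum from mytable"
sql(sliceSql).show
// should yield 600 for Finance and 80 for IT, matching the expected result in the question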