In Spark 1.6.0 / Scala, is it possible to use collect_list("colC") or collect_set("colC") with .over(Window.partitionBy("colA").orderBy("colB"))?


How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

1 Answer

Suppose you have a DataFrame df like this:

+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |31  |
|2   |1   |32  |
|2   |2   |56  |
+----+----+----+

You can use Window functions as follows:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)

Result:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
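
The lists grow row by row because, when orderBy is present, the default window frame runs from the start of the partition up to the current row. As a rough illustration only, here is a plain Scala model of that behaviour with no Spark involved. RunningCollect and runningLists are hypothetical names, and the model assumes colB values are unique within each colA group, as they are in the example data.

```scala
// Plain Scala model (no Spark) of the default frame used when orderBy is present:
// each row sees the rows of its partition up to and including itself.
object RunningCollect {
  // (colA, colB, colC) rows copied from the example table
  val rows: Seq[(Int, Int, Int)] =
    Seq((1, 1, 23), (1, 2, 63), (1, 3, 31), (2, 1, 32), (2, 2, 56))

  def runningLists(data: Seq[(Int, Int, Int)]): Seq[(Int, Int, Int, List[Int])] =
    data.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, part) =>
      val ordered = part.sortBy(_._2)                  // orderBy("colB")
      val prefixes = ordered.map(_._3)
        .scanLeft(List.empty[Int])(_ :+ _).tail        // cumulative lists of colC
      ordered.zip(prefixes).map { case ((a, b, c), acc) => (a, b, c, acc) }
    }

  def main(args: Array[String]): Unit =
    runningLists(rows).foreach(println)
}
```

The fourth element of each tuple corresponds to the colD column in the table above.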

collect_set works the same way, except that it drops duplicate values and the order of elements within each set is not guaranteed, unlike with collect_list:

df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[63, 23]    |
|1   |3   |31  |[63, 31, 23]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[56, 32]    |
+----+----+----+------------+
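
If you need a stable element order from collect_set, one option is to sort each array afterwards with sort_array. This is only a sketch, assuming df is the example DataFrame above; w is an illustrative name. Note that it sorts by value (ascending), not by colB.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, sort_array}

// Assumes `df` is the example DataFrame with columns colA, colB, colC.
val w = Window.partitionBy("colA").orderBy("colB")

df
  .withColumn("colD", collect_set("colC").over(w)) // running set, arbitrary element order
  .withColumn("colD", sort_array(col("colD")))     // sort each array in ascending order
  .show(false)
```

The two withColumn steps are kept separate on purpose, so the window expression is not nested inside another function call.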

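If you remove orderBy, the frame covers the whole partition, so every row gets the complete list for its colA group:

df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)

The result would be (without orderBy, the order of elements inside each list is not guaranteed):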

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23, 63, 31]|
|1   |2   |63  |[23, 63, 31]|
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32, 56]    |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
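
If you want the complete list on every row but still ordered by colB, one option is to keep orderBy and widen the frame explicitly. This is a minimal sketch, assuming df is the example DataFrame above; fullFrame is an illustrative name. As far as I know, Spark 1.6 needs a HiveContext for window functions and for collect_list/collect_set, and it has no named constants for unbounded frames, so Long.MinValue and Long.MaxValue stand for "unbounded preceding" and "unbounded following".

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Assumes `df` is the example DataFrame with columns colA, colB, colC.
// The frame spans the entire partition while the rows stay ordered by colB.
val fullFrame = Window
  .partitionBy("colA")
  .orderBy("colB")
  .rowsBetween(Long.MinValue, Long.MaxValue)

df.withColumn("colD", collect_list("colC").over(fullFrame)).show(false)
```

Every row of a colA group should then carry the same list, collected in colB order.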

I hope the answer is helpful

answered Sep 21 '22 16:09

Ramesh Maharjan


Tags:

scala

apache-spark

apache-spark-sql

apache-spark-1.6

Dzmitry Haikov
