Does collect_list() maintain relative ordering of rows?

Tags:

scala

apache-spark

apache-spark-sql

Imagine that I have the following DataFrame df:

+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+

Imagine that I run:

df.groupBy("id")
  .agg(collect_list($"featureName").as("idx"),
       collect_list($"featureValue").as("val"))

Am I GUARANTEED that "idx" and "val" will be aggregated in the same relative order? For example:

GOOD                   GOOD                   BAD
+---+------+------+    +---+------+------+    +---+------+------+
| id|   idx|   val|    | id|   idx|   val|    | id|   idx|   val|
+---+------+------+    +---+------+------+    +---+------+------+
|id3|   [d]|   [9]|    |id3|   [d]|   [9]|    |id3|   [d]|   [9]|
|id1|[a, b]|[3, 4]|    |id1|[b, a]|[4, 3]|    |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]|    |id2|[c, a]|[5, 2]|    |id2|[a, c]|[5, 2]|
+---+------+------+    +---+------+------+    +---+------+------+

NOTE: The third result is BAD because, for id1, [a, b] should have been paired with [3, 4] (and not [4, 3]). The same goes for id2.

asked Jun 09 '17 01:06

Marsellus Wallace

1 Answer

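I think you can rely on "their relative order": both collect_list calls are part of the same aggregation, so Spark feeds each group's rows to them one by one in the same order (and usually does not re-order rows unless explicitly needed). Note that this only concerns the pairing between "idx" and "val". The order of the elements within each list can still vary between runs, because the Spark documentation describes collect_list as non-deterministic after a shuffle.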

If you are concerned about the order, merge the two columns into one with the struct function before doing groupBy, so that each feature name travels together with its value.

struct(colName: String, colNames: String*): Column

Creates a new struct column that composes multiple input columns.
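The struct suggestion could look roughly like the sketch below. It rebuilds the sample data from the question in a local SparkSession, collects one list of (featureName, featureValue) structs per id, and then pulls the two fields back out as parallel arrays. The object name StructCollectSketch, the helper names sampleDf and pairedLists, and the intermediate column name "pairs" are made up for illustration. This is one possible arrangement, not necessarily what the answer had in mind.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical helper object; names are illustrative only.
object StructCollectSketch {

  // Rebuilds the sample DataFrame shown in the question.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("id1", "a", 3),
      ("id1", "b", 4),
      ("id2", "a", 2),
      ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")
  }

  // Collects (featureName, featureValue) as a single struct per row, so a name
  // can never be separated from its value, then splits the collected array of
  // structs back into two arrays that are aligned element by element.
  def pairedLists(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(struct(col("featureName"), col("featureValue"))).as("pairs"))
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),  // field access on array<struct> yields an array
        col("pairs.featureValue").as("val")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                    // assumption: local run for demonstration
      .appName("struct-collect-sketch")
      .getOrCreate()

    pairedLists(sampleDf(spark)).show(false)
    spark.stop()
  }
}
```

With this shape the pairing no longer depends on two separate aggregates visiting rows in the same order. The order of the pairs inside each list is still whatever order Spark happened to process the rows in.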

You could also use the monotonically_increasing_id function to number the records and pair that number with the other columns (perhaps using struct):

monotonically_increasing_id(): Column

A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
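One way to combine the two ideas is sketched below: number the rows with monotonically_increasing_id, put that number first in a struct with the other columns, and sort the collected array so that each group's elements come out in ascending ID order. The use of sort_array is my own addition (the answer does not mention it), and the names RowIdCollectSketch, sampleDf, orderedLists, "rowId" and "rows" are hypothetical. The IDs reflect the row order at the point where the column is computed, so this assumes the column is added while the rows are still in the order you care about.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, monotonically_increasing_id, sort_array, struct}

// Hypothetical helper object; names are illustrative only.
object RowIdCollectSketch {

  // Rebuilds the sample DataFrame shown in the question.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("id1", "a", 3),
      ("id1", "b", 4),
      ("id2", "a", 2),
      ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")
  }

  def orderedLists(df: DataFrame): DataFrame =
    df
      // Number the rows before grouping; IDs are unique and increasing, not consecutive.
      .withColumn("rowId", monotonically_increasing_id())
      .groupBy("id")
      // rowId is the first struct field, so sort_array orders the structs by it.
      .agg(sort_array(
        collect_list(struct(col("rowId"), col("featureName"), col("featureValue")))
      ).as("rows"))
      .select(
        col("id"),
        col("rows.featureName").as("idx"),
        col("rows.featureValue").as("val")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                    // assumption: local run for demonstration
      .appName("row-id-collect-sketch")
      .getOrCreate()

    orderedLists(sampleDf(spark)).show(false)
    spark.stop()
  }
}
```

Compared with collecting the struct alone, the extra rowId field makes the order inside each list reproducible as well, at the cost of a sort per group.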

answered Oct 13 '22 23:10

Jacek Laskowski
