
Does collect_list() maintain relative ordering of rows?

Imagine that I have the following DataFrame df:

+---+-----------+------------+
| id|featureName|featureValue|
+---+-----------+------------+
|id1|          a|           3|
|id1|          b|           4|
|id2|          a|           2|
|id2|          c|           5|
|id3|          d|           9|
+---+-----------+------------+

Imagine that I run:

import org.apache.spark.sql.functions.collect_list  // $"..." also needs spark.implicits._

df.groupBy("id")
  .agg(collect_list($"featureName").as("idx"),
       collect_list($"featureValue").as("val"))

Am I GUARANTEED that "idx" and "val" will be aggregated in the same relative order, so that their elements stay aligned? i.e.

GOOD                   GOOD                   BAD
+---+------+------+    +---+------+------+    +---+------+------+
| id|   idx|   val|    | id|   idx|   val|    | id|   idx|   val|
+---+------+------+    +---+------+------+    +---+------+------+
|id3|   [d]|   [9]|    |id3|   [d]|   [9]|    |id3|   [d]|   [9]|
|id1|[a, b]|[3, 4]|    |id1|[b, a]|[4, 3]|    |id1|[a, b]|[4, 3]|
|id2|[a, c]|[2, 5]|    |id2|[c, a]|[5, 2]|    |id2|[a, c]|[5, 2]|
+---+------+------+    +---+------+------+    +---+------+------+

NOTE: The third result is BAD because, for id1, [a, b] should have been paired with [3, 4] (and not [4, 3]). The same applies to id2.

Marsellus Wallace asked Jun 09 '17 01:06




1 Answer

I think you can rely on "their relative order", since Spark goes over the rows one by one, in order, and usually does not reorder them unless explicitly required.

If you are concerned about the order, merge the two columns with the struct function before doing the groupBy, so that each feature name travels together with its value.

struct(colName: String, colNames: String*): Column
Creates a new struct column that composes multiple input columns.
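As a rough sketch of the struct approach, assuming the column names from the question's DataFrame: the object name StructPairing, the helper pairedLists, and the intermediate column "pairs" are made up for illustration, and the local SparkSession is only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object StructPairing {
  // Hypothetical helper: collect (featureName, featureValue) structs per id,
  // then split the collected array back into two aligned arrays.
  def pairedLists(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(collect_list(struct(col("featureName"), col("featureValue"))).as("pairs"))
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),   // field access on array<struct> yields an array
        col("pairs.featureValue").as("val"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("struct-pairing")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the DataFrame in the question.
    val df = Seq(
      ("id1", "a", 3), ("id1", "b", 4),
      ("id2", "a", 2), ("id2", "c", 5),
      ("id3", "d", 9)
    ).toDF("id", "featureName", "featureValue")

    pairedLists(df).show()
    spark.stop()
  }
}
```

Because each name and value are collected as one struct, the two output arrays are derived from the same collected list and cannot drift apart. The order of the pairs within each list is still whatever order Spark happened to process the rows in.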

You could also use monotonically_increasing_id function to number records and use it to pair with the other columns (perhaps using struct):

monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
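One way this could look, as a sketch rather than a definitive recipe: number the rows first, collect the number together with the other columns in a struct, and sort the collected array by that number. The names OrderedPairing, orderedLists, "rowId" and "pairs" are hypothetical. The generated id reflects the order of the rows at the point where it is computed, so this sketch assumes the DataFrame is already in the desired order at that point.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, monotonically_increasing_id, sort_array, struct}

object OrderedPairing {
  // Hypothetical helper: tag each row with an increasing id, collect
  // (rowId, featureName, featureValue) structs per id, and sort each
  // collected array. Structs compare field by field, so rowId (the first
  // field) decides the order.
  def orderedLists(df: DataFrame): DataFrame =
    df.withColumn("rowId", monotonically_increasing_id())
      .groupBy("id")
      .agg(
        sort_array(
          collect_list(struct(col("rowId"), col("featureName"), col("featureValue")))
        ).as("pairs"))
      .select(
        col("id"),
        col("pairs.featureName").as("idx"),   // aligned with "val" by construction
        col("pairs.featureValue").as("val"))
}
```

Calling OrderedPairing.orderedLists(df) on the question's DataFrame returns one row per id with "idx" and "val" aligned. Sorting by rowId makes the order inside each list independent of how the aggregation happened to visit the rows.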

Jacek Laskowski answered Oct 13 '22 23:10
