
Can DataFrame joins in Spark preserve order?

I'm trying to join two DataFrames while retaining the row order of one of them.

From "Which operations preserve RDD order?", it seems (correct me if I'm wrong, since I'm new to Spark) that joins do not preserve order: the data sits in different partitions, so rows "arrive" at the final DataFrame in no specified order.

How could one perform a join of two DataFrames while preserving the order of one table?

E.g.,

+------------+---------+
|    col1    |  col2   |
+------------+---------+
|     0      |    a    |
|     1      |    b    |
+------------+---------+

joined with

+------------+---------+
|    col2    |  col3   |
+------------+---------+
|     b      |    x    |
|     a      |    y    |
+------------+---------+

on col2 should give

+------------+---------+----------+
|    col1    |  col2   |   col3   |
+------------+---------+----------+
|     0      |    a    |    y     |
|     1      |    b    |    x     |
+------------+---------+----------+

I've heard some things about using coalesce or repartition, but I'm not sure. Any suggestions/methods/insights are appreciated.

Edit: would this be analogous to having one reducer in MapReduce? If so, what would that look like in Spark?

jest jest asked Jun 28 '16 20:06



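1 Answer

It can't. Spark does not guarantee the row order of a join result. You can add an index column with monotonically_increasing_id before the join and sort by that column after the join to restore the original order.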

user6022341 answered Sep 19 '22 00:09