Can DataFrame joins in Spark preserve order?

I'm trying to join two DataFrames while retaining the row order of one of them.

From "Which operations preserve RDD order?", it seems (correct me if this is inaccurate, I'm new to Spark) that joins do not preserve order: the data lives in different partitions, so rows "arrive" at the final DataFrame in no specified order.

How could one perform a join of two DataFrames while preserving the order of one table?

E.g.,

+------+------+
| col1 | col2 |
+------+------+
| 0    | a    |
| 1    | b    |
+------+------+

joined with

+------+------+
| col2 | col3 |
+------+------+
| b    | x    |
| a    | y    |
+------+------+

on col2 should give

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 0    | a    | y    |
| 1    | b    | x    |
+------+------+------+

I've heard some things about using coalesce or repartition, but I'm not sure. Any suggestions/methods/insights are appreciated.

Edit: would this be analogous to having one reducer in MapReduce? If so, what would that look like in Spark?

asked Jun 28 '16 20:06

jest

1 Answer

No, a join can't preserve order. You can add an ID column with monotonically_increasing_id before the join and sort the joined data by that column afterwards.

answered Sep 19 '22 00:09

user6022341


Tags: dataframe, apache-spark, spark-dataframe
