
SparkSQL DataFrame order by across partitions

I'm using Spark SQL to run a query over my dataset. The result of the query is pretty small, but it is still partitioned.

I would like to coalesce the resulting DataFrame and order the rows by a column. I tried

DataFrame result = sparkSQLContext.sql("my sql").coalesce(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");

I also tried

DataFrame result = sparkSQLContext.sql("my sql").repartition(1).orderBy("col1");
result.toJSON().saveAsTextFile("output");

In both cases the output file is ordered in chunks (i.e. each partition is ordered internally, but the DataFrame is not ordered as a whole). For example, instead of

1, value
2, value
4, value
4, value
5, value
5, value
...

I get

2, value
4, value
5, value
-----------> partition boundary
1, value
4, value
5, value
  1. What is the correct way to get an absolute ordering of my query result?
  2. Why isn't the data frame being coalesced into a single partition?
asked by fo_x86 on Jul 31 '15

People also ask

How do I sort by multiple columns in Pyspark?

We can use the orderBy() function of the DataFrame to sort on multiple columns, using asc for ascending and desc for descending order on each column.
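As a rough illustration (plain Python, not Spark), multi-column ordering with mixed directions boils down to a composite sort: each row is ranked by the major key first, with the minor key breaking ties in its own direction.

```python
# Illustration only: mimic ordering by (col1 ascending, col2 descending).
rows = [(2, "b"), (1, "a"), (2, "a"), (1, "c")]

# Python's sorted() is stable, so we can sort in two passes,
# minor key (descending) first, then major key (ascending).
by_col2_desc = sorted(rows, key=lambda r: r[1], reverse=True)
ordered = sorted(by_col2_desc, key=lambda r: r[0])

print(ordered)  # col1 ascending; within equal col1, col2 descending
```

The same two-direction sort is what asc("col1")/desc("col2") express declaratively in Spark.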

Are Spark Dataframes ordered?

Not inherently. The Spark DataFrame/Dataset class provides a sort() function to sort on one or more columns; by default, it sorts in ascending order.

What is the difference between group by and orderBy in Spark?

The GROUP BY clause groups rows that share the same value in a specific column, and it is used together with an aggregate function. The ORDER BY clause, on the other hand, sorts the result and shows it in ascending or descending order.
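A quick plain-Python analogy (not SQL itself): GROUP BY collapses rows per key through an aggregate, while ORDER BY keeps every row and only reorders them.

```python
from itertools import groupby

rows = [("a", 3), ("b", 1), ("a", 2), ("b", 4)]

# GROUP BY analogue: one output row per key, via an aggregate (sum here).
# groupby() needs its input sorted by the grouping key.
grouped = {k: sum(v for _, v in g)
           for k, g in groupby(sorted(rows), key=lambda r: r[0])}

# ORDER BY analogue: same rows, just sorted (by value here).
ordered = sorted(rows, key=lambda r: r[1])

print(grouped)
print(ordered)
```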

Which is faster repartition or coalesce?

repartition redistributes the data evenly, but at the cost of a shuffle. coalesce works much faster when you reduce the number of partitions because it sticks input partitions together without a shuffle, but it doesn't guarantee uniform data distribution. coalesce is identical to a repartition when you increase the number of partitions.
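A toy model of the difference (plain Python, not Spark's real implementation, and the function bodies here are made up for illustration): coalesce glues whole input partitions together, while repartition reshuffles individual rows across the target partitions.

```python
# Toy model only: Spark's real coalesce/repartition are far more involved.
partitions = [[1, 2], [3], [4, 5, 6], [7]]

def coalesce(parts, n):
    """Glue whole input partitions together: no shuffle, sizes may be uneven."""
    out = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        out[i % n].extend(p)  # round-robin of whole partitions (an assumption)
    return out

def repartition(parts, n):
    """Redistribute individual rows evenly: this full exchange is the shuffle."""
    rows = [r for p in parts for r in p]
    out = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        out[i % n].append(r)
    return out

print(coalesce(partitions, 2))     # whole partitions stuck together, uneven sizes
print(repartition(partitions, 2))  # rows spread evenly across outputs
```

The toy coalesce never moves an individual row between workers, which is why it is cheaper; the toy repartition touches every row, which is why it shuffles.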


1 Answer

I want to mention a couple of things here.

1. The source code shows that the orderBy statement internally calls the sorting API with global ordering set to true, so the lack of ordering at the level of the output suggests that the ordering was lost while writing to the target. My point is that a call to orderBy always requires a global order.

2. Using a drastic coalesce, as in forcing a single partition in your case, can be really dangerous. I would recommend against it. The source code suggests that calling coalesce(1) can potentially cause upstream transformations to run on a single partition, which would be brutal performance-wise.

3. You seem to expect the orderBy statement to be executed within a single partition. I don't agree with that expectation; that would make Spark a really silly distributed framework.
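For intuition, a plain-Python sketch (not Spark): sorting each partition independently and then concatenating reproduces exactly the "ordered in chunks" output the question describes, whereas a true global ordering requires merging across partitions.

```python
# Two partitions of a small query result, echoing the question's example.
parts = [[2, 5, 4], [4, 1, 5]]

# Per-partition sort, then naive concatenation: ordered only within chunks.
chunked = [v for p in parts for v in sorted(p)]

# A global sort merges across partition boundaries.
globally = sorted(v for p in parts for v in p)

print(chunked)   # each chunk ascending, but not ordered as a whole
print(globally)  # one fully ascending sequence
```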

Community, please let me know if you agree or disagree with these statements.

How are you collecting data from the output, anyway?

Maybe the output actually contains sorted data, but the transformations/actions you performed in order to read from the output are responsible for the lost order.

answered by JavaPlanet on Oct 12 '22