What is the difference between sort and orderBy on a Spark DataFrame?
scala> zips.printSchema
root
 |-- _id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- loc: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- pop: long (nullable = true)
 |-- state: string (nullable = true)
The commands below produce the same result:
zips.sort(desc("pop")).show
zips.orderBy(desc("pop")).show
Difference between SORT BY and ORDER BY: "ORDER BY" guarantees a total order in the output, while "SORT BY" only guarantees ordering of the rows within a reducer. If there is more than one reducer, "SORT BY" may give a partially ordered final result.
In Spark, you can use either the sort() or the orderBy() function of a DataFrame/Dataset to sort in ascending or descending order, based on a single column or multiple columns. You can also sort using the Spark SQL sorting functions. In this article, I will explain all these different ways using Scala examples.
PySpark orderBy is a Spark sorting function used to sort a DataFrame/RDD in the PySpark framework. It sorts a PySpark DataFrame by one or more columns. The desc method orders the elements in descending order.
Sorting in Spark is a multiphase process which requires shuffling:
- the input RDD is sampled, and this sample is used to compute boundaries for each output partition (sample followed by collect);
- the input RDD is partitioned using a RangePartitioner with the boundaries computed in the first step (partitionBy);
- each output partition is then sorted locally, so that concatenating the partitions in order yields a totally ordered result.
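The phases above can be sketched in plain Python (no Spark required). This is only an illustrative model of the range-partitioning idea; `total_sort` and the boundary computation are made-up names, not Spark APIs:

```python
import random

def total_sort(partitions):
    """Model of Spark's total sort over partitioned data.

    1. Sample the input and compute range boundaries, one fewer than
       the number of output partitions (mirrors sample + collect).
    2. Route every record to the partition whose range contains it
       (mirrors the RangePartitioner / partitionBy step).
    3. Sort each output partition locally; the partitions then
       concatenate into a globally sorted sequence.
    """
    data = [x for part in partitions for x in part]
    n_parts = len(partitions)
    sample = sorted(random.sample(data, min(len(data), 6)))
    step = max(1, len(sample) // n_parts)
    boundaries = sample[step::step][: n_parts - 1]
    out = [[] for _ in range(n_parts)]
    for x in data:
        idx = sum(x > b for b in boundaries)  # which range x falls into
        out[idx].append(x)
    return [sorted(p) for p in out]

parts = total_sort([[16, 10, 13, 19], [11, 17, 14, 20]])
flat = [x for p in parts for x in p]
assert flat == sorted(flat)  # partitions concatenate to a total order
```

Because every value in partition i is less than or equal to every value in partition i+1, sorting each partition independently is enough to give a global order without funnelling all the data through one machine.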
OrderBy is just an alias for the sort function.
From the Spark documentation:
/**
* Returns a new Dataset sorted by the given expressions.
* This is an alias of the `sort` function.
*
* @group typedrel
* @since 2.0.0
*/
@scala.annotation.varargs
def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols : _*)
The SQL clauses SORT BY and ORDER BY, however, are NOT the same.
The SORT BY clause is used to return the result rows sorted within each partition in the user-specified order. When there is more than one partition, SORT BY may return a result that is only partially ordered.
Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html
The ORDER BY clause is used to return the result rows in a sorted manner in the user-specified order. Unlike the SORT BY clause, this clause guarantees a total order in the output.
Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html
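The difference between the two clauses can be modeled in plain Python. The two-partition data mirrors the (Id, Salary) example further below; `sort_by` and `order_by` are illustrative function names standing in for the SQL clauses, not Spark APIs:

```python
# Two partitions of (Id, Salary) rows, as two reducers would hold them.
partitions = [
    [(16, 5000), (10, 3000), (13, 2600), (19, 1800)],
    [(11, 4000), (17, 3100), (14, 2500), (20, 2000)],
]

def sort_by(parts):
    """SORT BY: each partition is sorted independently, so the
    concatenated output is only partially ordered."""
    return [row for part in parts for row in sorted(part)]

def order_by(parts):
    """ORDER BY: all rows are merged into one ordered sequence,
    guaranteeing a total order (at the cost of a full shuffle)."""
    return sorted(row for part in parts for row in part)

print(sort_by(partitions))   # Ids 10,13,16,19 then 11,14,17,20 -- not total
print(order_by(partitions))  # Ids 10,11,13,14,16,17,19,20 -- total order
```

SORT BY's output is cheap because nothing crosses partition boundaries, but a consumer that needs a global order must use ORDER BY.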
Note that in the DataFrame API, sort() and orderBy() are aliases, so both perform a total sort; the partition-local behavior described above belongs to the SQL SORT BY clause, whose DataFrame counterpart is sortWithinPartitions(). sortWithinPartitions() sorts the rows inside each individual partition by the given columns and does not guarantee the order of the overall output, but it avoids a shuffle and is a lightweight operation. A total sort (sort()/orderBy()/ORDER BY) happens in two phases: the data is first range-partitioned by the sort key, which involves heavy shuffling and makes it a costly operation, and each resulting partition is then sorted locally, which together yields a global order without bringing all the data onto a single executor.
Here is an example.
Preparing the data:
>>> listOfTuples = [(16,5000),(10,3000),(13,2600),(19,1800),(11,4000),(17,3100),(14,2500),(20,2000)]
>>> tupleRDD = sc.parallelize(listOfTuples,2)
>>> tupleDF = tupleRDD.toDF(["Id","Salary"])
The data looks like :
>>> tupleRDD.glom().collect()
[[(16, 5000), (10, 3000), (13, 2600), (19, 1800)], [(11, 4000), (17, 3100), (14, 2500), (20, 2000)]]
>>> tupleDF.show()
+---+------+
| Id|Salary|
+---+------+
| 16| 5000|
| 10| 3000|
| 13| 2600|
| 19| 1800|
| 11| 4000|
| 17| 3100|
| 14| 2500|
| 20| 2000|
+---+------+
Now the sort operation will be
>>> tupleDF.sort("id").show()
+---+------+
| Id|Salary|
+---+------+
| 10| 3000|
| 11| 4000|
| 13| 2600|
| 14| 2500|
| 16| 5000|
| 17| 3100|
| 19| 1800|
| 20| 2000|
+---+------+
Note that the output is already in total order: on a DataFrame, sort() performs a global sort, not a per-partition one. Now the orderBy operation:
>>> tupleDF.orderBy("id").show()
+---+------+
| Id|Salary|
+---+------+
| 10| 3000|
| 11| 4000|
| 13| 2600|
| 14| 2500|
| 16| 5000|
| 17| 3100|
| 19| 1800|
| 20| 2000|
+---+------+
It produces the identical, totally ordered output, as expected for an alias of sort().