What is the difference between sort and orderBy functions in Spark

What is the difference between sort and orderBy on a Spark DataFrame?

scala> zips.printSchema
root
 |-- _id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- loc: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- pop: long (nullable = true)
 |-- state: string (nullable = true)

The commands below produce the same result:

zips.sort(desc("pop")).show
zips.orderBy(desc("pop")).show
Rahul Sharma asked Nov 15 '16 06:11





3 Answers

orderBy is just an alias for the sort function.

From the Spark source code (the doc comment on Dataset.orderBy):

  /**
   * Returns a new Dataset sorted by the given expressions.
   * This is an alias of the `sort` function.
   *
   * @group typedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def orderBy(sortCol: String, sortCols: String*): Dataset[T] = sort(sortCol, sortCols : _*)
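
The quoted source shows orderBy simply delegating to sort. A minimal sketch of the same alias pattern in plain Python follows. It is not Spark code and needs no Spark install; the TinyFrame class, its rows field and the sample data are all made up for illustration.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: rows are a list of dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def sort(self, key, descending=False):
        # Return a new frame whose rows are totally ordered by `key`.
        ordered = sorted(self.rows, key=lambda r: r[key], reverse=descending)
        return TinyFrame(ordered)

    # orderBy is bound to the very same function object as sort,
    # so calling either one runs identical code.
    orderBy = sort


zips = TinyFrame([
    {"city": "A", "pop": 300},
    {"city": "B", "pop": 900},
    {"city": "C", "pop": 100},
])

by_sort = zips.sort("pop", descending=True).rows
by_order = zips.orderBy("pop", descending=True).rows
print(by_sort == by_order)
```

Because both names point at one function, there is no way for the two calls to differ in result or cost in this model.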
Shivansh answered Oct 16 '22 02:10



They are NOT the same in Spark SQL.

The SORT BY clause is used to return the result rows sorted within each partition in the user specified order. When there is more than one partition SORT BY may return result that is partially ordered.

Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sortby.html

The ORDER BY clause is used to return the result rows in a sorted manner in the user specified order. Unlike the SORT BY clause, this clause guarantees a total order in the output.

Reference: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-orderby.html
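
To make the two SQL clauses concrete, here is a small plain-Python model of the semantics quoted above. It is only a sketch: partitions are modelled as ordinary lists, and the helper names sort_by and order_by are hypothetical, not Spark functions. As far as I know, the DataFrame method that matches SQL SORT BY is sortWithinPartitions, while both DataFrame sort and orderBy behave like ORDER BY.

```python
def sort_by(partitions, key):
    # SORT BY model: sort each partition on its own; rows never cross partitions.
    return [sorted(p, key=key) for p in partitions]


def order_by(partitions, key):
    # ORDER BY model: one total order over the rows of all partitions.
    all_rows = [row for p in partitions for row in p]
    return sorted(all_rows, key=key)


def identity(x):
    return x


# Two partitions of made-up values.
partitions = [[7, 2, 9], [4, 8, 1]]

within = sort_by(partitions, identity)
flat_within = [x for p in within for x in p]
total = order_by(partitions, identity)

print(flat_within)  # [2, 7, 9, 1, 4, 8] -> ordered inside each partition only
print(total)        # [1, 2, 4, 7, 8, 9] -> totally ordered
```

With a single partition the two results would be identical, which is why the difference only shows up when there is more than one partition.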

RaHuL VeNuGoPaL answered Oct 16 '22 02:10



The sort() function sorts the output within each bucket by the given columns on the file system. It does not guarantee the order of the output data as a whole, whereas orderBy() happens in two phases.

First the rows are sorted inside each bucket using sortBy(); then the entire data set has to be brought into a single executor to produce the overall ascending or descending order on the specified column. This involves heavy shuffling and is a costly operation.

The sort() operation, by contrast, happens inside each individual bucket and is a lightweight operation.

Here is an example:

Preparing the data:

>>> listOfTuples = [(16,5000),(10,3000),(13,2600),(19,1800),(11,4000),(17,3100),(14,2500),(20,2000)]
>>> tupleRDD = sc.parallelize(listOfTuples,2)
>>> tupleDF = tupleRDD.toDF(["Id","Salary"])

The data looks like this:

>>> tupleRDD.glom().collect()
[[(16, 5000), (10, 3000), (13, 2600), (19, 1800)], [(11, 4000), (17, 3100), (14, 2500), (20, 2000)]]
>>> tupleDF.show()
+---+------+
| Id|Salary|
+---+------+
| 16|  5000|
| 10|  3000|
| 13|  2600|
| 19|  1800|
| 11|  4000|
| 17|  3100|
| 14|  2500|
| 20|  2000|
+---+------+

Now the sort operation:

>>> tupleDF.sort("id").show()
+---+------+
| Id|Salary|
+---+------+
| 10|  3000|
| 11|  4000|
| 13|  2600|
| 14|  2500|
| 16|  5000|
| 17|  3100|
| 19|  1800|
| 20|  2000|
+---+------+

In this run the output is fully ordered by Id. Now the orderBy operation:

>>> tupleDF.orderBy("id").show()
+---+------+
| Id|Salary|
+---+------+
| 10|  3000|
| 11|  4000|
| 13|  2600|
| 14|  2500|
| 16|  5000|
| 17|  3100|
| 19|  1800|
| 20|  2000|
+---+------+

It maintains the overall order of the data.

Paul answered Oct 16 '22 02:10
