
Difference in dense rank and row number in spark

Tags:

apache-spark

I am trying to understand the difference between dense_rank and row_number. In each new window partition, both start from 1. Doesn't the rank of a row always start from 1? Any help would be appreciated.

John asked Jul 07 '17 10:07


People also ask

What is the difference between row_number rank and Dense_rank?

row_number gives continuous numbers, while rank and dense_rank give the same rank for duplicates. With rank, the next number after a tie follows the continuous ordering, so you will see a jump; dense_rank has no gaps in its rankings.

What is difference between Rownum and rank?

The difference between RANK() and ROW_NUMBER() is how they handle duplicate values. RANK() assigns duplicates the same ranking, and a gap then appears in the sequence after each set of duplicates; ROW_NUMBER() always increments.

What is dense rank in Spark?

The dense_rank() window function is used to get the rank of rows within a window partition without any gaps. It is similar to the rank() function, the difference being that rank() leaves gaps in the ranking when there are ties.

What is the difference between rank and dense?

RANK and DENSE_RANK will assign tied values the same rank depending on how they fall compared to the other values. However, RANK will then skip the next available ranking value, whereas DENSE_RANK will still use the next sequential ranking value.
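To make the three definitions above concrete, here is a minimal plain-Python sketch (no Spark required) of how the three numbers are assigned within one partition already sorted by the ordering key. `rank_rows` is a hypothetical helper for illustration, not a Spark API:

```python
def rank_rows(values):
    """Return (value, rank, dense_rank, row_number) for a sorted list."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for i, v in enumerate(values, start=1):
        if v != prev:
            rank = i     # rank jumps to the current position (gaps after ties)
            dense += 1   # dense_rank just increments (no gaps)
            prev = v
        out.append((v, rank, dense, i))  # i is row_number
    return out

for row in rank_rows([10, 10, 20]):
    print(row)
# (10, 1, 1, 1)
# (10, 1, 1, 2)
# (20, 3, 2, 3)
```

The tied value 10 gets rank 1 and dense_rank 1 twice; rank then skips to 3 while dense_rank continues with 2, and row_number simply counts 1, 2, 3.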


2 Answers

The difference is when there are "ties" in the ordering column. Check the example below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")

val windowSpec = Window.partitionBy("col1").orderBy("col2")

df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec)).show

+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
|   a|  10|   1|         1|         1|
|   a|  10|   1|         1|         2|
|   a|  20|   3|         2|         3|
+----+----+----+----------+----------+

Note that the value 10 appears twice in col2 within the same window (col1 = "a"). That is exactly when the three functions differ.
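For reference, the same comparison can be written with standard SQL window functions. A sketch using Python's built-in sqlite3 (window functions require SQLite 3.25+), with table and column names mirroring the example above:

```python
import sqlite3  # stdlib; window functions need SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 10), ("a", 10), ("a", 20)])

rows = con.execute("""
    SELECT col1, col2,
           rank()       OVER w AS rnk,
           dense_rank() OVER w AS drnk,
           row_number() OVER w AS rn
    FROM t
    WINDOW w AS (PARTITION BY col1 ORDER BY col2)
    ORDER BY rn
""").fetchall()

for r in rows:
    print(r)
# ('a', 10, 1, 1, 1)
# ('a', 10, 1, 1, 2)
# ('a', 20, 3, 2, 3)
```

The output matches the Spark table above, since Spark follows the same SQL window-function semantics.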

Daniel de Paula answered Oct 07 '22 00:10


I'm showing @Daniel's answer in Python, and I'm adding a comparison with count('*') that can be used if you want to get at most the top n rows per group.

from pyspark.sql.session import SparkSession
from pyspark.sql import Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ['a', 10], ['a', 20], ['a', 30],
    ['a', 40], ['a', 40], ['a', 40], ['a', 40],
    ['a', 50], ['a', 50], ['a', 60]], ['part_col', 'order_col'])

window = Window.partitionBy("part_col").orderBy("order_col")

df = (df
  .withColumn("rank", F.rank().over(window))
  .withColumn("dense_rank", F.dense_rank().over(window))
  .withColumn("row_number", F.row_number().over(window))
  .withColumn("count", F.count('*').over(window))
)

df.show()

+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
|       a|       10|   1|         1|         1|    1|
|       a|       20|   2|         2|         2|    2|
|       a|       30|   3|         3|         3|    3|
|       a|       40|   4|         4|         4|    7|
|       a|       40|   4|         4|         5|    7|
|       a|       40|   4|         4|         6|    7|
|       a|       40|   4|         4|         7|    7|
|       a|       50|   8|         5|         8|    9|
|       a|       50|   8|         5|         9|    9|
|       a|       60|  10|         6|        10|   10|
+--------+---------+----+----------+----------+-----+
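A note on the count column: when the window spec has an ORDER BY, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes all peers (ties) of the current row. That is why every "40" row shows 7 rather than a plain running count. A minimal sqlite3 sketch of the same framing behaviour (again for illustration; Spark uses the same default):

```python
import sqlite3  # stdlib; window functions need SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (v INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(40,), (40,), (50,)])

# The default RANGE frame counts both "40" peer rows for each "40" row.
running = con.execute(
    "SELECT v, count(*) OVER (ORDER BY v) FROM t ORDER BY v").fetchall()
print(running)
# [(40, 2), (40, 2), (50, 3)]
```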

For example, if you want to take at most 4 rows without arbitrarily picking some of the four "40" values of the sorting column:

df.where("count <= 4").show()

+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
|       a|       10|   1|         1|         1|    1|
|       a|       20|   2|         2|         2|    2|
|       a|       30|   3|         3|         3|    3|
+--------+---------+----+----------+----------+-----+

In summary, if you filter those columns with <= n you will get:

  • rank: at least n rows
  • dense_rank: at least n different order_col values
  • row_number: exactly n rows
  • count: at most n rows
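Those guarantees can be sanity-checked without Spark. A small plain-Python sketch over the same order_col values (the dict-based rows are just an illustration, not a Spark structure):

```python
# order_col values from the table above, already sorted within the partition.
vals = [10, 20, 30, 40, 40, 40, 40, 50, 50, 60]

rows, rank, dense, prev = [], 0, 0, None
for i, v in enumerate(vals, start=1):
    if v != prev:
        rank, dense, prev = i, dense + 1, v
    rows.append({"order_col": v, "rank": rank,
                 "dense_rank": dense, "row_number": i})

# count('*') over the ordered window includes all peers of the current row,
# so it equals the highest row_number among rows sharing this value.
for r in rows:
    r["count"] = max(x["row_number"] for x in rows
                     if x["order_col"] == r["order_col"])

n = 4
print(len([r for r in rows if r["rank"] <= n]))        # 7 -> at least n rows
print(len([r for r in rows if r["row_number"] <= n]))  # 4 -> exactly n rows
print(len([r for r in rows if r["count"] <= n]))       # 3 -> at most n rows
```

Filtering rank <= 4 keeps all four tied "40" rows (7 rows total), while count <= 4 drops them all (3 rows), matching the filtered table above.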
steco answered Oct 07 '22 00:10