Difference between dense rank and row number in Spark

2 Answers

The functions differ only when there are "ties" in the ordering column. Consider the example below:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Needed for toDF outside spark-shell; assumes `spark` is the active SparkSession
import spark.implicits._

// One partition ("a") with a tie on the ordering value 10
val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")

val windowSpec = Window.partitionBy("col1").orderBy("col2")

df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec))
  .show()

+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
|   a|  10|   1|         1|         1|
|   a|  10|   1|         1|         2|
|   a|  20|   3|         2|         3|
+----+----+----+----------+----------+

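Note that the value "10" appears twice in col2 within the same partition (col1 = "a"). That is where the three functions diverge: rank gives tied rows the same number and then skips ahead (1, 1, 3), dense_rank gives tied rows the same number without leaving a gap (1, 1, 2), and row_number always numbers rows consecutively (1, 2, 3), breaking ties in an arbitrary order.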
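To make the tie handling concrete, here is a minimal plain-Python sketch that mimics the three functions for a single partition. It is only an illustration of the semantics, not how Spark computes them, and the helper name window_ranks is hypothetical.

```python
def window_ranks(values):
    """Return (value, rank, dense_rank, row_number) tuples for one partition.

    Hypothetical helper: mimics the window functions on a plain list,
    ordered ascending like orderBy("col2").
    """
    ordered = sorted(values)
    rows = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that compares unequal to every real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank jumps to the current position, leaving gaps after ties
            dense += 1         # dense_rank advances by exactly one per distinct value
            previous = value
        rows.append((value, rank, dense, row_number))
    return rows


for row in window_ranks([10, 10, 20]):
    print(row)
# (10, 1, 1, 1)
# (10, 1, 1, 2)
# (20, 3, 2, 3)
```

The printed tuples line up with the col2, rank, dense_rank and row_number columns of the table above.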

answered Oct 07 '22 00:10

Daniel de Paula

Here is @Daniel's answer in Python, plus a comparison with count('*'), which is useful when you want at most n rows per group (top-n).

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ['a', 10], ['a', 20], ['a', 30],
    ['a', 40], ['a', 40], ['a', 40], ['a', 40],
    ['a', 50], ['a', 50], ['a', 60]], ['part_col', 'order_col'])

window = Window.partitionBy("part_col").orderBy("order_col")

# With an orderBy and no explicit frame, count('*') is a running count over the
# default frame (range between unbounded preceding and current row), so tied
# rows all get the count up to and including the last tied row.
df = (df
  .withColumn("rank", F.rank().over(window))
  .withColumn("dense_rank", F.dense_rank().over(window))
  .withColumn("row_number", F.row_number().over(window))
  .withColumn("count", F.count('*').over(window))
)

df.show()

+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
|       a|       10|   1|         1|         1|    1|
|       a|       20|   2|         2|         2|    2|
|       a|       30|   3|         3|         3|    3|
|       a|       40|   4|         4|         4|    7|
|       a|       40|   4|         4|         5|    7|
|       a|       40|   4|         4|         6|    7|
|       a|       40|   4|         4|         7|    7|
|       a|       50|   8|         5|         8|    9|
|       a|       50|   8|         5|         9|    9|
|       a|       60|  10|         6|        10|   10|
+--------+---------+----+----------+----------+-----+

For example, to take at most 4 rows without arbitrarily picking one of the four rows tied at 40 in the ordering column:

df.where("count <= 4").show()

+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
|       a|       10|   1|         1|         1|    1|
|       a|       20|   2|         2|         2|    2|
|       a|       30|   3|         3|         3|    3|
+--------+---------+----+----------+----------+-----+

In summary, if you filter each of these columns with <= n, you get (assuming the partition has at least n rows):

rank: at least n rows
dense_rank: all rows for the first n distinct order_col values (so at least n rows)
row_number: exactly n rows
count: at most n rows
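The row counts behind this summary can be checked without a cluster. The following is a plain-Python sketch under the assumption of a single partition ordered ascending; the function name rows_kept is hypothetical, and the code only mimics the window semantics rather than calling Spark.

```python
from collections import Counter
from itertools import accumulate


def rows_kept(values, n):
    """Count the rows that survive a `<= n` filter on each window column.

    Hypothetical helper for one partition ordered ascending.
    """
    ordered = sorted(values)
    freq = Counter(ordered)
    distinct = sorted(freq)
    # Running count per distinct value: number of rows with order value <= it
    cumulative = dict(zip(distinct, accumulate(freq[v] for v in distinct)))
    # rank of a value is the position of its first tied row
    rank = {v: cumulative[v] - freq[v] + 1 for v in distinct}
    # dense_rank is the 1-based index among the distinct values
    dense = {v: i for i, v in enumerate(distinct, start=1)}
    return {
        "rank": sum(1 for v in ordered if rank[v] <= n),
        "dense_rank": sum(1 for v in ordered if dense[v] <= n),
        "row_number": min(n, len(ordered)),
        "count": sum(1 for v in ordered if cumulative[v] <= n),
    }


values = [10, 20, 30, 40, 40, 40, 40, 50, 50, 60]
print(rows_kept(values, 4))
# {'rank': 7, 'dense_rank': 7, 'row_number': 4, 'count': 3}
```

With n = 4 on the data above, rank and dense_rank both keep all four rows tied at 40 (7 rows in total), row_number keeps exactly 4, and count keeps only the 3 rows before the tie.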

answered Oct 07 '22 00:10

steco

Tags:

apache-spark

John
