I was learning PySpark, which uses HiveQL, and found it interesting that percent_rank gives two different answers for pyspark-sql and pandas.

Question source with SQL code: https://www.windowfunctions.com/questions/ranking/3

How can I get the same result as the SQL query in pandas?
q = """
select name, weight,
percent_rank() over (order by weight) as percent_rank_wt
from cats
order by weight
"""
spark.sql(q).show()
SQL gives the table below; I would like to produce the same table using pandas.
+-------+------+-------------------+
| name|weight| percent_rank_wt|
+-------+------+-------------------+
| Tigger| 3.8| 0.0|
| Molly| 4.2|0.09090909090909091|
| Ashes| 4.5|0.18181818181818182|
|Charlie| 4.8| 0.2727272727272727|
| Smudge| 4.9|0.36363636363636365|
| Felix| 5.0|0.45454545454545453|
| Puss| 5.1| 0.5454545454545454|
| Millie| 5.4| 0.6363636363636364|
| Alfie| 5.5| 0.7272727272727273|
| Misty| 5.7| 0.8181818181818182|
| Oscar| 6.1| 0.9090909090909091|
| Smokey| 6.1| 0.9090909090909091|
+-------+------+-------------------+
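For reference, the same ranking can also be written with the PySpark DataFrame API instead of SQL. This is only a sketch and assumes the sdf Spark DataFrame created in the setup code further down:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# percent_rank over the whole table, ordered by weight
w = Window.orderBy('weight')
sdf.select(
    'name', 'weight',
    F.percent_rank().over(w).alias('percent_rank_wt')
).orderBy('weight').show()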
methods = {'average', 'min', 'max', 'first', 'dense'}

df[['name', 'weight']].assign(
    pct_avg=df['weight'].rank(pct=True, method='average'),
    pct_min=df['weight'].rank(pct=True, method='min'),
    pct_max=df['weight'].rank(pct=True, method='max'),
    pct_first=df['weight'].rank(pct=True, method='first'),
    pct_dense=df['weight'].rank(pct=True, method='dense')
).sort_values('weight')
name weight pct_avg pct_min pct_max pct_first pct_dense
4 Tigger 3.8 0.083333 0.083333 0.083333 0.083333 0.090909
0 Molly 4.2 0.166667 0.166667 0.166667 0.166667 0.181818
1 Ashes 4.5 0.250000 0.250000 0.250000 0.250000 0.272727
11 Charlie 4.8 0.333333 0.333333 0.333333 0.333333 0.363636
3 Smudge 4.9 0.416667 0.416667 0.416667 0.416667 0.454545
2 Felix 5.0 0.500000 0.500000 0.500000 0.500000 0.545455
9 Puss 5.1 0.583333 0.583333 0.583333 0.583333 0.636364
7 Millie 5.4 0.666667 0.666667 0.666667 0.666667 0.727273
5 Alfie 5.5 0.750000 0.750000 0.750000 0.750000 0.818182
8 Misty 5.7 0.833333 0.833333 0.833333 0.833333 0.909091
6 Oscar 6.1 0.958333 0.916667 1.000000 0.916667 1.000000
10 Smokey 6.1 0.958333 0.916667 1.000000 1.000000 1.000000
import pandas as pd
import pyspark
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType
)

spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
df = pd.DataFrame({
'name': [
'Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'
],
'breed': [
'Persian', 'Persian', 'Persian', 'British Shorthair',
'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon', 'Maine Coon',
'Maine Coon', 'Maine Coon', 'British Shorthair'
],
'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
'color': [
'Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell', 'Brown',
'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell', 'Brown', 'Black'
],
'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]
})
schema = StructType([
StructField('name', StringType(), True),
StructField('breed', StringType(), True),
StructField('weight', DoubleType(), True),
StructField('color', StringType(), True),
StructField('age', IntegerType(), True),
])
sdf = spark.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")
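With the view registered, the query q from the top of the question can be re-run, and toPandas() gives an easy way to compare the Spark result against the pandas attempts (nothing here beyond objects already defined above):

# Re-run the window-function query against the registered "cats" view
sql_result = spark.sql(q)
sql_result.show()

# Pull the Spark result into pandas for a side-by-side comparison
print(sql_result.toPandas())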
SQL's percent_rank is not exactly the same as pandas' rank. There are mainly two differences:

- percent_rank excludes the current row from the calculation: the cats table has 12 rows, so each row's value is computed against the other 11. pandas' rank includes all the rows.
- percent_rank counts how many rows are strictly smaller than the current one; pandas' rank doesn't offer a method that does exactly that.

To get the equivalent of SQL's percent_rank in pandas, you can perform a small adjustment on the rank results:
(df['weight'].rank(method='min') - 1) / (len(df['weight']) - 1)

The -1 in the numerator yields the number of rows strictly smaller than the current row, and the -1 in the denominator computes the result excluding the current row.
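Putting it together, a minimal sketch that reproduces the SQL table from the question (only df and the column name percent_rank_wt from the query are reused):

# Apply the adjusted rank to get SQL-style percent_rank
out = df[['name', 'weight']].assign(
    percent_rank_wt=(df['weight'].rank(method='min') - 1) / (len(df) - 1)
).sort_values('weight').reset_index(drop=True)
print(out)
# Tigger gets 0.0, Molly gets 1/11 = 0.0909..., and the two rows tied at 6.1
# both get 10/11, matching the Spark output above.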
It depends on which method you are using in pandas' rank, but what rank(pct=True) computes is probably SQL's cume_dist rather than percent_rank.
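As an illustration, a small sketch (reusing sdf and df from above) showing that pandas' rank(pct=True, method='max') lines up with SQL's cume_dist, i.e. the fraction of rows with weight less than or equal to the current row's weight:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# cume_dist over the whole table, ordered by weight
w = Window.orderBy('weight')
sdf.select(
    'name', 'weight',
    F.cume_dist().over(w).alias('cume_dist_wt')
).orderBy('weight').show()

# The pct_max column from the question matches cume_dist exactly
print(df['weight'].rank(pct=True, method='max').sort_values())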