
How to get the same percent_rank in SQL and pandas?

I was learning PySpark, which uses HiveQL, and found it interesting that percent rank gives two different answers in pyspark-sql and in pandas.

Question Source with sql code: https://www.windowfunctions.com/questions/ranking/3

How to get the same result as SQL in pandas?

Two Questions

  • What is the python code that gives same result as SQL?
  • What is the SQL code that gives the same result as pandas?

pyspark-sql

q = """
select name, weight,
       percent_rank() over (order by weight) as percent_rank_wt
from cats
order by weight
"""
spark.sql(q).show()

SQL gives this table. I would like the same table using pandas.

+-------+------+-------------------+
|   name|weight|    percent_rank_wt|
+-------+------+-------------------+
| Tigger|   3.8|                0.0|
|  Molly|   4.2|0.09090909090909091|
|  Ashes|   4.5|0.18181818181818182|
|Charlie|   4.8| 0.2727272727272727|
| Smudge|   4.9|0.36363636363636365|
|  Felix|   5.0|0.45454545454545453|
|   Puss|   5.1| 0.5454545454545454|
| Millie|   5.4| 0.6363636363636364|
|  Alfie|   5.5| 0.7272727272727273|
|  Misty|   5.7| 0.8181818181818182|
|  Oscar|   6.1| 0.9090909090909091|
| Smokey|   6.1| 0.9090909090909091|
+-------+------+-------------------+
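The values above follow percent_rank's definition: (rows strictly smaller than the current row) / (total rows - 1). A minimal sketch verifying a few entries against the table:

```python
import numpy as np

# weights in the sorted order shown in the table above
weights = np.array([3.8, 4.2, 4.5, 4.8, 4.9, 5.0, 5.1, 5.4, 5.5, 5.7, 6.1, 6.1])

# percent_rank = (rows strictly smaller than the current row) / (n - 1)
n = len(weights)
percent_rank = np.array([(weights < w).sum() for w in weights]) / (n - 1)
print(percent_rank)
```

Tigger (3.8) has no lighter rows, so 0/11 = 0.0; the two tied 6.1 rows each have 10 strictly lighter rows, so 10/11 ≈ 0.909, matching the SQL output.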

pandas

# rank methods: 'average', 'min', 'max', 'first', 'dense'

df[['name', 'weight']].assign(
    pct_avg=df['weight'].rank(pct=True, method='average'),
    pct_min=df['weight'].rank(pct=True, method='min'),
    pct_max=df['weight'].rank(pct=True, method='max'),
    pct_first=df['weight'].rank(pct=True, method='first'),
    pct_dense=df['weight'].rank(pct=True, method='dense')
).sort_values('weight')
       name  weight   pct_avg   pct_min   pct_max  pct_first  pct_dense
4    Tigger     3.8  0.083333  0.083333  0.083333   0.083333   0.090909
0     Molly     4.2  0.166667  0.166667  0.166667   0.166667   0.181818
1     Ashes     4.5  0.250000  0.250000  0.250000   0.250000   0.272727
11  Charlie     4.8  0.333333  0.333333  0.333333   0.333333   0.363636
3    Smudge     4.9  0.416667  0.416667  0.416667   0.416667   0.454545
2     Felix     5.0  0.500000  0.500000  0.500000   0.500000   0.545455
9      Puss     5.1  0.583333  0.583333  0.583333   0.583333   0.636364
7    Millie     5.4  0.666667  0.666667  0.666667   0.666667   0.727273
5     Alfie     5.5  0.750000  0.750000  0.750000   0.750000   0.818182
8     Misty     5.7  0.833333  0.833333  0.833333   0.833333   0.909091
6     Oscar     6.1  0.958333  0.916667  1.000000   0.916667   1.000000
10   Smokey     6.1  0.958333  0.916667  1.000000   1.000000   1.000000
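The tie at 6.1 is where the methods diverge. A tiny sketch with a hypothetical three-value series (not from the cats data) showing how min, max, and average treat a tie under pct=True:

```python
import pandas as pd

s = pd.Series([3.8, 6.1, 6.1])  # one tie at the top

# min: both tied rows take the lowest rank in the group (2), so 2/3
r_min = s.rank(pct=True, method='min').tolist()
# max: both take the highest rank (3), so 3/3
r_max = s.rank(pct=True, method='max').tolist()
# average: (2 + 3) / 2 = 2.5, so 2.5/3
r_avg = s.rank(pct=True, method='average').tolist()
print(r_min, r_max, r_avg)
```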

setup

import numpy as np
import pandas as pd

import pyspark
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark import SparkConf, SparkContext, SQLContext
spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

df = pd.DataFrame({
    'name': [
        'Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
        'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'
    ],
    'breed': [
        'Persian', 'Persian', 'Persian', 'British Shorthair',
        'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon', 'Maine Coon',
        'Maine Coon', 'Maine Coon', 'British Shorthair'
    ],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
    'color': [
        'Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell', 'Brown',
        'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell', 'Brown', 'Black'
    ],
    'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]
})

schema = StructType([
    StructField('name', StringType(), True),
    StructField('breed', StringType(), True),
    StructField('weight', DoubleType(), True),
    StructField('color', StringType(), True),
    StructField('age', IntegerType(), True),
])

sdf = sqlContext.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")
asked Jul 24 '20 by BhishanPoudel



1 Answer

SQL's percent_rank is not exactly the same as pandas' rank. There are two main differences:

  • SQL's percent_rank excludes the current row from the denominator. With 12 rows, each row's result is computed against the other 11 rows; pandas' rank(pct=True) divides by all 12.
  • SQL's percent_rank counts the rows strictly smaller than the current one; none of pandas' rank methods does exactly that out of the box.

What is the python code that gives same result as SQL?

To get the equivalent of SQL's percent_rank in pandas, you can actually perform a small calculation on the rank results:

(df['weight'].rank(method='min')-1) / (len(df['weight'])-1)

The -1 in the numerator converts the min rank into the count of rows strictly smaller than the current row, and the -1 in the denominator excludes the current row from the total.
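Put together on the cats data (redefining the frame here so the sketch is self-contained), this reproduces the SQL table:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie',
             'Oscar', 'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

# (min rank - 1) / (rows - 1) == SQL's percent_rank
out = (df[['name', 'weight']]
       .assign(percent_rank_wt=(df['weight'].rank(method='min') - 1)
                               / (len(df) - 1))
       .sort_values('weight'))
print(out)
```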

What is the SQL code that gives the same result as pandas?

It depends on which method you use in pandas' rank, but what you most likely want is SQL's cume_dist, which matches rank(pct=True, method='max').
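cume_dist counts the rows less than or equal to the current one, divided by the total row count; that is the same definition pandas uses for rank(pct=True, method='max'). A small sketch checking the equivalence on the cats weights:

```python
import pandas as pd

weights = pd.Series([4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8])

# SQL: cume_dist() over (order by weight)
# definition: (rows with weight <= current) / total rows
cume_dist = weights.apply(lambda w: (weights <= w).mean())

# pandas' method='max' with pct=True uses the same definition
assert (weights.rank(pct=True, method='max') == cume_dist).all()
print(cume_dist.sort_values().tolist())
```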

answered Sep 22 '22 by Qusai Alothman