
Python: How to create weighted quantiles in Pandas?

Tags: python, pandas

I understand how to create simple quantiles in Pandas using pd.qcut. But after searching around, I don't see anything to create weighted quantiles. Specifically, I wish to create a variable which bins the values of a variable of interest (from smallest to largest) such that each bin contains an equal weight. So far this is what I have:

def wtdQuantile(dataframe, var, weight = None, n = 10):
    if weight is None:
        return pd.qcut(dataframe[var], n, labels = False)
    else:
        dataframe.sort_values(var, ascending = True, inplace = True)
        cum_sum = dataframe[weight].cumsum()
        cutoff = max(cum_sum)/n
        quantile = cum_sum/cutoff
        quantile[-1:] -= 1
        return quantile.map(int)

Is there an easier way, or something prebuilt from Pandas that I'm missing?

Edit: As requested, I'm providing some sample data. In the following, I'm trying to bin the "Var" variable using "Weight" as the weight. Using pd.qcut, we get an equal number of observations in each bin. Instead, I want an equal weight in each bin, or in this case, as close to equal as possible.

Weight  Var  pd.qcut(n=5)  Desired_Rslt
   10     1            0              0
   14     2            0              0
   18     3            1              0
   15     4            1              1
   30     5            2              1
   12     6            2              2
   20     7            3              2
   25     8            3              3
   29     9            4              3
   45    10            4              4
AdmiralWen asked Aug 06 '17 02:08



1 Answer

I don't think this is built into pandas, but here is a function that does what you want in a few lines:

import numpy as np
import pandas as pd
from pandas.api.types import is_integer  # public API; avoids the private pandas._libs module

def weighted_qcut(values, weights, q, **kwargs):
    """Return weighted quantile cuts from a given series, values."""
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
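
To see what each step does, here is a small walk-through of the same logic on unsorted input. The three values and weights are made up purely for illustration, and `.to_numpy()` is added only to make the positional indexing explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical, deliberately unsorted data (not from the question).
values = pd.Series([30, 10, 20])
weights = pd.Series([1.0, 2.0, 1.0])

# Reorder the weights so they follow ascending values, then accumulate.
# The result keeps the original row labels, now in sorted-value order.
order = weights.iloc[values.argsort().to_numpy()].cumsum()

# Cumulative share of the total weight; the last entry is exactly 1.
share = order / order.iloc[-1]

# Two equal-weight bins: (0, 0.5] and (0.5, 1].
bins = pd.cut(share, np.linspace(0, 1, 3), labels=False)

# sort_index() puts the bin labels back in the original row order.
result = bins.sort_index()
print(result.tolist())  # [1, 0, 1]
```

The smallest value (10) carries half of the total weight, so it fills the first bin on its own, while 20 and 30 share the second bin.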

We can test it on your data this way:

data = pd.DataFrame({
    'var': range(1, 11),
    'weight': [10, 14, 18, 15, 30, 12, 20, 25, 29, 45]
})

data['qcut'] = pd.qcut(data['var'], 5, labels=False)
data['weighted_qcut'] = weighted_qcut(data['var'], data['weight'], 5, labels=False)
print(data)

The output matches your desired result from above:

   var  weight  qcut  weighted_qcut
0    1      10     0              0
1    2      14     0              0
2    3      18     1              0
3    4      15     1              1
4    5      30     2              1
5    6      12     2              2
6    7      20     3              2
7    8      25     3              3
8    9      29     4              3
9   10      45     4              4
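
As a quick sanity check, you can sum the weight that ends up in each bin. This sketch hard-codes the bin labels from the table above so that it runs on its own; the names `by_count`, `by_weight` and `summary` are just illustrative:

```python
import pandas as pd

# Sample data and bin labels copied from the output table above.
data = pd.DataFrame({
    'var': range(1, 11),
    'weight': [10, 14, 18, 15, 30, 12, 20, 25, 29, 45],
    'qcut': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'weighted_qcut': [0, 0, 0, 1, 1, 2, 2, 3, 3, 4],
})

# Total weight that lands in each bin under the two schemes.
by_count = data.groupby('qcut')['weight'].sum()
by_weight = data.groupby('weighted_qcut')['weight'].sum()

summary = pd.DataFrame({'qcut': by_count, 'weighted_qcut': by_weight})
summary.index.name = 'bin'
print(summary)
```

The total weight is 218, so a perfectly even split would put 43.6 in each bin. The plain qcut bins hold 24, 33, 42, 45 and 74, while the weighted bins hold 42, 45, 32, 54 and 45, which is about as close to even as these ten rows allow.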
jakevdp answered Nov 07 '22 08:11