I have a dataframe that looks like this:
Out[14]:
    impwealth  indweight
16     180000     34.200
21     384000     37.800
26     342000     39.715
30    1154000     44.375
31     421300     44.375
32    1210000     45.295
33    1062500     45.295
34    1878000     46.653
35     876000     46.653
36     925000     53.476
I want to calculate the weighted median of the column impwealth using the frequency weights in indweight. My pseudo code looks like this:
# Sort by `impwealth` in ascending order
df.sort_values('impwealth', inplace=True)
# Find the 50th percentile weight, P
P = df['indweight'].sum() * 0.5
# Search for the first occurrence of `indweight` that is greater than P
i = df.loc[df['indweight'] > P, 'indweight'].last_valid_index()
# The value of `impwealth` associated with this index will be the weighted median
w_median = df.loc[i, 'impwealth']
This method seems clunky, and I'm not sure it's correct. I didn't find a built-in way to do this in the pandas reference. What is the best way to go about finding a weighted median?
Calculate a Weighted Average in Pandas Using NumPy
The numpy library has a function, average(), which accepts an optional argument specifying the weight of each value. It takes the data array as the argument a= and an array of weights as the argument weights=.
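As a small illustration of the weights= argument, here is a sketch on made-up numbers (the arrays below are not from the question). It also computes the same quantity by hand so the definition is visible.

```python
import numpy as np

# Illustrative data only; not taken from the question's dataframe.
values = np.array([10.0, 20.0, 30.0])
weights = np.array([1.0, 1.0, 2.0])

# Weighted mean via NumPy.
weighted_mean = np.average(values, weights=weights)

# The same thing by hand: sum(value * weight) / sum(weight).
manual = (values * weights).sum() / weights.sum()

print(weighted_mean, manual)
```

Note that this gives a weighted mean, not a weighted median; np.average has no median counterpart that accepts weights in the same way.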
For a weighted median we change how the middle is found: instead of finding the middle value, we look for the middle weight, and the median is the value associated with that weight. Here's a very high-level algorithm: sort the values, then add up the weights for the values in order (i.e. a running sum of the weights).
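The description above does not say where to stop the running sum. A common convention, assumed here, is to take the first value whose cumulative weight reaches half of the total weight, with no interpolation. A minimal NumPy sketch under that assumption (the function name weighted_median_sketch is hypothetical):

```python
import numpy as np

def weighted_median_sketch(values, weights):
    """Return the first sorted value whose running weight reaches half the total.

    Assumption: no interpolation between neighbouring values.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Step 1: sort the values and carry the weights along.
    order = np.argsort(values)
    values, weights = values[order], weights[order]

    # Step 2: running sum of the weights in sorted order.
    running = np.cumsum(weights)

    # Step 3: first position where the running sum is >= half the total weight.
    cutoff = weights.sum() / 2.0
    idx = np.searchsorted(running, cutoff, side="left")
    return values[idx]

# With equal weights this picks the ordinary middle value.
print(weighted_median_sketch([3, 1, 2], [1, 1, 1]))
# A heavy weight on the largest value pulls the median towards it.
print(weighted_median_sketch([1, 2, 3, 4], [1, 1, 1, 5]))
```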
If you only want to see the ordinary (unweighted) median, you can use df.describe(). The 50% row is the median.
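For reference, a short sketch of reading that row programmatically, using the impwealth values from the question. describe() takes no weights, so the result ignores indweight entirely.

```python
import pandas as pd

# impwealth values copied from the question's dataframe.
df = pd.DataFrame({
    "impwealth": [180000, 384000, 342000, 1154000, 421300,
                  1210000, 1062500, 1878000, 876000, 925000],
})

summary = df.describe()

# The "50%" row of the summary is the unweighted median of each numeric column.
unweighted_median = summary.loc["50%", "impwealth"]
print(unweighted_median)
```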
If the total number of occurrences (let's call it 'n', i.e. the sum of the frequencies / the total number of students) is odd, then the median is the ((n+1) / 2)-th value. If n is even, then the median is the average of the (n/2)-th and the ((n/2) + 1)-th value.
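The odd/even rule above can be sketched for integer frequencies as follows. The function name median_from_frequencies and the sample data are made up for illustration, and the result is compared against np.median on the expanded data as a sanity check.

```python
import numpy as np

def median_from_frequencies(values, freqs):
    """Median of data given as (value, integer frequency) pairs."""
    values = np.asarray(values, dtype=float)
    freqs = np.asarray(freqs, dtype=int)

    # Sort by value and build cumulative counts.
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cum = np.cumsum(freqs)
    n = int(cum[-1])  # total number of occurrences

    def kth(k):
        # 1-based k-th value of the expanded, sorted data:
        # the first value whose cumulative count reaches k.
        return values[np.searchsorted(cum, k, side="left")]

    if n % 2 == 1:
        return kth((n + 1) // 2)
    return (kth(n // 2) + kth(n // 2 + 1)) / 2.0

# Hypothetical example: value 1 occurs twice, 2 once, 3 three times.
vals, counts = [1, 2, 3], [2, 1, 3]
print(median_from_frequencies(vals, counts))
print(np.median(np.repeat(vals, counts)))  # same result from the expanded data
```

This rule averages the two middle values when n is even, so it can differ from a weighted median that does not interpolate.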
If you want to do this in pure pandas, here's a way. It does not interpolate either. (@svenkatesh, you were missing the cumulative sum in your pseudocode)
df.sort_values('impwealth', inplace=True)
cumsum = df.indweight.cumsum()
cutoff = df.indweight.sum() / 2.0
median = df.impwealth[cumsum >= cutoff].iloc[0]
This gives a median of 925000.
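To reproduce that figure end to end, here is a self-contained sketch that rebuilds the dataframe from the question and wraps the same steps in a helper. The helper name weighted_median is hypothetical, and it sorts a copy rather than modifying df in place.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame(
    {
        "impwealth": [180000, 384000, 342000, 1154000, 421300,
                      1210000, 1062500, 1878000, 876000, 925000],
        "indweight": [34.200, 37.800, 39.715, 44.375, 44.375,
                      45.295, 45.295, 46.653, 46.653, 53.476],
    },
    index=[16, 21, 26, 30, 31, 32, 33, 34, 35, 36],
)

def weighted_median(frame, value_col, weight_col):
    """First value (in sorted order) whose cumulative weight reaches half the total."""
    ordered = frame.sort_values(value_col)   # returns a sorted copy
    cumsum = ordered[weight_col].cumsum()    # running sum of the weights
    cutoff = ordered[weight_col].sum() / 2.0
    return ordered.loc[cumsum >= cutoff, value_col].iloc[0]

median = weighted_median(df, "impwealth", "indweight")
print(median)
```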
Have you tried the wquantiles package? I had never used it before, but it has a weighted median function that seems to give at least a reasonable answer (you'll probably want to double check that it's using the approach you expect).
In [12]: import weighted
In [13]: weighted.median(df['impwealth'], df['indweight'])
Out[13]: 914662.0859091772
You can use this solution from "Weighted percentile using numpy":
import numpy as np

def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `values`
    :param values_sorted: bool, if True, then will avoid sorting of
        initial array
    :param old_style: if True, will correct output to be consistent
        with numpy.percentile.
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), \
        'quantiles should be in [0, 1]'

    if not values_sorted:
        # Sort the values and reorder the weights to match
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]

    # Place each value at the midpoint of its own weight interval
    weighted_quantiles = np.cumsum(sample_weight) - 0.5 * sample_weight
    if old_style:
        # To be consistent with numpy.percentile
        weighted_quantiles -= weighted_quantiles[0]
        weighted_quantiles /= weighted_quantiles[-1]
    else:
        weighted_quantiles /= np.sum(sample_weight)
    # Linearly interpolate between neighbouring values
    return np.interp(quantiles, weighted_quantiles, values)
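Call as weighted_quantile(df.impwealth, 0.5, sample_weight=df.indweight). (The weights must be passed by keyword or as the third positional argument; a positional argument after quantiles=0.5 is a syntax error.)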