 

Pandas Weighted Stats

I have a dataframe that looks like the one below.

The weight column essentially represents the frequency of each item, so that for each location the weights sum to 1.

Please keep in mind that this is a simplified dataset; in reality there are more than 100 columns like value.

import pandas as pd

d = {'location': ['a', 'a', 'b', 'b'], 'item': ['x', 'y', 's', 'v'], 'value': [1, 5, 3, 7], 'weight': [0.9, 0.1, 0.8, 0.2]}
df = pd.DataFrame(data=d)
df
  location item value weight
0     a     x     1     0.9
1     a     y     5     0.1
2     b     s     3     0.8
3     b     v     7     0.2

I currently have code that computes the grouped median, standard deviation, skew and quantiles for the unweighted data:

df = df[['location','value']]

df1 = df.groupby('location').agg(['median','skew','std']).reset_index()

df2 = df.groupby('location').quantile([0.1, 0.9, 0.25, 0.75, 0.5]).unstack(level=1).reset_index()

dfs = df1.merge(df2, how = 'left', on = 'location')

And the result is the following:

  location   value
             median skew      std  0.1  0.9 0.25 0.75  0.5
0      a         3  NaN  2.828427  1.4  4.6  2.0  4.0  3.0
1      b         5  NaN  2.828427  3.4  6.6  4.0  6.0  5.0

I would like to produce the exact same result data frame as the one above, however with weighted statistics using the weight column. How can I go about doing this?

One more important consideration: there are often rows where value is null but still has a weight associated with it.

Asked Oct 31 '21 by Mustard Tiger




2 Answers

Because the weights are frequency weights, the most accurate method is to duplicate the observations according to the weights.

Adjusting the weights

Normally, frequencies are whole numbers. However, the frequencies here merely show how often an item appears relative to the other items in the same group. In this case, you can multiply all the weights by a value that makes them integers and use that value consistently throughout the dataset; for example, multiplying [0.9, 0.1, 0.8, 0.2] by 10 gives [9, 1, 8, 2].

Here is a function that scales the weights to the smallest possible set of whole numbers, which keeps the number of duplicated rows (and therefore memory usage) to a minimum.

import numpy as np

def adjust(weights):
    # Smallest power of 10 that turns every weight into a whole number
    base = 10 ** max([len(str(i).split(".")[1]) for i in weights])
    # Divide out the GCD of the scaled weights to keep them as small as possible
    scalar = base / np.gcd.reduce((weights * base).astype(int))
    weights = weights * scalar

    return weights

You can refer to the following question to understand how this function works.

  • Multiply a Numpy array by a scalar to make every element an integer

import pandas as pd

df = pd.DataFrame({
    "location": ["a", "a", "b", "b"],
    "values": [1, 5, 3, 7],
    "weights": [0.9, 0.1, 0.8, 0.2]
})

df.loc[:, "weights"] = adjust(df["weights"])

Here are the weights after the adjustment.

>>> df
  location  values  weights
0        a      1      9.0
1        a      5      1.0
2        b      3      8.0
3        b      7      2.0

Duplicating the observations

After adjusting the weights, you need to duplicate the observations according to their weights.

df = df.loc[df.index.repeat(df["weights"])] \
    .reset_index(drop=True).drop("weights", axis=1)

You can refer to the following answer to understand how this process works.

  • Duplicate rows, according to value in a column

Let's count the number of observations after being duplicated.

>>> df.count()
location    20
values      20

Performing Statistical Operations

Now, you can use groupby and aggregate using any statistical operations. The data is now weighted.

df1 = df.groupby("location").agg(["median", "skew", "std"]).reset_index()
df2 = df.groupby("location").quantile([0.1, 0.9, 0.25, 0.75, 0.5]) \
    .unstack(level=1).reset_index()

print(df1.merge(df2, how="left", on="location"))

This gives the following output.

  location values
           median      skew       std  0.1  0.9 0.25 0.75  0.5
0        a    1.0  3.162278  1.264911  1.0  1.4  1.0  1.0  1.0
1        b    3.0  1.778781  1.686548  3.0  7.0  3.0  3.0  3.0

Interpreting the weighted statistics

Let's follow the same process as above, but instead of using the smallest possible integer weights, we will gradually scale them up and observe the results. Because the adjusted weights are already at their minimum values, every larger set of weights is simply a multiple of the current set. Only the following line changes (a sketch of the full loop follows the results below).

df.loc[:, "weights"] = adjust(df["weights"])
  • adjust(df["weights"]) * 2

      location values
               median      skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.887939  1.231174  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.624466  1.641565  3.0  7.0  3.0  3.0  3.0
    
  • adjust(df["weights"]) * 3

      location values
               median     skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.80912  1.220514  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.58013  1.627352  3.0  7.0  3.0  3.0  3.0
    
  • adjust(df["weights"]) * 4

      location values
               median      skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.771708  1.215287  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.559086  1.620383  3.0  7.0  3.0  3.0  3.0
    
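For reference, here is a minimal sketch of that loop (it is not part of the original answer). It hard-codes the adjusted base weights [9, 1, 8, 2] from the earlier output rather than calling adjust(), scales them by increasing multipliers, duplicates the rows, and recomputes the grouped statistics each time.

import pandas as pd

# Adjusted integer weights from the earlier output: [9, 1, 8, 2]
base_weights = [9, 1, 8, 2]
data = {"location": ["a", "a", "b", "b"], "values": [1, 5, 3, 7]}

for multiplier in (1, 2, 3, 4):
    df = pd.DataFrame(data)
    # Scale the base weights, then duplicate each row according to its scaled weight
    repeats = [w * multiplier for w in base_weights]
    expanded = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
    stats = expanded.groupby("location")["values"].agg(["median", "skew", "std"])
    print(f"multiplier = {multiplier}")
    print(stats, end="\n\n")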

Repeating this process several times produces the following graph. The statistics in the graph are not split into groups, and some other statistics have been added for demonstration purposes.

[Figure: Comparing different statistics as the observations are duplicated]

Some statistics like sample mean, median, and quantiles are always constant no matter how many times we duplicate the observations.

Some statistics, on the other hand, give different results depending on how many duplications we make. Let's call them inconsistent statistics for now.

There are two types of inconsistent statistics.

  1. Inconsistent statistics that are independent of the sample size

    For example: any statistical moments (mean, variance, standard deviation, skewness, kurtosis)

    Independent here does not mean "not having the sample size in the equation". Notice how the sample mean also has the sample size in its equation, yet it is still independent of the sample size: duplicating every observation k times gives (k·Σx)/(k·n) = Σx/n, the same mean.

    For these types of statistics, you cannot compute the exact values because the answer may vary on different sample sizes. However, you can conclude, for example, the standard deviation of Group A is generally higher than the standard deviation of Group B.

  2. Inconsistent statistics that are dependent on the sample size

    For example: standard error of the mean and sum

    Standard error, however, depends on the sample size. Let's have a look at its equation.

    SEM = s / √n

    We can view the standard error as the standard deviation divided by the square root of the sample size, so it depends on the sample size. The sum is also dependent on the sample size (a small demonstration follows after this list).

    For these types of statistics, we cannot conclude anything because we are missing an important piece of information: the sample size.
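
As a small illustration (not from the original answer), the snippet below duplicates the value column from the question several times: the sample standard deviation settles toward a fixed value as the sample grows, while the standard error of the mean keeps shrinking.

import numpy as np

values = np.array([1, 5, 3, 7])

for k in (1, 2, 4, 8):                  # duplicate the sample k times
    sample = np.tile(values, k)
    std = sample.std(ddof=1)            # sample standard deviation
    sem = std / np.sqrt(len(sample))    # standard error of the mean
    print(f"n={len(sample):2d}  std={std:.3f}  sem={sem:.3f}")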

Answered Oct 18 '22 by Troll


Instead of merging two groupby operations, use named aggregation after weighting the values:

  1. Generate weighted values using assign.
  2. Aggregate using {output_col: (input_col, agg_function), ...}.
dfs = df.assign(weighted=df.value * df.weight).groupby('location').agg(**{
    'median': ('weighted', 'median'),
    'skew': ('weighted', 'skew'),
    'std': ('weighted', 'std'),
    '0.1': ('weighted', lambda x: x.quantile(0.1)),
    '0.9': ('weighted', lambda x: x.quantile(0.9)),
    '0.25': ('weighted', lambda x: x.quantile(0.25)),
    '0.75': ('weighted', lambda x: x.quantile(0.75)),
    '0.5': ('weighted', lambda x: x.quantile(0.5)),
})

Output:

          median  skew       std   0.1   0.9  0.25  0.75  0.5
location                                                     
a            0.7   NaN  0.282843  0.54  0.86  0.60  0.80  0.7
b            1.9   NaN  0.707107  1.50  2.30  1.65  2.15  1.9

Answered Oct 18 '22 by tdy