 

Pandas Weighted Stats

I have a dataframe that looks like the one below.

The weight column essentially represents the frequency of each item, so that for each location the weights sum to 1.

Please keep in mind that this is a simplified dataset; in reality there are more than 100 columns like value.

import pandas as pd

d = {'location': ['a', 'a', 'b', 'b'], 'item': ['x', 'y', 's', 'v'], 'value': [1, 5, 3, 7], 'weight': [0.9, 0.1, 0.8, 0.2]}
df = pd.DataFrame(data=d)
df
  location item value weight
0     a     x     1     0.9
1     a     y     5     0.1
2     b     s     3     0.8
3     b     v     7     0.2

I currently have code that computes the grouped median, standard deviation, skew and quantiles for the unweighted data:

df = df[['location','value']]

df1 = df.groupby('location').agg(['median','skew','std']).reset_index()

df2 = df.groupby('location').quantile([0.1, 0.9, 0.25, 0.75, 0.5]).unstack(level=1).reset_index()

dfs = df1.merge(df2, how = 'left', on = 'location')

And the result is the following:

  location   value
             median skew      std  0.1  0.9 0.25 0.75  0.5
0      a         3  NaN  2.828427  1.4  4.6  2.0  4.0  3.0
1      b         5  NaN  2.828427  3.4  6.6  4.0  6.0  5.0

I would like to produce the exact same result data frame as the one above, however with weighted statistics using the weight column. How can I go about doing this?

One more important consideration: there are often rows where value is null but still has a weight associated with it.

Asked Oct 31 '21 by Mustard Tiger




2 Answers

Because the weights are frequency weights, the most accurate method is to duplicate the observations according to the weights.

Adjusting the weights

Normally, frequencies are whole numbers. However, the frequencies here merely show how often an item appears relative to the other items in the same group. In this case, you can multiply all the weights by a value that makes them integers and use that value consistently throughout the dataset; for example, multiplying [0.9, 0.1, 0.8, 0.2] by 10 gives [9, 1, 8, 2].

Here is a function that scales the weights to the smallest possible set of whole numbers, which keeps the number of duplicated rows (and therefore memory usage) to a minimum.

import numpy as np

def adjust(weights):
    # Smallest power of 10 that turns every weight into a whole number
    base = 10 ** max([len(str(i).split(".")[1]) for i in weights])
    # Divide out the GCD of the scaled weights to keep them as small as possible
    scalar = base / np.gcd.reduce((weights * base).astype(int))
    weights = weights * scalar

    return weights

You can refer to the following question to understand how this function works.

  • Multiply a Numpy array by a scalar to make every element an integer

import pandas as pd

df = pd.DataFrame({
    "location": ["a", "a", "b", "b"],
    "values": [1, 5, 3, 7],
    "weights": [0.9, 0.1, 0.8, 0.2]
})

df.loc[:, "weights"] = adjust(df["weights"])

Here are the weights after the adjustment.

>>> df
  location  values  weights
0        a      1      9.0
1        a      5      1.0
2        b      3      8.0
3        b      7      2.0

Duplicating the observations

After adjusting the weights, you need to duplicate the observations according to their weights.

df = df.loc[df.index.repeat(df["weights"])] \
    .reset_index(drop=True).drop("weights", axis=1)

You can refer to the following answer to understand how this process works.

  • Duplicate rows, according to value in a column

Let's count the number of observations after being duplicated.

>>> df.count()
location    20
values      20

Performing Statistical Operations

Now, you can use groupby and aggregate using any statistical operations. The data is now weighted.

df1 = df.groupby("location").agg(["median", "skew", "std"]).reset_index()
df2 = df.groupby("location").quantile([0.1, 0.9, 0.25, 0.75, 0.5]) \
    .unstack(level=1).reset_index()

print(df1.merge(df2, how="left", on="location"))

This gives the following output.

  location values
           median      skew       std  0.1  0.9 0.25 0.75  0.5
0        a    1.0  3.162278  1.264911  1.0  1.4  1.0  1.0  1.0
1        b    3.0  1.778781  1.686548  3.0  7.0  3.0  3.0  3.0

Interpreting the weighted statistics

Let's follow the same process as above, but instead of using the smallest possible integer weights, we will gradually scale them up and observe the results. Because the adjusted weights are already at their minimum values, every larger set of weights is simply a multiple of the current set. Only the following line changes (a sketch of the full loop follows the results below).

df.loc[:, "weights"] = adjust(df["weights"])
  • adjust(df["weights"]) * 2

      location values
               median      skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.887939  1.231174  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.624466  1.641565  3.0  7.0  3.0  3.0  3.0
    
  • adjust(df["weights"]) * 3

      location values
               median     skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.80912  1.220514  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.58013  1.627352  3.0  7.0  3.0  3.0  3.0
    
  • adjust(df["weights"]) * 4

      location values
               median      skew       std  0.1  0.9 0.25 0.75  0.5
    0        a    1.0  2.771708  1.215287  1.0  1.4  1.0  1.0  1.0
    1        b    3.0  1.559086  1.620383  3.0  7.0  3.0  3.0  3.0
    
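For reference, here is a minimal sketch of that loop (it is not part of the original answer). It hard-codes the adjusted base weights [9, 1, 8, 2] from the earlier output rather than calling adjust(), scales them by increasing multipliers, duplicates the rows, and recomputes the grouped statistics each time.

import pandas as pd

# Adjusted integer weights from the earlier output: [9, 1, 8, 2]
base_weights = [9, 1, 8, 2]
data = {"location": ["a", "a", "b", "b"], "values": [1, 5, 3, 7]}

for multiplier in (1, 2, 3, 4):
    df = pd.DataFrame(data)
    # Scale the base weights, then duplicate each row according to its scaled weight
    repeats = [w * multiplier for w in base_weights]
    expanded = df.loc[df.index.repeat(repeats)].reset_index(drop=True)
    stats = expanded.groupby("location")["values"].agg(["median", "skew", "std"])
    print(f"multiplier = {multiplier}")
    print(stats, end="\n\n")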

Repeating this process several times produces the following graph. The statistics in the graph are not split into groups, and some other statistics have been added for demonstration purposes.

[Figure: Comparing different statistics as the observations are duplicated]

Some statistics like sample mean, median, and quantiles are always constant no matter how many times we duplicate the observations.

Some statistics, on the other hand, give different results depending on how many duplications we make. Let's call them inconsistent statistics for now.

There are two types of inconsistent statistics.

  1. Inconsistent statistics that are independent of the sample size

    For example: any statistical moments (mean, variance, standard deviation, skewness, kurtosis)

    Independent here does not mean "not having the sample size in the equation". Notice how the sample mean also has the sample size in its equation, yet it is still independent of the sample size: duplicating every observation k times gives (k·Σx)/(k·n) = Σx/n, the same mean.

    For these types of statistics, you cannot compute the exact values because the answer may vary on different sample sizes. However, you can conclude, for example, the standard deviation of Group A is generally higher than the standard deviation of Group B.

  2. Inconsistent statistics that are dependent on the sample size

    For example: standard error of the mean and sum

    Standard error, however, depends on the sample size. Let's have a look at its equation.

    SEM = s / √n

    We can view the standard error as the standard deviation divided by the square root of the sample size, so it depends on the sample size. The sum is also dependent on the sample size (a small demonstration follows after this list).

    For these types of statistics, we cannot conclude anything because we are missing an important piece of information: the sample size.
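
As a small illustration (not from the original answer), the snippet below duplicates the value column from the question several times: the sample standard deviation settles toward a fixed value as the sample grows, while the standard error of the mean keeps shrinking.

import numpy as np

values = np.array([1, 5, 3, 7])

for k in (1, 2, 4, 8):                  # duplicate the sample k times
    sample = np.tile(values, k)
    std = sample.std(ddof=1)            # sample standard deviation
    sem = std / np.sqrt(len(sample))    # standard error of the mean
    print(f"n={len(sample):2d}  std={std:.3f}  sem={sem:.3f}")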

Answered Oct 18 '22 by Troll


Instead of merging two groupby operations, use named aggregation after weighting the values:

  1. Generate weighted values using assign.
  2. Aggregate using {output_col: (input_col, agg_function), ...}.
dfs = df.assign(weighted=df.value * df.weight).groupby('location').agg(**{
    'median': ('weighted', 'median'),
    'skew': ('weighted', 'skew'),
    'std': ('weighted', 'std'),
    '0.1': ('weighted', lambda x: x.quantile(0.1)),
    '0.9': ('weighted', lambda x: x.quantile(0.9)),
    '0.25': ('weighted', lambda x: x.quantile(0.25)),
    '0.75': ('weighted', lambda x: x.quantile(0.75)),
    '0.5': ('weighted', lambda x: x.quantile(0.5)),
})

Output:

          median  skew       std   0.1   0.9  0.25  0.75  0.5
location                                                     
a            0.7   NaN  0.282843  0.54  0.86  0.60  0.80  0.7
b            1.9   NaN  0.707107  1.50  2.30  1.65  2.15  1.9

Answered Oct 18 '22 by tdy