
Using describe() with weighted data -- mean, standard deviation, median, quantiles

I'm fairly new to Python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. (I've searched the documentation as well as this site for an answer and haven't found one yet.)

I've got a dataframe (called resp) containing respondent level survey data. I want to perform some basic descriptive statistics on one of the fields (called anninc [short for annual income]).

resp["anninc"].describe()

Which gives me the basic stats:

count     76310.000000
mean      43455.874862
std       33154.848314
min           0.000000
25%       20140.000000
50%       34980.000000
75%       56710.000000
max      152884.330000
dtype: float64

But there's a catch. Because of how the sample was built, the respondent data has to be weighted so that not every respondent is treated as "equal" in the analysis. I have another column in the dataframe (called tufnwgrp) that holds the weight to apply to each record during the analysis.

In my prior SAS life, most of the procs had options to process data with weights like this. For example, a standard proc univariate that gives the same results as describe() would look something like this:

proc univariate data=resp;
  var anninc;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

And the same analysis using weighted data would look something like this:

proc univariate data=resp;
  var anninc;
  weight tufnwgrp;
  output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;

Is there a similar sort of weighting option available in pandas for methods like describe() etc?

Chris Chapo asked Jul 17 '13 00:07



1 Answer

There is a statistics and econometrics library (statsmodels) that appears to handle this. Here's an example that extends @MSeifert's answer here on a similar question.

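import pandas as pd
from statsmodels.stats.weightstats import DescrStatsW

# Toy data: values 1..100, each weighted by its own value
df = pd.DataFrame({'x': range(1, 101), 'wt': range(1, 101)})

# ddof=1 requests the sample standard deviation
wdf = DescrStatsW(df.x, weights=df.wt, ddof=1)

print(wdf.mean)
print(wdf.std)
print(wdf.quantile([0.25, 0.50, 0.75]))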

67.0
23.6877840059
p
0.25    50
0.50    71
0.75    87

I don't use SAS, but the DescrStatsW output above matches the result of this Stata command:

sum x [fw=wt], detail

Stata actually has a few weight options, and in this case it gives a slightly different answer if you specify aw (analytical weights) instead of fw (frequency weights). Also, Stata requires fw to be an integer, whereas DescrStatsW allows non-integer weights. Weights are more complicated than you'd think... This is starting to get into the weeds, but there is a great discussion of weighting issues for calculating the standard deviation here.
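
To make the fw versus aw difference concrete, here is a minimal pure-Python sketch of the two conventions as I understand them. The function name weighted_std and its kind argument are made up for illustration and are not part of statsmodels or Stata. It assumes frequency weights divide by (sum of weights - 1), while analytic weights are first rescaled to sum to the number of rows n and then divide by (n - 1).

```python
import math

def weighted_std(values, weights, kind="fw"):
    """Weighted sample standard deviation under two weight conventions.

    kind="fw": frequency weights; each weight counts as that many repeated
               observations, so the denominator is sum(weights) - 1.
    kind="aw": analytic weights (assumed convention); weights are rescaled
               to sum to the number of rows n, so the denominator is n - 1.
    """
    values = list(values)
    weights = list(weights)
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    # Weighted sum of squared deviations from the weighted mean
    ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    if kind == "fw":
        return math.sqrt(ss / (total - 1))
    if kind == "aw":
        n = len(values)
        return math.sqrt((ss / total) * n / (n - 1))
    raise ValueError("kind must be 'fw' or 'aw'")

# Same toy data as the example: values 1..100 weighted by themselves
x = range(1, 101)
wt = range(1, 101)
print(weighted_std(x, wt, "fw"))  # about 23.6878, the same as the std printed earlier
print(weighted_std(x, wt, "aw"))  # about 23.8048, slightly larger
```

The function takes any iterables, so pandas columns such as df.x and df.wt should work as well.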

Also note that DescrStatsW does not appear to include functions for min and max, but as long as your weights are non-zero this should not be a problem as the weights don't affect the min and max. However, if you did have some zero weights, it might be nice to have weighted min and max, but it's also easy to calculate in pandas:

df.x[ df.wt > 0 ].min()
df.x[ df.wt > 0 ].max()
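
If you are curious where weighted quartiles like 50, 71 and 87 come from, here is a rough pure-Python sketch of one common definition: sort the values, accumulate the weights, and take the first value whose cumulative weight reaches p times the total weight. The helper weighted_quantile is hypothetical, not a statsmodels function, and library implementations may handle a cumulative weight that lands exactly on the target differently, so treat it as illustrative only.

```python
def weighted_quantile(values, weights, probs):
    """Inverse-CDF weighted quantiles.

    For each p in probs, return the smallest value whose cumulative
    weight is at least p * (total weight). Rows with zero weight are
    ignored.
    """
    pairs = sorted((x, w) for x, w in zip(values, weights) if w > 0)
    if not pairs:
        raise ValueError("need at least one row with a positive weight")
    total = sum(w for _, w in pairs)
    results = []
    for p in probs:
        target = p * total
        running = 0
        for x, w in pairs:
            running += w
            if running >= target:
                results.append(x)
                break
        else:
            # Floating-point rounding kept us just short of the target
            results.append(pairs[-1][0])
    return results

# Same toy data as above: values 1..100 weighted by themselves
print(weighted_quantile(range(1, 101), range(1, 101), [0.25, 0.50, 0.75]))
# [50, 71, 87]
```

Because zero-weight rows are dropped up front, passing probs of 0 and 1 gives the same weighted min and max as the two pandas lines above.
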
JohnE answered Oct 17 '22 05:10