Weights for histogram in pandas

I have a pandas dataframe (call it data) with categorical and continuous values that look like this:

INDEX  AGE  SEX  INCOME  COUNTRY  INSTANCE_WEIGHT
1      25   M    30000   USA      120
2      53   F    42000   FR       95
3      37   F    22000   USA      140
4      18   M    0       FR       110
.
.
.
15000  29   F    39000   USA      200

The instance weight indicates the number of people in the population that each record represents due to stratified sampling.

What I would like to do is plot the distribution of each variable as a histogram. The problem is that I can't just plot a histogram of the dataframe as it is, since the raw rows are not representative of the real distribution. To be representative, each row has to count instance_weight times in the plot. The problem sounds easy, but I can't find a good way of doing that.

One solution would be to duplicate each row instance_weight times, but the real dataframe has 300k rows and instance_weight is around 1000.

This is the code I have for now to plot a histogram of each column.
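
For reference, the duplication idea can be checked on a tiny sample without building the expanded dataframe. The sketch below is only an illustration: it uses the four sample rows from the table above and a hypothetical set of bin edges, and compares a histogram of the physically repeated ages with NumPy's `np.histogram` called with its `weights` argument.

```python
import numpy as np

# The four sample rows shown in the table above
ages = np.array([25, 53, 37, 18])
weights = np.array([120, 95, 140, 110])

# Hypothetical bin edges, chosen only for this illustration
bins = [0, 20, 40, 60]

# Naive approach: physically repeat each age `weight` times, then count
expanded_counts, _ = np.histogram(np.repeat(ages, weights), bins=bins)

# Weighted approach: each row contributes its weight to its bin
weighted_counts, _ = np.histogram(ages, bins=bins, weights=weights)

print(expanded_counts)
print(weighted_counts)
```

Both calls give the same bin counts, so the expanded copy (roughly 300k rows times a weight of about 1000) should not be needed just to get the counts.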

from math import ceil

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(20, 70))
cols = 4
rows = ceil(data.shape[1] / cols)
for i, column in enumerate(data.drop(["instance_weight", "index"], axis=1).columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    # Object (string) columns are treated as categorical: bar chart of counts
    if data.dtypes[column] == object:
        data[column].value_counts().plot(kind="bar", ax=ax,
                                         alpha=0.8, color=sns.color_palette(n_colors=1))
    else:
        # Numeric columns: plain (unweighted) histogram on the same subplot
        data[column].hist(ax=ax, alpha=0.8)
        plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=1, wspace=0.2)

How can I take the weights into account?

Thomas Reynaud asked Apr 12 '16 22:04


1 Answer

You should use the 'weights' argument of the matplotlib 'hist' function, which is also available through the pandas 'plot' function.

In your example, to plot the distribution of the variable 'AGE' weighted by the variable 'INSTANCE_WEIGHT', you should do:

df["AGE"].plot(kind="hist", weights=df["INSTANCE_WEIGHT"])
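
The call above covers the continuous columns. For the categorical columns, where the question uses value_counts(), one possible counterpart is to sum the weights per category instead of counting rows. This is a minimal sketch, not part of the original answer: the helper name `weighted_value_counts` is hypothetical, and the small dataframe only reuses the four sample rows from the question.

```python
import pandas as pd

# Sample rows from the question, rebuilt here so the sketch is self-contained
df = pd.DataFrame({
    "AGE": [25, 53, 37, 18],
    "SEX": ["M", "F", "F", "M"],
    "COUNTRY": ["USA", "FR", "USA", "FR"],
    "INSTANCE_WEIGHT": [120, 95, 140, 110],
})


def weighted_value_counts(frame, column, weight_column="INSTANCE_WEIGHT"):
    """Hypothetical helper: sum of weights per category, largest first."""
    totals = frame.groupby(column)[weight_column].sum()
    return totals.sort_values(ascending=False)


country_totals = weighted_value_counts(df, "COUNTRY")
print(country_totals)
```

The resulting Series can be plotted in the same way as the output of value_counts(), for example with `country_totals.plot(kind="bar", ax=ax)` inside the loop from the question.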

queise answered Oct 19 '22 16:10