I have a pandas dataframe (call it data) with categorical and continuous values that look like this:
INDEX  AGE  SEX  INCOME  COUNTRY  INSTANCE_WEIGHT
1      25   M    30000   USA      120
2      53   F    42000   FR       95
3      37   F    22000   USA      140
4      18   M    0       FR       110
...
15000  29   F    39000   USA      200
The instance weight indicates the number of people in the population that each record represents due to stratified sampling.
What I would like to do is plot the distribution of each variable as a histogram. The problem is that I can't just plot a histogram of the current dataframe, since it is not representative of the real distribution. To be representative, I have to weight each row by its instance_weight before plotting it. The problem sounds easy, but I can't find a good way of doing that.
One solution would be to duplicate each row instance_weight times, but the real dataframe has 300k rows and instance_weight is around 1000.
This is the code I have for now to plot a histogram of each column:
from math import ceil

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(20, 70))
cols = 4
rows = ceil(float(data.shape[1]) / cols)
for i, column in enumerate(data.drop(["instance_weight", "index"], axis=1).columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    # Check whether the column is categorical (object dtype) or numeric.
    # The np.object alias was removed in NumPy 1.24; the builtin object works instead.
    if data.dtypes[column] == object:
        data[column].value_counts().plot(kind="bar", ax=ax,
                                         alpha=0.8, color=sns.color_palette(n_colors=1))
    else:
        data[column].hist(ax=ax, alpha=0.8)
    plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=1, wspace=0.2)
How can I take the weights into account now?
You should use the 'weights' argument of the matplotlib 'hist' function, which is also available through the pandas 'plot' function.
In your example, to plot the distribution of the variable 'AGE' weighted by the variable 'INSTANCE_WEIGHT' (with df being your dataframe), you should do:
df["AGE"].plot(kind="hist", weights=df["INSTANCE_WEIGHT"])
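The same idea can be extended to the loop over all columns. What follows is only a minimal sketch, not a definitive implementation: the toy dataframe copies the four sample rows from the question, the helper name weighted_counts is made up for illustration, and it calls matplotlib's ax.hist directly with its weights argument. For categorical columns there is no weights argument on value_counts(), so the sketch assumes that summing INSTANCE_WEIGHT per category is the weighted count you want.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data copied from the sample rows in the question (illustrative only)
df = pd.DataFrame({
    "AGE": [25, 53, 37, 18],
    "SEX": ["M", "F", "F", "M"],
    "INCOME": [30000, 42000, 22000, 0],
    "COUNTRY": ["USA", "FR", "USA", "FR"],
    "INSTANCE_WEIGHT": [120, 95, 140, 110],
})
weights = df["INSTANCE_WEIGHT"]


def weighted_counts(frame, column, weight_column="INSTANCE_WEIGHT"):
    """Hypothetical helper: sum the weights per category,
    i.e. a weighted analogue of value_counts()."""
    totals = frame.groupby(column)[weight_column].sum()
    return totals.sort_values(ascending=False)


columns = [c for c in df.columns if c != "INSTANCE_WEIGHT"]
fig, axes = plt.subplots(1, len(columns), figsize=(16, 4))
for ax, column in zip(axes, columns):
    ax.set_title(column)
    if pd.api.types.is_numeric_dtype(df[column]):
        # Each row contributes its weight to its bin instead of a count of 1
        ax.hist(df[column], weights=weights, alpha=0.8)
    else:
        # Bar height = total weight of the rows in that category
        weighted_counts(df, column).plot(kind="bar", ax=ax, alpha=0.8)
fig.tight_layout()
# In an interactive session, call plt.show() to display the figure.
```

This avoids duplicating rows: the bar heights add up to the total weight (the represented population) rather than to the number of records.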