Pandas sample with weights

Tags: pandas, sample

I have a DataFrame df and I'd like to sample from it so that the sample follows the distribution of one of its variables. Let's say df['type'].value_counts(normalize=True) returns:

A 0.3
B 0.5
C 0.2

I'd like to do something like sampledf = df.sample(weights=df['type'].value_counts(normalize=True)) such that sampledf['type'].value_counts(normalize=True) returns almost the same distribution. How can I pass a dict of frequencies here?

Bear asked Mar 07 '19 11:03


2 Answers

The weights argument needs one weight per row, i.e. a Series of the same length as the original df (aligned on its index), so the simplest approach is to add it as a column:

df['freq'] = df.groupby('type')['type'].transform('count')
# sample() draws a single row by default; pass n= or frac= for a larger sample
sampledf = df.sample(weights=df['freq'])

Or without adding the column:

sampledf = df.sample(weights=df.groupby('type')['type'].transform('count'))
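
As a rough illustration of the snippets above, here is a minimal sketch on a made-up frame. The data, n=1000, frac=0.5 and random_state are assumptions chosen for the example, not part of the original question. It shows that the transform('count') result has one weight per row, and it prints the type shares of a weighted sample next to those of a plain unweighted sample. One thing worth checking on your own data: since each row's weight is its group size, frequent types can end up over-represented relative to their original share, whereas an unweighted df.sample(frac=...) already tends to reproduce the original distribution.

```python
import pandas as pd

# Hypothetical data: 300 rows of A, 500 of B, 200 of C (shares 0.3 / 0.5 / 0.2).
df = pd.DataFrame({"type": ["A"] * 300 + ["B"] * 500 + ["C"] * 200})

# One weight per row: the size of the group the row belongs to.
# The result shares df's index, so sample() can align it row by row.
freq = df.groupby("type")["type"].transform("count")

# Weighted sample (with replacement so n is not limited by len(df)).
weighted = df.sample(n=1000, weights=freq, replace=True, random_state=0)

# Plain uniform sample for comparison: no weights at all.
uniform = df.sample(frac=0.5, random_state=0)

print(df["type"].value_counts(normalize=True).sort_index())
print(weighted["type"].value_counts(normalize=True).sort_index())
print(uniform["type"].value_counts(normalize=True).sort_index())
```

Comparing the three printed distributions should make it clear whether weighting by frequency or plain uniform sampling is closer to what you want.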
Josh Friedlander answered Nov 08 '22 20:11


In addition to the answer above, note that if you want to sample each type equally, you should weight each row by the inverse of its group size:

df['freq'] = 1.0 / df.groupby('type')['type'].transform('count')
sampledf = df.sample(weights=df['freq'])

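Because sample normalizes the weights so that they sum to 1, this works for any number of classes, not only two. The same weights can also be written in the general "balanced" form, which differs from 1/count only by a constant factor:

w_j = n_samples / (n_classes * n_samples_j)

Here n_samples is the total number of rows, n_classes is the number of distinct types, and n_samples_j is the number of rows of type j.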
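
As a rough check of the formula above, the sketch below computes w_j per row on a made-up three-class frame and compares it with the simpler 1/count weights. The data, n=3000 and random_state are assumptions for illustration only. Since the two weightings differ by a constant factor and sample normalizes weights to sum to 1, they should lead to the same sampling probabilities, with each type getting roughly an equal share.

```python
import pandas as pd

# Hypothetical data with three imbalanced classes.
df = pd.DataFrame({"type": ["A"] * 300 + ["B"] * 500 + ["C"] * 200})

counts = df.groupby("type")["type"].transform("count")  # n_samples_j per row
n_samples = len(df)
n_classes = df["type"].nunique()

# w_j = n_samples / (n_classes * n_samples_j), assigned to every row of class j.
balanced = n_samples / (n_classes * counts)
inverse = 1.0 / counts

# After normalization both weightings give the same per-row probabilities.
p_balanced = balanced / balanced.sum()
p_inverse = inverse / inverse.sum()
print((p_balanced - p_inverse).abs().max())

# Total probability per class: each class should get about 1 / n_classes.
print(p_balanced.groupby(df["type"]).sum())

sampledf = df.sample(n=3000, weights=balanced, replace=True, random_state=0)
print(sampledf["type"].value_counts(normalize=True).sort_index())
```

In practice this means the shorter 1/count form is usually enough for df.sample; the balanced form mainly matters when the absolute scale of the weights is used elsewhere.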
Richard answered Nov 08 '22 21:11