I have a data frame that I want to sample based on an argument num_samples.
I want to uniformly sample based on Age across quantiles.
For example, if my dataframe has 1000 rows and num_samples = 0.5, I would need to sample 500 rows in total, with 125 from each of the four quantiles.
The first few records of my dataframe look like this:
Age x1 x2 x3
12 1 1 2
45 2 1 3
67 4 1 2
11 3 4 10
18 9 7 6
45 3 5 8
78 8 4 7
64 6 2 3
33 3 2 2
How can I do this in python/pandas?
Create a column quantile that holds the quantile bin for each Age value. Then use boolean masking and sample to draw rows from each bin, and use pd.concat to combine the samples obtained for each bin.
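import pandas as pd
# Bin Age into four quantile groups
labels = ['q1', 'q2', 'q3', 'q4']
df['quantile'] = pd.qcut(df.Age, q=4, labels=labels)
# Draw one row from each bin and stack the results
out = pd.concat([df[df['quantile'].eq(label)].sample(1) for label in labels])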
Prints:
>>> out
   Age  x1  x2  x3 quantile
4   18   9   7   6       q1
8   33   3   2   2       q2
7   64   6   2   3       q3
2   67   4   1   2       q4
P.S. To draw n rows from each bin instead of one, change sample(1) to sample(n).
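To tie this back to the num_samples fraction in the question, here is a minimal sketch that converts the fraction into an equal per-bin row count. The function name sample_by_age_quartile, the seed parameter, and the demo frame are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def sample_by_age_quartile(df, num_samples, seed=None):
    """Sample roughly len(df) * num_samples rows, split evenly across Age quartiles.

    Hypothetical helper: the name and the seed parameter are illustrative.
    """
    labels = ["q1", "q2", "q3", "q4"]
    # Assign each row to an Age quartile without modifying the input frame
    quartile = pd.qcut(df["Age"], q=4, labels=labels)
    # Equal share of the requested total for every bin (integer division)
    per_bin = int(len(df) * num_samples) // len(labels)
    parts = []
    for label in labels:
        rows = df[quartile.eq(label)]
        # Cap at the bin size so small bins do not raise an error
        parts.append(rows.sample(n=min(per_bin, len(rows)), random_state=seed))
    return pd.concat(parts)


# Demo data (assumed): 1000 rows with distinct ages so the bins are equal-sized
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Age": rng.permutation(1000), "x1": rng.integers(0, 10, 1000)})
out = sample_by_age_quartile(demo, num_samples=0.5, seed=0)
```

With 1000 rows and num_samples = 0.5 this draws 125 rows from each bin. When ages are heavily tied, qcut can produce bins of unequal size, so the cap on n may return fewer rows than requested.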