Say I have a dataframe like this:

category1  category2  other_col  another_col  ...
a          1
a          2
a          2
a          3
a          3
a          1
b          10
b          10
b          10
b          11
b          11
b          11
I want to obtain a sample from my dataframe so that each value of category1 appears a uniform number of times. I'm assuming that there are an equal number of each type in category1. I know that this can be done with pandas using DataFrame.sample(). However, I also want to ensure that the sample I select has category2 equally represented as well. So, for example, if I have a sample size of 5, I would want something such as:
a 1
a 2
b 10
b 11
b 10
I would not want something such as:
a 1
a 1
b 10
b 10
b 10
While this is a valid random sample of n=5, it would not meet my requirements, as I want to vary the types of category2 as much as possible.
Notice that in the first example, because a was only sampled twice, the value 3 from category2 was not represented. This is okay. The goal is just to represent the sample data as uniformly as possible.
If it helps to provide a clearer example, one could think of having the categories fruit, vegetables, meat, grains, junk. In a sample size of 10, I would want to represent each category as much as possible. So ideally, 2 of each. Then each of those 2 selected rows belonging to the chosen categories would have subcategories that are also represented as uniformly as possible. So, for example, fruit could have subcategories of red_fruits, yellow_fruits, etc. For the 2 fruit rows that are selected of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, if we had a larger sample size, we would include more of the subcategories of fruit (green_fruits, blue_fruits, etc.).
This is straightforward when you use the weights keyword in df.sample
:
>>> inv_freq = (df['category2'].value_counts() / len(df['category2'])) ** -1
>>> df.sample(n=5, weights=df['category2'].map(inv_freq))

Note that the per-value weights have to be mapped back onto the rows: if you pass the value_counts series directly, pandas aligns it with the dataframe's index (not with the category2 values), so most rows would silently get a weight of zero.
output:
    category1  category2
2           a          2
1           a          2
10          b         11
3           a          3
11          b         11
To explain, the weights look like this:
11 4.0
10 4.0
3 6.0
2 6.0
1 6.0
I just took the relative frequency of each value in df['category2'] and inverted it, which gives a nice uniform total weight across all values in the series.
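As a minimal runnable sketch of that weight computation (the toy frame below mirrors the question's data; the column names follow the question's example):

```python
import pandas as pd

# Toy frame matching the question's layout.
df = pd.DataFrame({
    'category1': ['a'] * 6 + ['b'] * 6,
    'category2': [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# Inverse relative frequency: rarer category2 values get larger weights,
# so every distinct value ends up with the same total weight.
weights = (df['category2'].value_counts() / len(df['category2'])) ** -1
print(weights)

# Map the per-value weights back onto the rows (index-aligned) before sampling.
sample = df.sample(n=5, weights=df['category2'].map(weights))
print(sample)
```

Here the values 1, 2, and 3 each occur twice out of 12 rows (weight 6.0) and 10 and 11 each occur three times (weight 4.0), so every distinct category2 value carries the same total weight of 12.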
Here is a solution that does a true random sample stratified by group (it won't give you equal samples every time, but it does on average, which is probably better from a statistical perspective anyway):
import numpy as np
import pandas as pd

def stratified_sample(df, sample_size_per_class, strat_cols):
    if isinstance(strat_cols, str):
        strat_cols = [strat_cols]
    df = df.copy()  # don't mutate the caller's frame
    # make randomizer (separately, in case we need it later?)
    vcs = {}
    randomizer = {}
    for c in strat_cols:
        # calculate number of distinct chunks relative to sample size
        # (clipped to at least 1 so np.random.randint never gets an empty range)
        _vc = df[c].value_counts()
        vcs[c] = (_vc / sample_size_per_class).round(0).astype(int).clip(lower=1).to_dict()
        # randomizer will divvy up the bigger groups into chunks of size
        # approximate to the smallest; bind c as a default argument so each
        # lambda keeps its own column instead of the loop's final value
        randomizer[c] = lambda v, c=c: np.random.randint(0, vcs[c][v])
    # apply randomizer
    randomized_cols = []
    for c in strat_cols:
        stratname = '_stratified_%s' % c
        randomized_cols.append(stratname)
        df[stratname] = df[c].apply(randomizer[c])
    # keep the rows that drew 0 in every stratification column, which should
    # be approximately sample_size_per_class rows per class
    keep = df[randomized_cols].max(axis=1) == 0
    return df.loc[keep, [c for c in df.columns if c not in randomized_cols]]
To test it:
test = pd.DataFrame({'category1': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                     'category2': [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11]})

lens = []
for i in range(1000):
    lens.append(
        len(
            stratified_sample(test, 3, ['category1', 'category2'])
        )
    )
print(np.mean(lens))
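If you need exact per-group counts rather than equal counts on average, a simpler alternative is to sample a fixed number of rows from each category1 group, using inverse-frequency weights on category2 within each group. This is a sketch, not part of the answer above; the function name exact_stratified_sample is made up here, and it assumes a pandas version with groupby/apply as used below:

```python
import pandas as pd

test = pd.DataFrame({'category1': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
                     'category2': [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11]})

def exact_stratified_sample(df, n_per_group, group_col, balance_col):
    """Take exactly n_per_group rows from each group_col group; within each
    group, weight rows by the inverse frequency of their balance_col value."""
    def _sample(g):
        # Rarer balance_col values in this group get proportionally larger weights.
        w = 1.0 / g[balance_col].map(g[balance_col].value_counts())
        return g.sample(n=min(n_per_group, len(g)), weights=w)
    return df.groupby(group_col, group_keys=False).apply(_sample)

result = exact_stratified_sample(test, 3, 'category1', 'category2')
print(result)
```

The trade-off versus the randomized approach above: you always get exactly 3 rows per category1 group, but the balance over category2 is only encouraged by the weights, not guaranteed.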