Given a dataset like this:
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
[out]:
      key  freq   percent
0     ABC   100  0.328947
1     DEF    60  0.197368
2     GHI    50  0.164474
3     JKL    40  0.131579
4     MNO    13  0.042763
5     PQR    11  0.036184
6     STU    10  0.032895
7     VWX    10  0.032895
8     YZZ     3  0.009868
9    WHYQ     3  0.009868
10  HOWEE     2  0.006579
11    DUH     1  0.003289
12   HAHA     1  0.003289
The goal is to split the keys into three groups based on how much of the total frequency is still unassigned when each key is reached, walking from the most frequent key down: keys reached while more than 50% of the total is left, keys reached while between 10% and 50% is left, and the rest (the bottom 10%). A few keys are then drawn at random from each group; there is a small check of the cut-offs after the expected groups below.
In this case, the groups that fit are:
['ABC', 'DEF']
['GHI', 'JKL', 'MNO', 'PQR']
['VWX', 'STU', 'YZZ', 'WHYQ', 'HOWEE', 'HAHA', 'DUH']
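To make those cut-offs concrete, here is a small check (a sketch of my own, reusing df from the snippet above): it prints how much of the total is still unassigned when each key is reached in descending-frequency order, and the 0.5 and 0.1 thresholds fall between the groups listed above:
# Sketch: share of the total still unassigned when each key is reached,
# walking from the most frequent key down (reuses df defined above).
df_sorted = df.sort_values(by=['freq', 'key'], ascending=False)
remaining = 1 - df_sorted['percent'].cumsum().shift(fill_value=0.0)
print(pd.concat([df_sorted['key'], remaining.rename('remaining')], axis=1))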
I've tried this:
import random
import pandas as pd
rows = [{'key': 'ABC', 'freq': 100}, {'key': 'DEF', 'freq': 60},
        {'key': 'GHI', 'freq': 50}, {'key': 'JKL', 'freq': 40},
        {'key': 'MNO', 'freq': 13}, {'key': 'PQR', 'freq': 11},
        {'key': 'STU', 'freq': 10}, {'key': 'VWX', 'freq': 10},
        {'key': 'YZZ', 'freq': 3}, {'key': 'WHYQ', 'freq': 3},
        {'key': 'HOWEE', 'freq': 2}, {'key': 'DUH', 'freq': 1},
        {'key': 'HAHA', 'freq': 1}]
df = pd.DataFrame(rows)
df['percent'] = df['freq'] / sum(df['freq'])
bin_50_100 = []
bin_10_50 = []
bin_10 = []
total_percent = 1.0
# walk from the most frequent key down, tracking how much of the total is left
for idx, row in df.sort_values(by=['freq', 'key'], ascending=False).iterrows():
    if total_percent > 0.5:            # more than half of the mass still unassigned
        bin_50_100.append(row['key'])
    elif total_percent > 0.1:          # between 10% and 50% left
        bin_10_50.append(row['key'])
    else:                              # bottom 10%
        bin_10.append(row['key'])
    total_percent -= row['percent']
print(random.sample(bin_50_100, 1))
print(random.sample(bin_10_50, 2))
print(random.sample(bin_10, 4))
[out]:
['DEF']
['MNO', 'PQR']
['HOWEE', 'WHYQ', 'HAHA', 'DUH']
But is there a simpler way to solve the problem?
Let's try:
bins = [0, 0.1, 0.5, 1]     # cumulative-share edges: bottom 10%, 10%-50%, 50%-100%
samples = [4, 2, 1]         # how many keys to draw from each of those bins
# df is already ordered from most to least frequent, so reversing it and taking
# the cumulative sum accumulates the share from the rarest key upwards.
df['sample'] = pd.cut(df.percent[::-1].cumsum(),
                      bins=bins,
                      labels=samples    # label each bin with its sample count
                      ).astype(int)
df.groupby('sample').apply(lambda x: x.sample(n=x['sample'].iloc[0]))
Output:
              key  freq   percent  sample
sample
1       0     ABC   100  0.328947       1
2       2     GHI    50  0.164474       2
        5     PQR    11  0.036184       2
4       7     VWX    10  0.032895       4
        6     STU    10  0.032895       4
       12    HAHA     1  0.003289       4
       10   HOWEE     2  0.006579       4
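As a follow-up (my own addition, not part of the answer above), the same 'sample' column also recovers the full groups, matching the expected lists at the top:
# Follow-up sketch: full membership of each cumulative-share bin,
# keyed by the number of keys that will be drawn from it.
print(df.groupby('sample')['key'].apply(list))
Group 1 holds ABC and DEF, group 2 holds GHI through PQR, and group 4 holds the long tail, so the sample sizes 1, 2 and 4 line up with the three groups from the original loop.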