I have a large pandas dataframe with about 10,000,000 rows. Each one represents a feature vector. The feature vectors come in natural groups and the group label is in a column called group_id
. I would like to randomly sample 10%
say of the rows but in proportion to the numbers of each group_id
.
For example, if the group_id's
are A, B, A, C, A, B
then I would like half of my sampled rows to have group_id
A
, two sixths to have group_id
B
and one sixth to have group_id
C
.
I can see the pandas function sample but I am not sure how to use it to achieve this goal.
You can use groupby and sample
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
the following sample a total of N row where each group appear in its original proportion to the nearest integer, then shuffle and reset the index using:
df = pd.DataFrame(dict(
A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
B=range(20)
))
Short and sweet:
df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)
Long version
df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
I was looking for similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou's trying to do also makes sense when we want to extract samples from each group based on their proportions to the full data.
# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
# modified : sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=length(x)/n))
This is not as simple as just grouping and using .sample
. You need to actually get the fractions first. Since you said that you are looking to grab 10% of the total numbers of rows in different proportions, you will need to calculate how much each group will have to take out from the main dataframe. For instance, if we use the divide you mentioned in the question, then group A
will end up with 1/20
for a fraction of the total number of rows, group B
will get 1/30
and group C
ends up with 1/60
. You can put these fractions in a dictionary and then use .groupby
and pd.concat
to concatenate the number of rows* from each group into a dataframe. You will be using the n
parameter from the .sample
method instead of the frac
parameter.
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))
This is to highlight the importance in fulfilling the requirement that group_id A should have half of the sampled rows, group_id B two sixths of the sampled rows and group_id C one sixth of the sampled rows, regardless of the original group divides.
Starting with equal portions: each group starts with 40 rows
df1 = pd.DataFrame({'group_id': ['A','B', 'C']*40,
'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))
# group_id vals
# 12 A -0.175109
# 51 A -1.936231
# 81 A 2.057427
# 111 A 0.851301
# 114 A 0.669910
# 60 A 1.226954
# 73 B -0.166516
# 82 B 0.662789
# 94 B -0.863640
# 31 B 0.188097
# 101 C 1.802802
# 53 C 0.696984
print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
# group_id vals
# group_id
# A 24 A 0.161328
# 21 A -1.399320
# 30 A -0.115725
# 114 A 0.669910
# B 34 B -0.348558
# 7 B -0.855432
# 106 B -1.163899
# 79 B 0.532049
# C 65 C -2.836438
# 95 C 1.701192
# 80 C -0.421549
# 74 C -1.089400
First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).
Second solution: 4 rows for each group (each one third of the sampled rows)
Working with differently sized groups: 40 for A, 60 for B and 20 for C
df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))
# group_id vals
# 29 A 0.306738
# 35 A 1.785479
# 21 A -0.119405
# 4 A 2.579824
# 5 A 1.138887
# 11 A 0.566093
# 80 B 1.207676
# 41 B -0.577513
# 44 B 0.286967
# 77 B 0.402427
# 103 C -1.760442
# 114 C 0.717776
print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
# group_id vals
# group_id
# A 4 A 2.579824
# 32 A 0.451882
# 5 A 1.138887
# 17 A -0.614331
# B 47 B -0.308123
# 52 B -1.504321
# 42 B -0.547335
# 84 B -1.398953
# 61 B 1.679014
# 66 B 0.546688
# C 105 C 0.988320
# 107 C 0.698790
First solution: consistent Second solution: Now group B has taken 6 of the sampled rows when it's supposed to only take 4.
Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C
df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))
# group_id vals
# 48 A 1.214525
# 19 A -0.237562
# 0 A 3.385037
# 11 A 1.948405
# 8 A 0.696629
# 39 A -0.422851
# 62 B 1.669020
# 94 B 0.037814
# 67 B 0.627173
# 93 B 0.696366
# 104 C 0.616140
# 113 C 0.577033
print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))
# group_id vals
# group_id
# A 4 A 0.284448
# 11 A 1.948405
# 8 A 0.696629
# 0 A 3.385037
# 31 A 0.579405
# 24 A -0.309709
# B 70 B -0.480442
# 69 B -0.317613
# 96 B -0.930522
# 80 B -1.184937
# C 101 C 0.420421
# 106 C 0.058900
This is the only time the second solution offered some consistency (out of sheer luck, I might add).
I hope this proves useful.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With