I want to randomly shuffle the values for one single column of a dataframe based on a groupby. E.g., I have two columns A and B. Now, I want to randomly shuffle column B based on a groupby on A.
For example, suppose that there are three distinct values in A. For each distinct value of A, I want to shuffle the values in B, but only among rows that share the same A.
Example input:
A B
------------
1 1
1 3
2 4
3 6
1 2
3 5
Example output:
A B
------------
1 3
1 2
2 4
3 6
1 1
3 5
In this case, for A=1 the values for B got shuffled. The same happened for A=2, but as there is only one row it stayed like it was. For A=3, by chance the values for B also stayed like they were.
I want to achieve this with pandas.
For this you could combine np.random.permutation (which returns a shuffled version of an array) with a groupby and a transform (which returns a like-indexed version of the group). For example:
>>> df
col1 col2
0 1 1
1 1 3
2 2 4
3 3 6
4 1 2
5 3 5
>>> df["col3"] = df.groupby("col1")["col2"].transform(np.random.permutation)
>>> df
col1 col2 col3
0 1 1 2
1 1 3 1
2 2 4 4
3 3 6 5
4 1 2 3
5 3 5 6
Note that the values are only shuffled within their col1 groups.
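If you need the shuffle to be repeatable, the same idea can be wrapped in a small helper that uses a seeded generator. This is only a sketch: the function name shuffle_within_groups and its seed parameter are made up for illustration, and it assumes a NumPy version that provides np.random.default_rng.

```python
import numpy as np
import pandas as pd


def shuffle_within_groups(df, group_col, value_col, seed=None):
    """Return value_col shuffled within each group_col group, aligned to df.index."""
    rng = np.random.default_rng(seed)  # seeded generator for repeatable shuffles
    return df.groupby(group_col)[value_col].transform(
        lambda s: rng.permutation(s.to_numpy())
    )


df = pd.DataFrame({"col1": [1, 1, 2, 3, 1, 3],
                   "col2": [1, 3, 4, 6, 2, 5]})
df["col3"] = shuffle_within_groups(df, "col1", "col2", seed=0)

# Each group keeps the same set of values; only their order may change.
for key, group in df.groupby("col1"):
    assert sorted(group["col2"]) == sorted(group["col3"])
```

Because transform returns a result aligned to the original index, the rows stay where they were and only the values in the new column move around inside each group.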
You can also use groupby together with sample:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 1, 3],
                   'col2': [1, 3, 4, 6, 2, 5]})
# Shuffle the rows of each col1 group, then discard the shuffled index
df_rand = df.groupby('col1').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
>>> df.sort_values('col1')
col1 col2
0 1 1
1 1 3
4 1 2
2 2 4
3 3 6
5 3 5
>>> df_rand
col1 col2
0 1 2
1 1 3
2 1 1
3 2 4
4 3 6
5 3 5
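Recent pandas releases have changed how groupby(...).apply treats the grouping column, so the apply-based version may warn or behave differently depending on your version. A possible alternative, assuming pandas 1.1 or later, is to call sample directly on the groupby object. The random_state value below is an arbitrary choice to make the sketch repeatable.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2, 3, 1, 3],
                   "col2": [1, 3, 4, 6, 2, 5]})

# Sample 100% of each col1 group without replacement, i.e. shuffle the
# rows inside each group, then replace the shuffled index with 0..n-1.
df_rand = (
    df.groupby("col1")
      .sample(frac=1, random_state=0)
      .reset_index(drop=True)
)

print(df_rand)
```

Like the apply version, this shuffles whole rows rather than a single column, and the result does not keep the original row order of df.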