
Shuffle a pandas dataframe by groups

My dataframe looks like this:

sampleID  col1 col2
   1        1   63
   1        2   23
   1        3   73
   2        1   20
   2        2   94
   2        3   99
   3        1   73
   3        2   56
   3        3   34

I need to shuffle the dataframe while keeping rows with the same sampleID together, and the order of col1 within each sample must stay the same as in the dataframe above.

So I need it like this

sampleID  col1 col2
   2        1   20
   2        2   94
   2        3   99
   3        1   73
   3        2   56
   3        3   34
   1        1   63
   1        2   23
   1        3   73

How can I do this? If my example is not clear please let me know.

Test Test asked Aug 09 '17 08:08


2 Answers

Assuming you want to shuffle by sampleID: first split the dataframe with df.groupby, shuffle the resulting list of groups with random.shuffle, and then reassemble them with pd.concat:

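import random
import pandas as pd

# One sub-frame per sampleID; row order inside each group is preserved
groups = [group for _, group in df.groupby('sampleID')]
random.shuffle(groups)  # shuffles the list of groups in place

pd.concat(groups).reset_index(drop=True)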

   sampleID  col1  col2
0         2     1    20
1         2     2    94
2         2     3    99
3         1     1    63
4         1     2    23
5         1     3    73
6         3     1    73
7         3     2    56
8         3     3    34

The call to reset_index(drop=True) only renumbers the rows from 0 and discards the old index; it is an optional step.
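For reference, here is a self-contained sketch of the same groupby / shuffle / concat idea, wrapped in a function. The function name shuffle_groups and the seed parameter are illustrative assumptions and not part of the original answer; the sample data is copied from the question. Passing sort=False to groupby keeps the groups in first-appearance order before shuffling, which does not matter for the final result but avoids an unnecessary sort.

```python
import random

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})


def shuffle_groups(frame, key="sampleID", seed=None):
    """Shuffle whole groups of rows; rows inside a group keep their order.

    `shuffle_groups` and `seed` are hypothetical names used for this sketch.
    """
    rng = random.Random(seed)  # local RNG so a seed gives repeatable output
    groups = [group for _, group in frame.groupby(key, sort=False)]
    rng.shuffle(groups)  # reorders the list of sub-frames in place
    return pd.concat(groups).reset_index(drop=True)


shuffled = shuffle_groups(df, seed=0)
print(shuffled)
```

Which group ends up first depends on the seed, so no particular output is shown here. Leaving seed=None gives a different order on each run.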

cs95 answered Sep 19 '22 05:09


I found this to be significantly faster than the accepted answer:

import random

ids = df["sampleID"].unique()
random.shuffle(ids)  # shuffles the array of unique IDs in place
df = df.set_index("sampleID").loc[ids].reset_index()

For some reason pd.concat was the bottleneck in my use case. Either way, this approach avoids the concatenation.
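A runnable sketch of this index-based approach is below. It swaps random.shuffle for NumPy's Generator.permutation so the order can be seeded and the array of IDs is not modified in place; that substitution, the function name shuffle_by_id, and the seed parameter are assumptions made for illustration, not part of the answer above. The sample data is copied from the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sampleID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "col1": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "col2": [63, 23, 73, 20, 94, 99, 73, 56, 34],
})


def shuffle_by_id(frame, key="sampleID", seed=None):
    """Reorder rows by a random permutation of the unique key values.

    `shuffle_by_id` and `seed` are hypothetical names used for this sketch.
    """
    rng = np.random.default_rng(seed)
    ids = rng.permutation(frame[key].unique())  # shuffled copy of the IDs
    # .loc with a list of labels on a non-unique index returns every row
    # for each label, in the order the labels are given.
    return frame.set_index(key).loc[ids].reset_index()


shuffled2 = shuffle_by_id(df, seed=0)
print(shuffled2)
```

Note that reset_index places the key column first, which matches the column layout in the question. Whether this is faster than the concat version likely depends on the number and size of the groups, so it is worth timing on your own data.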

sachinruk answered Sep 21 '22 05:09