
Shuffle column in pandas DataFrame with groupby

Tags: python, pandas

I want to randomly shuffle the values of a single column of a dataframe based on a groupby. E.g., I have two columns, A and B, and I want to randomly shuffle column B within the groups defined by A.

For example, suppose that there are three distinct values in A. For each distinct value of A, I want to shuffle the values in B, but only among rows that have the same A.

Example input:

A       B     
------------
1       1          
1       3    
2       4     
3       6   
1       2  
3       5   

Example output:

A       B        
------------
1       3          
1       2    
2       4     
3       6   
1       1  
3       5  

In this case, the values of B were shuffled for A=1. A=2 has only one row, so its value stayed as it was. For A=3, the values of B happened to stay in the same order by chance.

I want to achieve this with pandas.

fsociety asked Dec 04 '22 02:12


2 Answers

For this you could combine np.random.permutation (which returns a shuffled version of an array) with a groupby and a transform (which returns a like-indexed version of the group). For example:

>>> df
   col1  col2
0     1     1
1     1     3
2     2     4
3     3     6
4     1     2
5     3     5
>>> df["col3"] = df.groupby("col1")["col2"].transform(np.random.permutation)
>>> df
   col1  col2  col3
0     1     1     2
1     1     3     1
2     2     4     4
3     3     6     5
4     1     2     3
5     3     5     6

Note that the values are only shuffled within their col1 groups.
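
If you need the shuffle to be reproducible, the same groupby/transform idea can be sketched with a seeded generator. This is only a minimal sketch and not part of the answer above: the helper name `shuffle_within_groups`, the seed value, and the choice of `np.random.default_rng` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def shuffle_within_groups(df, group_col, value_col, seed=None):
    """Return value_col shuffled within each group_col group (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seeded generator, so results can be repeated
    # transform keeps the original index, so each shuffled value lands
    # on a row that belongs to the same group
    return df.groupby(group_col)[value_col].transform(
        lambda s: rng.permutation(s.to_numpy())
    )


df = pd.DataFrame({"col1": [1, 1, 2, 3, 1, 3],
                   "col2": [1, 3, 4, 6, 2, 5]})
df["col3"] = shuffle_within_groups(df, "col1", "col2", seed=0)
print(df)
```

The row order and col2 are left untouched; only col3 holds the per-group permutation.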

DSM answered Dec 14 '22 15:12


You can also use groupby together with sample:

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 3, 1, 3],
                   'col2': [1, 3, 4, 6, 2, 5]})

# Shuffle the rows within each col1 group, then rebuild a plain 0..n-1 index
df_rand = df.groupby('col1').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)

>>> df.sort_values('col1')
   col1  col2
0     1     1
1     1     3
4     1     2
2     2     4
3     3     6
5     3     5

>>> df_rand
   col1  col2
0     1     2
1     1     3
2     1     1
3     2     4
4     3     6
5     3     5
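
Note that this approach shuffles whole rows and returns them grouped by col1, so the original row order is not preserved. As a hedged alternative, assuming pandas 1.1 or later, the groupby object has its own `sample` method, which avoids the `apply` and the lambda. The `random_state` value below is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2, 3, 1, 3],
                   "col2": [1, 3, 4, 6, 2, 5]})

# Sample 100% of each group without replacement, i.e. shuffle rows per group.
# The result keeps the original index labels, in shuffled order within each group.
shuffled = df.groupby("col1").sample(frac=1, random_state=0)

# Drop the old labels to get a clean 0..n-1 index, as in df_rand above
df_rand = shuffled.reset_index(drop=True)
print(df_rand)
```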
Alexander answered Dec 14 '22 14:12