Randomly select unique row from dataframe in Pandas

Say I have a dataframe of the following form, where rn (r1, r2, ...) is the row index:

       A1  |  A2 |  A3 
      -----------------
r1     x   |  0  |  t
r2     y   |  1  |  u
r3     z   |  1  |  v
r4     x   |  2  |  w
r5     z   |  2  |  v
r6     x   |  2  |  w

If I wanted to subset this dataframe so that column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each value and deletes the rest. For this example, r1 is kept because its A2 value is already unique, and for the duplicated values only r2 and r4 are kept.

What I want is for one of the rows sharing a duplicate value to be selected at random, rather than always the first. So for this example, for A2 == 1 either r2 or r3 is selected at random, and for A2 == 2 any one of r4, r5 or r6 is selected at random. How would I go about implementing this?

HMK asked Nov 13 '17 19:11



1 Answer

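Shuffle the DataFrame first and then drop the duplicates. Because drop_duplicates keeps the first occurrence of each value, the row that survives after shuffling is a random one: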

df.sample(frac=1).drop_duplicates(subset='A2')

If the order of the rows is important, you can sort the result by its index with sort_index, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:

df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
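
For reference, here is a minimal, self-contained sketch of the shuffle-then-drop approach applied to the dataframe from the question. The helper name random_unique and its seed argument are hypothetical and only there for illustration; random_state is the existing parameter of DataFrame.sample that makes the shuffle reproducible.

```python
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)


def random_unique(frame, column, seed=None):
    """Keep one randomly chosen row per distinct value in `column`.

    Hypothetical helper: `seed` is passed to `sample` as `random_state`
    so the result can be reproduced; leave it as None for a fresh draw.
    """
    # Shuffle all rows, so the "first" row of each value is a random one.
    shuffled = frame.sample(frac=1, random_state=seed)
    # drop_duplicates keeps the first occurrence, then sort by index labels.
    return shuffled.drop_duplicates(subset=column).sort_index()


subset = random_unique(df, "A2", seed=0)
print(subset)
```

The result always contains r1, one of r2 or r3, and one of r4, r5 or r6. Which rows are picked depends on the seed, so calling the helper without a seed gives a different selection from run to run.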
ayhan answered Sep 28 '22 08:09