Say I have a dataframe of the following form, where rn is the row index:
       A1  |  A2 |  A3 
      -----------------
r1     x   |  0  |  t
r2     y   |  1  |  u
r3     z   |  1  |  v
r4     x   |  2  |  w
r5     z   |  2  |  v
r6     x   |  2  |  w
If I wanted to subset this dataframe so that column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each unique value and deletes the rest. For this example, r1 is kept as is, and from the duplicated groups only r2 and r4 will be in the subset.
What I want is for one of the rows with a duplicated value to be selected at random, rather than always the first row. So for this example, for A2 == 1 either r2 or r3 is selected at random, and for A2 == 2 any one of r4, r5 or r6 is selected at random. How would I go about implementing this?
You can use the following syntax to select unique rows across specific columns in a pandas DataFrame: df = df.drop_duplicates(subset=['col1', 'col2', ...])
One of the easiest ways to shuffle a pandas DataFrame is to use the sample method. df.sample returns a random sample of rows, so by asking for the whole frame (frac=1) we get back every row of the DataFrame in a random order.
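As a rough illustration of that shuffle, here is a minimal sketch using the example frame from the question. The random_state value is an arbitrary seed added here only to make the result repeatable; it is not part of the original question.

```python
import pandas as pd

# Example frame from the question; r1..r6 is the row index.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# frac=1 asks for 100% of the rows, so nothing is dropped; the rows
# simply come back in a random order. random_state is an assumed,
# arbitrary seed that makes the shuffle reproducible.
shuffled = df.sample(frac=1, random_state=0)
print(shuffled)
```

The original index labels travel with the rows, which is what makes it possible to restore the original order later.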
Shuffle the DataFrame first and then drop the duplicates:
df.sample(frac=1).drop_duplicates(subset='A2')
If the order of the rows is important you can use sort_index as @cᴏʟᴅsᴘᴇᴇᴅ suggested:
df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
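Putting it together, the following is a small self-contained sketch of the shuffle-then-drop approach on the example data. The helper name drop_duplicates_random and its seed parameter are made up for illustration and are not pandas API.

```python
import pandas as pd

def drop_duplicates_random(df, subset, seed=None):
    """Keep one randomly chosen row per unique value in `subset`.

    Hypothetical helper: shuffles all rows, lets drop_duplicates keep
    the first row it sees for each value, then restores index order.
    """
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(subset=subset).sort_index()

# Example frame from the question; r1..r6 is the row index.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# seed=42 is arbitrary; pass seed=None for a different pick on each call.
result = drop_duplicates_random(df, "A2", seed=42)
print(result)
```

Because drop_duplicates keeps the first occurrence it meets, shuffling beforehand means that "first" row is a random member of each group. Every row in a group has the same chance of being picked.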