Say I have a dataframe of the following form, where rn is the row index:
   | A1 | A2 | A3
-----------------
r1 | x  | 0  | t
r2 | y  | 1  | u
r3 | z  | 1  | v
r4 | x  | 2  | w
r5 | z  | 2  | v
r6 | x  | 2  | w
If I wanted to subset this dataframe such that the column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each value of A2 and drops the rest. For this example, r1 is kept because its value is already unique, and of the duplicated values only r2 and r4 end up in the subset.
What I want is for one of the rows with a duplicated value to be selected at random rather than always the first one. So in this example, for A2 == 1, either r2 or r3 is selected at random, and for A2 == 2, any one of r4, r5 or r6 is selected at random. How would I go about implementing this?
You can use the following syntax to keep unique rows across specific columns in a pandas DataFrame: df = df.drop_duplicates(subset=['col1', 'col2', ...])
One of the easiest ways to shuffle a pandas DataFrame is the sample method. df.sample draws a random selection of rows, so asking for the whole DataFrame with frac=1 returns every row in a random order.
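As a rough illustration of the shuffle described above, here is a small sketch that rebuilds the example dataframe from the question and shuffles it. The variable names and the fixed random_state are my own choices for the demo, not part of the original posts; drop random_state to get a different order on every run.

```python
import pandas as pd

# Example data copied from the question; r1..r6 are the row labels.
df = pd.DataFrame(
    {"A1": list("xyzxzx"), "A2": [0, 1, 1, 2, 2, 2], "A3": list("tuvwvw")},
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle.
# random_state is optional; it is fixed here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=0)

print(shuffled)
```

The shuffled frame holds the same six rows with their original labels, just in a different order.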
Shuffle the DataFrame first and then drop the duplicates:
df.sample(frac=1).drop_duplicates(subset='A2')
If the order of the rows is important, you can restore it with sort_index, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:
df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
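For completeness, here is a self-contained sketch of the approach above applied to the dataframe from the question. The helper name random_unique and its seed parameter are hypothetical, introduced only for this example; the pandas calls are the same ones used in the answer.

```python
import pandas as pd

# Example data copied from the question; r1..r6 are the row labels.
df = pd.DataFrame(
    {"A1": list("xyzxzx"), "A2": [0, 1, 1, 2, 2, 2], "A3": list("tuvwvw")},
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)


def random_unique(frame, column, seed=None):
    """Keep one randomly chosen row per distinct value in `column`.

    Hypothetical helper: shuffle, drop duplicates (which keeps the first
    row seen per value, now a random one), then sort by index label.
    """
    shuffled = frame.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(subset=column).sort_index()


subset = random_unique(df, "A2")
print(subset)
```

r1 always appears in the result because it is the only row with A2 == 0. The rows kept for A2 == 1 and A2 == 2 vary between runs unless a seed is passed.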