In pandas, we can drop duplicates with DataFrame.drop_duplicates(), which keeps the first row of each set of duplicates by default. If keep='last' is passed, the last row is kept instead.
How can we keep a random row from each set of duplicates and drop the rest using drop_duplicates()?
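One option is to shuffle the row positions with NumPy before dropping duplicates:
import numpy as np
idx = np.random.permutation(len(df))  # random ordering of row positions
df.iloc[idx].drop_duplicates()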
A Pythonic way to accomplish this:
df = df.sample(frac=1).drop_duplicates()
Here, sample(frac=1) draws a sample the same size as the dataframe, without replacement, which shuffles the order of all rows. drop_duplicates() then keeps the first row of each set of duplicates in that shuffled order, so the row that survives is effectively chosen at random.
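To make the shuffle-then-drop idea concrete, here is a minimal sketch. The data, the column names "key" and "value", and the helper name drop_duplicates_random are all made up for illustration; passing subset so that duplicates are judged on one column, and passing a seed to sample() via random_state, are assumptions about what you might want rather than part of the answer above.

```python
import pandas as pd

# Hypothetical example data: "key" repeats, "value" differs per row.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

def drop_duplicates_random(frame, subset=None, seed=None):
    """Shuffle all rows, then keep the first row of each set of duplicates.

    Hypothetical helper, not a pandas function. `seed` is forwarded to
    sample(random_state=...) so the shuffle can be reproduced.
    """
    shuffled = frame.sample(frac=1, random_state=seed)
    return shuffled.drop_duplicates(subset=subset)

# One row per key survives; which one depends on the shuffle.
result = drop_duplicates_random(df, subset="key", seed=0)
print(result)
```

With seed=None the surviving rows can change from run to run; fixing the seed only makes a particular random choice repeatable.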
If you need to keep the index in sequential order, you could also reset it:
df = df.sample(frac=1).drop_duplicates().reset_index(drop=True)
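If you would rather keep the original index labels and just restore their order, sort_index() may be an alternative to reset_index(drop=True). The sketch below compares the two on made-up data; the column names and the random_state value are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical example data with duplicated "key" values.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})

# Shuffle, then keep one row per key.
deduped = df.sample(frac=1, random_state=1).drop_duplicates(subset="key")

# Option 1: discard the old labels and renumber from 0.
renumbered = deduped.reset_index(drop=True)

# Option 2: keep the original labels, sorted back into ascending order.
original_order = deduped.sort_index()

print(renumbered)
print(original_order)
```

reset_index(drop=True) loses track of which original rows were kept, while sort_index() preserves that information at the cost of a non-contiguous index.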