dropping duplicates randomly

Question

In pandas we can drop duplicates with DataFrame.drop_duplicates(), which keeps the first row of each set of duplicates by default. With keep='last', the last row is kept instead. How can we keep a random row from each set of duplicates and drop the rest using drop_duplicates?

behzad.nouri · Accepted Answer

maybe:

import numpy as np
idx = np.random.permutation(len(df))  # random ordering of the row positions
df.iloc[idx].drop_duplicates()  # the "first" row kept is now a random one
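
To make the idea concrete, here is a minimal sketch on a made-up frame. The column names key and value, the seed, and the use of subset="key" are my assumptions, not part of the question. With no subset, drop_duplicates only removes rows that are identical in every column, so the random choice would affect nothing but which index label survives.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: "key" has duplicates, "value" differs per row.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

# A seeded generator makes the shuffle repeatable; the seed itself is arbitrary.
rng = np.random.default_rng(0)
idx = rng.permutation(len(df))  # random ordering of the row positions

# After shuffling, the "first" row within each key is a randomly chosen one.
deduped = df.iloc[idx].drop_duplicates(subset="key")
```

Each key appears once in deduped, and the surviving rows keep their original index labels.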

gelidely · Answer

A Pythonic way to accomplish this:

df = df.sample(frac=1).drop_duplicates()

Here we take a sample equal to the full size of the DataFrame, without replacement. This shuffles all the rows, so the "first" row that drop_duplicates keeps from each set of duplicates is effectively a random one.
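
A small sketch of why this works, assuming a toy frame with a duplicated key column (the names and the random_state value are illustrative only): sample(frac=1) returns every row exactly once in a new order, and drop_duplicates then keeps whichever row of each key happens to come first.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6],
})

# frac=1 without replacement returns every row exactly once, in random order.
# random_state is optional; it is fixed here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=42)

# Same index labels as the original, just reordered.
same_rows = sorted(shuffled.index) == list(df.index)

# Keep the first row per key in the shuffled order, i.e. a random one.
picked = shuffled.drop_duplicates(subset="key")
```

Dropping random_state gives a different pick on each run, which is probably what you want outside of tests.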

If you need to keep the index in sequential order, you could also reset it:

df = df.sample(frac=1).drop_duplicates().reset_index(drop=True)
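
Note that reset_index(drop=True) renumbers the rows but leaves them in shuffled order. If what you want is the surviving rows back in their original order with their original labels, sort_index() may fit better. A hedged sketch of both options on made-up data (column names and seed are assumptions):

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "c"],
    "value": [10, 20, 30, 40, 50],
})

picked = df.sample(frac=1, random_state=1).drop_duplicates(subset="key")

# Option 1: discard the old labels and renumber 0..n-1 (rows stay shuffled).
renumbered = picked.reset_index(drop=True)

# Option 2: keep the original labels and restore the original row order.
original_order = picked.sort_index()
```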

Tags:

python

pandas

Abhishek Thakur
