dropping duplicates randomly

Tags:

python

pandas

In pandas we can drop duplicates with DataFrame.drop_duplicates(), which keeps the first row of each set of duplicates by default. With keep='last' (take_last=True in older versions), the last row is kept instead. How can we keep a random row from each set of duplicates and drop the rest using drop_duplicates()?
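
To make the default behaviour concrete, here is a minimal sketch on a small hypothetical frame. The column names `key` and `val` are made up for illustration, and `subset="key"` is used only so that the kept row is visible in the output; with fully identical rows the kept rows differ only by index.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share key "a", rows 1 and 3 share key "b".
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})

# Default keep="first": the first row seen for each key survives.
first = df.drop_duplicates(subset="key")

# keep="last": the last row seen for each key survives.
last = df.drop_duplicates(subset="key", keep="last")

print(first["val"].tolist())  # [1, 2]
print(last["val"].tolist())   # [3, 4]
```

Neither option picks a row at random, which is what the question is after.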

Abhishek Thakur asked Apr 04 '14 13:04


2 Answers

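Maybe shuffle the rows first, then drop duplicates:

import numpy as np
idx = np.random.permutation(len(df))  # row positions in random order
df.iloc[idx].drop_duplicates()  # keeps the first row of each duplicate group in the shuffled order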
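
A runnable sketch of the same idea, under a few assumptions that are not in the answer: the sample frame and its `key`/`val` columns are hypothetical, duplicates are judged on `key` only via `subset`, and a seeded NumPy `Generator` is used so the result can be repeated.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows with key "a", three rows with key "b".
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})

rng = np.random.default_rng(0)   # seed 0 is an arbitrary choice for repeatability
idx = rng.permutation(len(df))   # row positions 0..len(df)-1 in random order

# After shuffling, "first" in each duplicate group is a randomly placed row.
picked = df.iloc[idx].drop_duplicates(subset="key")

print(picked.sort_index())
```

The original index labels are preserved, so `sort_index()` restores the original relative order of the surviving rows if that matters.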
behzad.nouri answered Sep 27 '22 21:09


A Pythonic way to accomplish this:

df = df.sample(frac=1).drop_duplicates()

Here, sample(frac=1) draws a sample the same size as the dataframe, without replacement, which shuffles the rows. drop_duplicates() then keeps the first row of each duplicate group in that shuffled order, so the row that survives is effectively chosen at random.

If you need to keep the index in sequential order, you could also reset it:

df = df.sample(frac=1).drop_duplicates().reset_index(drop=True)
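
If this is needed in more than one place, the one-liner can be wrapped in a small helper. The sketch below is only an illustration: the function name `drop_duplicates_randomly`, its `subset` and `seed` arguments, and the sample data are all hypothetical. It relies on the `random_state` parameter of `DataFrame.sample` to make the shuffle repeatable.

```python
import pandas as pd

def drop_duplicates_randomly(frame, subset=None, seed=None):
    """Shuffle the rows, keep the first row of each duplicate group, reset the index."""
    shuffled = frame.sample(frac=1, random_state=seed)  # full-size sample without replacement
    return shuffled.drop_duplicates(subset=subset).reset_index(drop=True)

# Hypothetical data: two rows with key "a", three rows with key "b".
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})

result = drop_duplicates_randomly(df, subset="key", seed=42)
print(result)
```

Passing `seed=None` gives a different random choice on each call, while a fixed seed returns the same rows every time for the same input.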
gelidely answered Sep 27 '22 21:09