 

Find indices of duplicate rows in pandas DataFrame

What is the pandas way of finding the indices of identical rows within a given DataFrame without iterating over individual rows?

While it is possible to select the duplicated rows with dupes = df[df.duplicated()], then iterate over those entries with dupes.iterrows() and extract the indices of equal entries with the help of np.where(), what is the idiomatic pandas way of doing it?

Example: Given a DataFrame of the following structure:

  | param_a | param_b | param_c
1 | 0       | 0       | 0
2 | 0       | 2       | 1
3 | 2       | 1       | 1
4 | 0       | 2       | 1
5 | 2       | 1       | 1
6 | 0       | 0       | 0

Output:

[(1, 6), (2, 4), (3, 5)]
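
For reference, here is a minimal sketch (not part of the original question) that builds the example DataFrame above, assuming pandas is imported as pd:

import pandas as pd

# example data from the question, with the 1-based index shown above
df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=range(1, 7))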
asked Oct 08 '17 by Genius

People also ask

How do you find duplicate index values in a DataFrame?

The Index.duplicated() function indicates duplicate index values. Duplicated values are marked as True in the resulting boolean array. Either all duplicates, all except the first occurrence, or all except the last occurrence can be flagged.
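
A minimal sketch of Index.duplicated and its keep options:

import pandas as pd

idx = pd.Index(['a', 'b', 'a', 'c', 'a'])
print (idx.duplicated())              # keep='first' (default): [False False  True False  True]
print (idx.duplicated(keep='last'))   # [ True False  True False False]
print (idx.duplicated(keep=False))    # flag all occurrences: [ True False  True False  True]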

How do you check if there are duplicate rows in a pandas DataFrame?

The pandas.DataFrame.duplicated() method is used to find duplicate rows in a DataFrame. It returns a boolean Series which identifies whether each row is a duplicate or unique.
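
For the example DataFrame above, a quick sketch; with the default keep='first', every occurrence after the first is flagged:

print (df.duplicated())
# 1    False
# 2    False
# 3    False
# 4     True
# 5     True
# 6     True
# dtype: bool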

Can a pandas Index repeat its elements?

The Index.repeat() function repeats the elements of an Index. It returns a new Index where each element of the current Index is repeated consecutively a given number of times.
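
A minimal sketch of Index.repeat:

import pandas as pd

idx = pd.Index([10, 20, 30])
print (idx.repeat(2))
# Index([10, 10, 20, 20, 30, 30], dtype='int64')   (repr varies by pandas version)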


2 Answers

Use DataFrame.duplicated with keep=False to flag all duplicated rows, then groupby by all columns, convert the index values of each group to a tuple, and last convert the output Series to a list:

# keep every occurrence of a duplicated row, not just the later ones
df = df[df.duplicated(keep=False)]

# group by all columns and collect each group's index labels as a tuple
out = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print (out)
[(1, 6), (2, 4), (3, 5)]

If you also want to see the duplicated values:

df1 = (df.groupby(df.columns.tolist())
       .apply(lambda x: tuple(x.index))
       .reset_index(name='idx'))
print (df1)
   param_a  param_b  param_c     idx
0        0        0        0  (1, 6)
1        0        2        1  (2, 4)
2        2        1        1  (3, 5)
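
As a possible alternative (not part of the original answer), the groupby's .groups mapping gives the same grouping without apply; its values are Index objects, so convert them to tuples if needed:

dupes = df[df.duplicated(keep=False)]
print ([tuple(v) for v in dupes.groupby(list(dupes)).groups.values()])
[(1, 6), (2, 4), (3, 5)]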
answered Sep 19 '22 by jezrael


Approach #1

Here's one vectorized approach, inspired by this post -

import numpy as np

def group_duplicate_index(df):
    a = df.values
    # sort the rows lexicographically so that identical rows become adjacent
    sidx = np.lexsort(a.T)
    b = a[sidx]

    # mark rows equal to their predecessor; pad with False on both ends
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    # each False->True / True->False transition delimits a run of duplicates
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]

Sample run -

In [42]: df
Out[42]: 
   param_a  param_b  param_c
1        0        0        0
2        0        2        1
3        2        1        1
4        0        2        1
5        2        1        1
6        0        0        0

In [43]: group_duplicate_index(df)
Out[43]: [[1, 6], [3, 5], [2, 4]]

Approach #2

For DataFrames of (non-negative) integers, we could reduce each row to a single scalar, which lets us work with a 1D array and gives us a more performant version, like so -

def group_duplicate_index_v2(df):
    a = df.values
    # encode each row as one scalar in base (max value + 1)
    s = (a.max()+1)**np.arange(df.shape[1])
    sidx = a.dot(s).argsort()
    b = a[sidx]

    # same run detection as in approach #1, on the sorted rows
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    I = df.index[sidx].tolist()
    return [I[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
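
To make the scalar reduction concrete, here is a small illustration of my own (not from the answer): with a maximum value of 2 the base is 3, so each row maps to a unique base-3 number, provided all values are non-negative integers; for very large values or many columns the dot product could overflow int64 and distinct rows could collide.

import numpy as np

a = np.array([[0, 2, 1],
              [2, 1, 1],
              [0, 2, 1]])

# base-(max+1) positional encoding: each row collapses to one scalar
s = (a.max()+1)**np.arange(a.shape[1])   # array([1, 3, 9])
print (a.dot(s))                         # [15 14 15] -> rows 0 and 2 are duplicates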

Runtime test

Other approach(es) -

def groupby_app(df): # @jezrael's soln
    df = df[df.duplicated(keep=False)]
    df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
    return df

Timings -

In [274]: df = pd.DataFrame(np.random.randint(0,10,(100000,3)))

In [275]: %timeit group_duplicate_index(df)
10 loops, best of 3: 36.1 ms per loop

In [276]: %timeit group_duplicate_index_v2(df)
100 loops, best of 3: 15 ms per loop

In [277]: %timeit groupby_app(df) # @jezrael's soln
10 loops, best of 3: 25.9 ms per loop
answered Sep 18 '22 by Divakar