What is the pandas way of finding the indices of identical rows within a given DataFrame without iterating over individual rows?
While it is possible to find the duplicated rows with unique = df[df.duplicated()], then iterate over those entries with unique.iterrows() and extract the indices of equal entries with the help of pd.where(), what is the pandas way of doing it?
Example: Given a DataFrame of the following structure:
  | param_a | param_b | param_c
1 |       0 |       0 |       0
2 |       0 |       2 |       1
3 |       2 |       1 |       1
4 |       0 |       2 |       1
5 |       2 |       1 |       1
6 |       0 |       0 |       0
Output:
[(1, 6), (2, 4), (3, 5)]
The pandas.DataFrame.duplicated() method finds duplicate rows in a DataFrame. It returns a boolean Series that marks duplicated rows as True; the keep parameter controls whether all duplicates are marked, all except the first occurrence, or all except the last occurrence.
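For reference, a minimal sketch of how the keep parameter changes which rows are flagged, using the example frame from the question:

import pandas as pd

df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

print(df.duplicated())            # keep='first' (default): only later occurrences are True
print(df.duplicated(keep=False))  # every member of a duplicate group is True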
Use duplicated with keep=False to select all duplicated rows, then groupby by all columns, convert the index values of each group to a tuple, and finally convert the output Series to a list:
df = df[df.duplicated(keep=False)]   # keep only rows that occur more than once
out = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()
print(out)
[(1, 6), (2, 4), (3, 5)]
If you also want to see the duplicated values:
df1 = (df.groupby(df.columns.tolist())
.apply(lambda x: tuple(x.index))
.reset_index(name='idx'))
print(df1)
param_a param_b param_c idx
0 0 0 0 (1, 6)
1 0 2 1 (2, 4)
2 2 1 1 (3, 5)
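As a side note, one possible variation of the same idea avoids apply entirely: the groupby object exposes its group-to-index mapping directly via the groups attribute, so the tuples can be built from that dict (a minimal sketch; note that the output order depends on pandas' default sorting of group keys):

dupes = df[df.duplicated(keep=False)]
groups = dupes.groupby(list(dupes)).groups   # dict: row values -> Index of matching labels
print([tuple(idx) for idx in groups.values()])
# [(1, 6), (2, 4), (3, 5)]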
Approach #1
Here's one vectorized approach inspired by this post -
import numpy as np

def group_duplicate_index(df):
    a = df.values
    sidx = np.lexsort(a.T)                   # sort rows lexicographically across all columns
    b = a[sidx]
    # mark positions where a sorted row equals its predecessor
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])    # start/stop boundaries of each run of equal rows
    I = df.index[sidx].tolist()
    return [I[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
Sample run -
In [42]: df
Out[42]:
param_a param_b param_c
1 0 0 0
2 0 2 1
3 2 1 1
4 0 2 1
5 2 1 1
6 0 0 0
In [43]: group_duplicate_index(df)
Out[43]: [[1, 6], [3, 5], [2, 4]]
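To see how the run-detection inside group_duplicate_index works, here is a minimal sketch on a toy, already-sorted array (names are illustrative):

import numpy as np

b = np.array([[0, 0], [0, 0], [1, 2], [3, 4], [3, 4]])   # sorted rows
m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
print(m)                                 # [False  True False False  True False]
print(np.flatnonzero(m[1:] != m[:-1]))   # [0 1 3 4] -> runs span slices [0:2] and [3:5]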
Approach #2
For DataFrames of non-negative integers, we could reduce each row to a single scalar, which lets us work with a 1D array and gives us a more performant version, like so -
def group_duplicate_index_v2(df):
    a = df.values
    # encode each row as one scalar in base (max+1); assumes non-negative integers
    s = (a.max() + 1) ** np.arange(df.shape[1])
    sidx = a.dot(s).argsort()                # sort rows by their scalar encoding
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])    # boundaries of runs of equal rows
    I = df.index[sidx].tolist()
    return [I[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]
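As a quick sanity check, a minimal sketch (the frame below simply re-creates the example from the question) verifying that both versions agree up to the order of the groups:

import pandas as pd

df = pd.DataFrame({'param_a': [0, 0, 2, 0, 2, 0],
                   'param_b': [0, 2, 1, 2, 1, 0],
                   'param_c': [0, 1, 1, 1, 1, 0]},
                  index=[1, 2, 3, 4, 5, 6])

assert sorted(group_duplicate_index(df)) == sorted(group_duplicate_index_v2(df))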
Runtime test
Other approach(es) -
def groupby_app(df): # @jezrael's soln
df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
return df
Timings -
In [274]: df = pd.DataFrame(np.random.randint(0,10,(100000,3)))
In [275]: %timeit group_duplicate_index(df)
10 loops, best of 3: 36.1 ms per loop
In [276]: %timeit group_duplicate_index_v2(df)
100 loops, best of 3: 15 ms per loop
In [277]: %timeit groupby_app(df) # @jezrael's soln
10 loops, best of 3: 25.9 ms per loop