Group duplicate column IDs in pandas dataframe

Tags: python, pandas, dataframe, duplicates, numpy

There are many similar questions, but most of them explain how to delete duplicate columns. Instead, I want to build a list of tuples where each tuple contains the names of columns that are duplicates of each other. I am assuming that each column has a unique name. To illustrate my question:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5],'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2],'F': [1, 1, 1, 1, 1]},
                   index = ['a1', 'a2', 'a3', 'a4', 'a5'])

then I want the output:

[('A', 'C'), ('B', 'D')]

And if you are feeling generous today, please extend the same question to rows: how can I get a list of tuples where each tuple contains the labels of duplicate rows?

asked Jul 09 '17 15:07

PallavBakshi

2 Answers

Here's one NumPy approach -

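import numpy as np

def group_duplicate_cols(df):
    a = df.values
    # Sort the columns so that identical columns end up next to each other
    sidx = np.lexsort(a)
    b = a[:,sidx]

    # Mark each sorted column that equals its left neighbour, padded with False
    m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    # Positions where the mask flips delimit the runs of identical columns
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]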

Sample runs -

In [100]: df
Out[100]: 
    A  B  C  D  E  F
a1  1  2  1  2  3  1
a2  2  4  2  4  4  1
a3  3  2  3  2  2  1
a4  4  1  4  1  1  1
a5  5  9  5  9  2  1

In [101]: group_duplicate_cols(df)
Out[101]: [['A', 'C'], ['B', 'D']]

# Let's add one more duplicate into group containing 'A'
In [102]: df.F = df.A

In [103]: group_duplicate_cols(df)
Out[103]: [['A', 'C', 'F'], ['B', 'D']]
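To see how the mask and the flip positions turn into groups, here is a minimal trace on a tiny array whose columns are already sorted. The array `b` and the name `same_as_prev` are made up for illustration; the steps mirror the body of `group_duplicate_cols`.

```python
import numpy as np

# Columns already in sorted order, as they would be after a[:, sidx].
# Columns 1-2 are identical, and so are columns 3-5.
b = np.array([[0, 1, 1, 2, 2, 2, 3],
              [5, 6, 6, 7, 7, 7, 8]])

# True where a column equals its left neighbour.
same_as_prev = (b[:, 1:] == b[:, :-1]).all(0)

# Pad with False so that every run of True values has a start and an end.
m = np.concatenate(([False], same_as_prev, [False]))

# Positions where the mask flips: even entries open a run, odd entries close it.
idx = np.flatnonzero(m[1:] != m[:-1])

# Each (start, stop) pair is a slice into the sorted column order.
runs = [(int(i), int(j)) for i, j in zip(idx[::2], idx[1::2] + 1)]
print(runs)  # [(1, 3), (3, 6)]
```

The two runs touch at position 3 but stay separate, because the comparison between columns 2 and 3 is False and closes the first run before the second one opens.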

To do the same for rows (the index), we just need to switch the operations to the other axis, like so -

def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)
    b = a[sidx]

    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]

Sample run -

In [260]: df2
Out[260]: 
   a1  a2  a3  a4  a5
A   3   5   3   4   5
B   1   1   1   1   1
C   3   5   3   4   5
D   2   9   2   1   9
E   2   2   2   1   2
F   1   1   1   1   1

In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]
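The question asked for tuples, while these functions return lists, and the groups come out in the sort order of the values rather than of the labels. A small sketch of one way to normalize the result follows; `as_sorted_tuples` is a hypothetical helper, not part of the answer, and `df2` is rebuilt from the values shown in the sample run.

```python
import numpy as np
import pandas as pd

def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)
    b = a[sidx]
    m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False]))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i, j in zip(idx[::2], idx[1::2] + 1)]

# Same values as df2 in the sample run.
df2 = pd.DataFrame(
    [[3, 5, 3, 4, 5],
     [1, 1, 1, 1, 1],
     [3, 5, 3, 4, 5],
     [2, 9, 2, 1, 9],
     [2, 2, 2, 1, 2],
     [1, 1, 1, 1, 1]],
    index=list("ABCDEF"),
    columns=["a1", "a2", "a3", "a4", "a5"],
)

# Hypothetical helper: tuples instead of lists, sorted by label.
def as_sorted_tuples(groups):
    return sorted(tuple(sorted(g)) for g in groups)

row_groups = as_sorted_tuples(group_duplicate_rows(df2))
print(row_groups)  # [('A', 'C'), ('B', 'F')]
```

The same helper should work unchanged on the output of `group_duplicate_cols`.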

Benchmarking

Approaches -

# @John Galt's soln-1
from itertools import combinations
def combinations_app(df):
    return [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]

# @Abdou's soln
def pandas_groupby_app(df):
    return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]

# @COLDSPEED's soln
def triu_app(df):
    c = df.columns.tolist()
    i, j = np.triu_indices(len(c), 1)
    x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
    return x

# @cmaher's soln
def lambda_set_app(df):
    return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))

Note: @John Galt's soln-2 wasn't included because, with inputs of size (8000,500), the proposed broadcasting would build an intermediate boolean array of shape (500,500,8000) and blow up memory.
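As a rough back-of-the-envelope check of that memory cost, the sketch below computes the shape and size of the intermediate array that `dftv == dftv[:, None]` would produce for the benchmark input. It assumes one byte per boolean and ignores any further temporaries.

```python
import math
import numpy as np

n_rows, n_cols = 8000, 500

# dftv has shape (n_cols, n_rows); dftv[:, None] has shape (n_cols, 1, n_rows).
# Comparing them broadcasts to (n_cols, n_cols, n_rows).
shape = np.broadcast_shapes((n_cols, n_rows), (n_cols, 1, n_rows))

# One byte per boolean element.
n_bytes = math.prod(shape) * np.dtype(bool).itemsize
print(shape, n_bytes / 1e9, "GB")  # (500, 500, 8000) 2.0 GB
```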

Timings -

In [179]: # Setup inputs with sizes as mentioned in the question
     ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
     ...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
     ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=False)
     ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=False)
     ...: df.iloc[:,idx0] = df.iloc[:,idx1].values
     ...: 

# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop

# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop

# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop

# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop

# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop

Super boost with NumPy's view functionality

By leveraging NumPy's view functionality, which lets us view each group of elements as a single element of one dtype, we can gain a further noticeable performance boost, like so -

def view1D(a): # a is a 2D array
    a = np.ascontiguousarray(a)
    # View each row as a single opaque (void) element spanning all of the row's bytes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def group_duplicate_cols_v2(df):
    a = df.values
    sidx = view1D(a.T).argsort()
    b = a[:,sidx]

    m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
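To make the view trick a little more concrete, here is a small sketch of what `view1D` returns for a toy array (the array `a` is made up for illustration). Each row collapses into one element, and identical rows carry identical bytes, which is all the grouping step needs.

```python
import numpy as np

def view1D(a):
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [1, 2, 3]], dtype=np.int64)

v = view1D(a)
print(v.shape)           # (3,): one element per row
print(v.dtype.itemsize)  # 24: three int64 values of 8 bytes each

# Rows 0 and 2 are identical, so their raw bytes match; row 1 differs.
same_02 = v[0].tobytes() == v[2].tobytes()
same_01 = v[0].tobytes() == v[1].tobytes()
print(same_02, same_01)  # True False
```

Since the elements are compared as raw bytes, the resulting sort order need not match numeric order. That should be fine here, because the grouping only relies on identical columns ending up adjacent.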

Timings -

In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop

In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop

That's roughly a 2.7x speedup over the lexsort version!

answered Oct 02 '22 23:10

Divakar

Here's a one-liner

In [22]: from itertools import combinations

In [23]: [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
Out[23]: [('A', 'C'), ('B', 'D')]
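One thing to keep in mind with the pairwise approach: when three or more columns are identical, it reports every pair rather than one group. A small sketch, using the question's data with `F` set equal to `A`:

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 2, 3, 4, 5]})

# A, C and F are identical, so all three pairings show up.
pairs = [x for x in combinations(df.columns, 2)
         if (df[x[0]] == df[x[-1]]).all()]
print(pairs)  # [('A', 'C'), ('A', 'F'), ('B', 'D'), ('C', 'F')]
```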

Alternatively, use NumPy broadcasting. Better still, look at Divakar's solution.

In [124]: cols = df.columns

In [125]: dftv = df.T.values

In [126]: cross = pd.DataFrame((dftv == dftv[:, None]).all(-1), cols, cols)

In [127]: cross
Out[127]:
       A      B      C      D      E      F
A   True  False   True  False  False  False
B  False   True  False   True  False  False
C   True  False   True  False  False  False
D  False   True  False   True  False  False
E  False  False  False  False   True  False
F  False  False  False  False  False   True

# Only take values from the lower triangle (recent pandas versions expect a boolean mask here)
In [128]: s = cross.where(np.tri(*cross.shape, k=-1, dtype=bool)).unstack()

In [129]: s[s == 1].index.tolist()
Out[129]: [('A', 'C'), ('B', 'D')]