Drop duplicate list elements in column of lists

Tags: python, pandas, set, drop-duplicates

This is my dataframe:

import pandas as pd

df = pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
                   'B':[0, 2, 3, 4, 5, 6, 7],
                   'C':[[1,4,4,4], [1,4,4,4], [3,4,4,5], [3,4,4,5], [4,4,2,1], [1,2,3,4,], [7,8,9,1]]})

I want to drop the duplicate values inside each list in column C (that is, get the set of values per row), without dropping duplicate rows.

This is what I hope to get:

pd.DataFrame({'A':[1, 3, 3, 4, 5, 3, 3],
              'B':[0, 2, 3, 4, 5, 6, 7],
              'C':[[1,4], [1,4], [3,4,5], [3,4,5], [4,2,1], [1,2,3,4,], [7,8,9,1]]})


asked Jul 13 '20 08:07

matan

4 Answers

If you're using Python 3.7 or later, you could map with dict.fromkeys and build a list from the dictionary keys (the version matters because dict insertion order is guaranteed from 3.7 onward):

df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))

For older Python versions you have collections.OrderedDict:

from collections import OrderedDict
df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))

print(df)

   A  B             C
0  1  0        [1, 4]
1  3  2        [1, 4]
2  3  3     [3, 4, 5]
3  4  4     [3, 4, 5]
4  5  5     [4, 2, 1]
5  3  6  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]
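To see what the lambda does to a single row, here is a minimal plain-Python sketch. The helper name dedupe_keep_order is hypothetical and not part of the answer; note that calling .keys() is optional, since iterating a dict already yields its keys.

```python
# Plain-Python sketch of the per-row step (no pandas needed).
# `dedupe_keep_order` is a hypothetical helper name used only for illustration.
def dedupe_keep_order(values):
    """Return the unique items of `values`, keeping first-occurrence order."""
    # dict keys are unique, and dicts keep insertion order in Python 3.7+
    return list(dict.fromkeys(values))

row = [4, 4, 2, 1]
print(dedupe_keep_order(row))  # [4, 2, 1]

# Same result as the longer spelling used in the answer
print(dedupe_keep_order(row) == list(dict.fromkeys(row).keys()))  # True
```

The same function could be passed directly to map, e.g. df['C'].map(dedupe_keep_order).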

As mentioned by cs95 in the comments, if we don't need to preserve order we could go with a set for a more concise approach:

df['C'] = df.C.map(lambda x: [*{*x}])
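As a rough illustration of the trade-off, the sketch below runs both variants on a small made-up frame. It only checks that each row ends up with the same elements, because the order produced by the set version should not be relied on.

```python
import pandas as pd

# Small made-up frame, just for this comparison
df = pd.DataFrame({'C': [[4, 4, 2, 1], [3, 4, 4, 5]]})

unordered = df['C'].map(lambda x: [*{*x}])               # set-based, order not guaranteed
ordered = df['C'].map(lambda x: list(dict.fromkeys(x)))  # first-occurrence order

# Both contain the same elements per row; only the ordering may differ
same_elements = [sorted(u) == sorted(o) for u, o in zip(unordered, ordered)]
print(same_elements)     # [True, True]
print(ordered.tolist())  # [[4, 2, 1], [3, 4, 5]]
```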

Since several approaches have been proposed and it is hard to tell how they will perform on large dataframes, it is probably worth benchmarking:

import numpy as np
import pandas as pd
import perfplot

# Repeat the sample frame so there are enough rows for the largest N
df = pd.concat([df]*50000, axis=0).reset_index(drop=True)

perfplot.show(
    # Each run gets the first n rows of the enlarged frame
    setup=lambda n: df.iloc[:int(n)],

    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],

    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    # The kernels return different types and orderings, so skip the equality check
    equality_check=None
)

[Figure: perfplot benchmark of the dict.fromkeys, factorize, np.unique, explode and set approaches; x-axis: N]
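If perfplot is not installed, a rough comparison can also be sketched with the standard library's timeit. This is not the benchmark from the answer: the frame size, the repeats variable and the candidates dictionary are assumptions made for illustration, and the absolute numbers depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Small hypothetical frame; raise `repeats` for a more meaningful comparison
repeats = 1000
df = pd.DataFrame({'C': [[1, 4, 4, 4], [3, 4, 4, 5], [4, 4, 2, 1]] * repeats})

candidates = {
    'dict.fromkeys': lambda: df['C'].map(lambda x: list(dict.fromkeys(x))),
    'set': lambda: df['C'].map(lambda x: [*{*x}]),
    'np.unique': lambda: [np.unique(item) for item in df['C'].values],
    'explode': lambda: df['C'].explode().groupby(level=0).unique(),
}

# Best of 3 runs, one call per run
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

# Print the candidates from fastest to slowest on this machine
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name:>14}: {seconds:.4f} s')
```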


answered Oct 21 '22 12:10

yatu

If order is of no importance, you could take the column's values as a NumPy array and apply np.unique to each row in a list comprehension (np.unique returns the values sorted):

import numpy as np
df['C_Unique'] = [np.unique(item) for item in df['C'].values]

print(df)

   A  B             C      C_Unique
0  1  0  [1, 4, 4, 4]        [1, 4]
1  3  2  [1, 4, 4, 4]        [1, 4]
2  3  3  [3, 4, 4, 5]     [3, 4, 5]
3  4  4  [3, 4, 4, 5]     [3, 4, 5]
4  5  5  [4, 4, 2, 1]     [1, 2, 4]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]
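One detail worth knowing is that np.unique returns a sorted ndarray rather than a list. If plain Python lists are wanted in the column, a small variation (a sketch on made-up data, not part of the answer) is to call .tolist() on each result:

```python
import numpy as np
import pandas as pd

# Made-up two-row frame for illustration
df = pd.DataFrame({'C': [[4, 4, 2, 1], [7, 8, 9, 1]]})

# np.unique returns a sorted ndarray; .tolist() turns it back into a Python list
df['C_Unique'] = [np.unique(item).tolist() for item in df['C'].values]

print(df['C_Unique'].tolist())      # [[1, 2, 4], [1, 7, 8, 9]]
print(type(df.loc[0, 'C_Unique']))  # <class 'list'>
```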

Another method would be to use explode and groupby.unique:

df['CExplode'] = df['C'].explode().groupby(level=0).unique()

   A  B             C      C_Unique      CExplode
0  1  0        [1, 4]        [1, 4]        [1, 4]
1  3  2        [1, 4]        [1, 4]        [1, 4]
2  3  3     [3, 4, 5]     [3, 4, 5]     [3, 4, 5]
3  4  4     [3, 4, 5]     [3, 4, 5]     [3, 4, 5]
4  5  5     [4, 2, 1]     [1, 2, 4]     [4, 2, 1]
5  3  6  [1, 2, 3, 4]  [1, 2, 3, 4]  [1, 2, 3, 4]
6  3  7  [7, 8, 9, 1]  [1, 7, 8, 9]  [7, 8, 9, 1]
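Because groupby(level=0) groups on the index labels, this approach relies on the index being unique. The sketch below, on made-up data, shows rows that share a label being pooled into one group, and how resetting the index avoids that:

```python
import pandas as pd

base = pd.DataFrame({'C': [[1, 4, 4, 4], [4, 4, 2, 1]]})

# Concatenating without resetting leaves index labels 0, 1, 0, 1
stacked = pd.concat([base, base])
merged = stacked['C'].explode().groupby(level=0).unique()
print(len(stacked), len(merged))  # 4 2 -> rows sharing a label were pooled

# With a fresh unique index each row forms its own group
clean = stacked.reset_index(drop=True)
clean['CExplode'] = clean['C'].explode().groupby(level=0).unique()
print(clean['CExplode'].map(list).tolist())
# [[1, 4], [4, 2, 1], [1, 4], [4, 2, 1]]
```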

answered Oct 21 '22 13:10

Umar.H

You can use the apply function in pandas (note that going through a set does not preserve the original order of the elements):

df['C'] = df['C'].apply(lambda x: list(set(x)))

answered Oct 21 '22 12:10

Ashok Krishna

`map` and `factorize`

Let's throw one more into the mix.

df['C'].map(pd.factorize).str[1]

0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object

Or,

df['C'].map(lambda x: pd.factorize(x)[1])

0          [1, 4]
1          [1, 4]
2       [3, 4, 5]
3       [3, 4, 5]
4       [4, 2, 1]
5    [1, 2, 3, 4]
6    [7, 8, 9, 1]
Name: C, dtype: object
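For readers unfamiliar with pd.factorize, it returns a pair (codes, uniques), which is why both variants above pick element [1]. A minimal sketch on a single row follows; converting the list to an ndarray first is an assumption made here so the example does not depend on how a given pandas version handles plain Python lists.

```python
import numpy as np
import pandas as pd

# Converting to an ndarray first is an assumption of this sketch, so it does not
# depend on how a given pandas version treats plain Python lists.
codes, uniques = pd.factorize(np.asarray([4, 4, 2, 1]))

print(codes.tolist())    # [0, 0, 1, 2] -> position of each item in `uniques`
print(uniques.tolist())  # [4, 2, 1]    -> distinct values in order of first appearance
```

Since the uniques come back in order of first appearance, this approach keeps the original order, like dict.fromkeys.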

answered Oct 21 '22 11:10

cs95
