This is my dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
                   'B': [0, 2, 3, 4, 5, 6, 7],
                   'C': [[1,4,4,4], [1,4,4,4], [3,4,4,5], [3,4,4,5], [4,4,2,1], [1,2,3,4], [7,8,9,1]]})
I want to drop the duplicate values within each row's list in column C, but not drop duplicate rows.
This is what I hope to get:
pd.DataFrame({'A': [1, 3, 3, 4, 5, 3, 3],
              'B': [0, 2, 3, 4, 5, 6, 7],
              'C': [[1,4], [1,4], [3,4,5], [3,4,5], [4,2,1], [1,2,3,4], [7,8,9,1]]})
If you're using Python 3.7+, you can map with dict.fromkeys and obtain a list from the dictionary keys (the version is relevant since dicts maintain insertion order starting from 3.7):
df['C'] = df.C.map(lambda x: list(dict.fromkeys(x).keys()))
For older Pythons you have collections.OrderedDict:
from collections import OrderedDict
df['C'] = df.C.map(lambda x: list(OrderedDict.fromkeys(x).keys()))
print(df)
A B C
0 1 0 [1, 4]
1 3 2 [1, 4]
2 3 3 [3, 4, 5]
3 4 4 [3, 4, 5]
4 5 5 [4, 2, 1]
5 3 6 [1, 2, 3, 4]
6 3 7 [7, 8, 9, 1]
As mentioned by cs95 in the comments, if we don't need to preserve order we can go with a set for a more concise approach:
df['C'] = df.C.map(lambda x: [*{*x}])
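To see the difference between the two approaches on a standalone list: dict.fromkeys keeps first-seen order on Python 3.7+, while a set makes no order guarantee (a minimal sketch):

```python
vals = [1, 4, 4, 4, 2]

# insertion order preserved (Python 3.7+ dicts)
print(list(dict.fromkeys(vals)))  # [1, 4, 2]

# same unique elements, but a set's iteration order is not guaranteed;
# sorting here just makes the output deterministic for display
print(sorted({*vals}))  # [1, 2, 4]
```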
Since several approaches have been proposed and it is hard to tell how they will perform on large dataframes, it's probably worth benchmarking:
import numpy as np
import perfplot

df = pd.concat([df] * 50000, axis=0).reset_index(drop=True)

perfplot.show(
    setup=lambda n: df.iloc[:int(n)],
    kernels=[
        lambda df: df.C.map(lambda x: list(dict.fromkeys(x).keys())),
        lambda df: df['C'].map(lambda x: pd.factorize(x)[1]),
        lambda df: [np.unique(item) for item in df['C'].values],
        lambda df: df['C'].explode().groupby(level=0).unique(),
        lambda df: df.C.map(lambda x: [*{*x}]),
    ],
    labels=['dict.fromkeys', 'factorize', 'np.unique', 'explode', 'set'],
    n_range=[2**k for k in range(0, 18)],
    xlabel='N',
    equality_check=None,
)
If order is of no importance, you can cast the column to a NumPy array and apply np.unique to each row in a list comprehension (note that np.unique returns the unique values sorted):
import numpy as np
df['C_Unique'] = [np.unique(item) for item in df['C'].values]
print(df)
A B C C_Unique
0 1 0 [1, 4, 4, 4] [1, 4]
1 3 2 [1, 4, 4, 4] [1, 4]
2 3 3 [3, 4, 4, 5] [3, 4, 5]
3 4 4 [3, 4, 4, 5] [3, 4, 5]
4 5 5 [4, 4, 2, 1] [1, 2, 4]
5 3 6 [1, 2, 3, 4] [1, 2, 3, 4]
6 3 7 [7, 8, 9, 1] [1, 7, 8, 9]
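As the last two output rows show, np.unique deduplicates and sorts, so the original order within each list is lost. A quick standalone check:

```python
import numpy as np

# np.unique returns the unique values in sorted order,
# not in the order they first appeared
print(np.unique([7, 8, 9, 1]))  # [1 7 8 9]
```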
Another method is to use explode
and groupby.unique (explode requires pandas 0.25+):
df['CExplode'] = df['C'].explode().groupby(level=0).unique()
A B C C_Unique CExplode
0 1 0 [1, 4] [1, 4] [1, 4]
1 3 2 [1, 4] [1, 4] [1, 4]
2 3 3 [3, 4, 5] [3, 4, 5] [3, 4, 5]
3 4 4 [3, 4, 5] [3, 4, 5] [3, 4, 5]
4 5 5 [4, 2, 1] [1, 2, 4] [4, 2, 1]
5 3 6 [1, 2, 3, 4] [1, 2, 3, 4] [1, 2, 3, 4]
6 3 7 [7, 8, 9, 1] [1, 7, 8, 9] [7, 8, 9, 1]
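The same pattern can be checked on a standalone Series, which makes it easier to see what each step does (a minimal sketch, assuming pandas 0.25+ for explode):

```python
import pandas as pd

s = pd.Series([[1, 4, 4, 4], [4, 4, 2, 1]])

# explode flattens each list into rows indexed by the original row label;
# grouping on that index (level=0) and taking unique() deduplicates
# per row while keeping first-seen order
out = s.explode().groupby(level=0).unique()
print(list(out.iloc[1]))  # [4, 2, 1]
```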
You can use the apply function in pandas (note that set does not preserve the original element order):
df['C'] = df['C'].apply(lambda x: list(set(x)))
map
and factorize
Let's throw one more into the mix.
df['C'].map(pd.factorize).str[1]
0 [1, 4]
1 [1, 4]
2 [3, 4, 5]
3 [3, 4, 5]
4 [4, 2, 1]
5 [1, 2, 3, 4]
6 [7, 8, 9, 1]
Name: C, dtype: object
Or,
df['C'].map(lambda x: pd.factorize(x)[1])
0 [1, 4]
1 [1, 4]
2 [3, 4, 5]
3 [3, 4, 5]
4 [4, 2, 1]
5 [1, 2, 3, 4]
6 [7, 8, 9, 1]
Name: C, dtype: object
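To see why indexing with [1] works here: pd.factorize returns a (codes, uniques) tuple, and uniques holds each value exactly once, in order of first appearance:

```python
import pandas as pd

codes, uniques = pd.factorize([1, 4, 4, 4])
print(list(uniques))  # [1, 4]       - deduplicated, first-appearance order
print(list(codes))    # [0, 1, 1, 1] - position of each element in uniques
```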