I would like to drop rows from my dataframe based on whether a piece of a string is duplicated within that string. For example, if the string is jkl-ghi-jkl, I would drop the row because jkl appears twice. I figured that creating a list and checking the list for duplicates would be the ideal approach.
My dataframe for this example consists of one column and two rows:
df1 = pd.DataFrame({'Col1' : ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno'],})
My first step is to split my data on "-":
List = df1['Col1'].str.split('-')
List
Which yields the output:
0 [abc, def, ghi, jkl]
1 [jkl, ghi, jkl, mno]
Name: Col1, dtype: object
My second step is to convert this output into a list:
List = List.tolist()
Which yields:
[['abc', 'def', 'ghi', 'jkl'], ['jkl', 'ghi', 'jkl', 'mno']]
My last step is to compare the full list against the set of its distinct values:
len(List) > len(set(List))
Which yields the error:
TypeError: unhashable type: 'list'
I am aware that my .tolist() creates a list of two inner lists, which is why set() fails (lists are unhashable). Is there a way to test each inner list for duplicates? I wish to use this piece of code:
len(List) > len(set(List))
with a drop in order to drop all rows with a duplicated value within each cell.
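In other words, what I'm aiming for is something like this per inner list (a rough sketch of my intent; has_dupe is just an illustrative name):
has_dupe = [len(lst) > len(set(lst)) for lst in List]
# -> [False, True]; the second row is the one I want dropped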
Is this the correct way of approaching, or is there a simpler way?
My end output should look like:
Col1
abc-def-ghi-jkl
Because the string jkl-ghi-jkl-mno gets dropped, since "jkl" appears twice.
Here is another option, using set and len:
df1 = pd.DataFrame({'Col1' : ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno'],})
df1['length'] = df1['Col1'].str.split('-').apply(set).apply(len)
print(df1)
              Col1  length
0  abc-def-ghi-jkl       4
1  jkl-ghi-jkl-mno       3
Every string here splits into four pieces, so a set of length 4 means no piece is repeated, and those are the rows to keep:
df1 = df1.loc[df1['length'] == 4]
print(df1)
              Col1  length
0  abc-def-ghi-jkl       4
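If the strings don't all split into the same number of pieces, a small variation in the same spirit (a sketch, not part of the original answer) is to compare each row's number of pieces against its number of unique pieces:
import pandas as pd

df1 = pd.DataFrame({'Col1': ['abc-def-ghi-jkl', 'jkl-ghi-jkl-mno']})
# Keep rows whose pieces are all distinct, whatever their count.
parts = df1['Col1'].str.split('-')
no_repeats = parts.apply(len) == parts.apply(lambda x: len(set(x)))
df1 = df1.loc[no_repeats]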
You can combine str.split with duplicated to get a Boolean indexer:
# Get a Boolean indexer for duplicates.
dupe_rows = df1['Col1'].str.split('-', expand=True)
dupe_rows = dupe_rows.apply(lambda row: row.duplicated().any(), axis=1)
# Remove the duplicates.
df1 = df1[~dupe_rows]
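For the two-row frame from the question this marks only the second row, so the result (shown here just for illustration) is:
print(dupe_rows)
# 0    False
# 1     True
# dtype: bool
print(df1)
#               Col1
# 0  abc-def-ghi-jkl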
Edit
Another option is to use toolz.isdistinct in a similar manner as the other answers:
import toolz
df1[df1.Col1.str.split('-').apply(toolz.isdistinct)]
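toolz.isdistinct simply reports whether every element of a sequence is unique, so this keeps exactly the rows without repeats, e.g.:
import toolz
toolz.isdistinct(['abc', 'def', 'ghi', 'jkl'])  # True
toolz.isdistinct(['jkl', 'ghi', 'jkl', 'mno'])  # False, 'jkl' appears twice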
Split 'Col1' and apply a repeat checker using an efficient numpy algorithm:
import numpy as np

def nerpt(lst):
    # Compare every pair of elements via the upper-triangle index pairs;
    # any match means the list contains a repeat.
    ti = np.triu_indices(len(lst), 1)
    a = np.array(lst)
    return (a[ti[0]] == a[ti[1]]).any()

df1[~df1.Col1.str.split('-').apply(nerpt)]
Pretty clear that using set is most efficient; this is reflective of @Luis's answer. Timings use pd.concat([df1 for _ in range(10000)]) and the following repeat checkers:
rpt1 = lambda lst: not pd.Index(lst).is_unique
rpt2 = lambda lst: len(lst) != len(set(lst))
rpt3 = nerpt
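The timing run itself isn't reproduced above; a minimal sketch of how the three checkers could be compared (illustrative only, using timeit, assuming df1 is the original two-row frame and nerpt is defined as above) is:
import timeit

big = pd.concat([df1 for _ in range(10000)])
split = big.Col1.str.split('-')

for name, rpt in [('pd.Index', rpt1), ('set', rpt2), ('numpy', rpt3)]:
    print(name, timeit.timeit(lambda: split.apply(rpt), number=10))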