I am having trouble using duplicated or drop_duplicates to find and remove all duplicates from a DataFrame.
My data looks something like this; the real data, however, is 52k rows long.
data = {'inventory number':['WL-SMART-INWALL',
'WL-NMDISH-22',
'WL-MPS546-MESH',
'WAS-WG-500P',
'UKS/99757/69975',
'UKS/99757/69975',
'UKS/99750/S26361F2293L10',
'UKS/99750/S26361F2293L10',
'UKS/99733/69973',
'UKS/99733/69973',
'UKS/99727/AHD6502TU3CBK',
'UKS/99727/AHD6502TU3CBK',
'UKS/99725/PMK01',
'UKS/99725/PMK01',
'UKS/99294/A3L791R15MS',
'UKS/99294/A3L791R15MS',
'UKS/98865/58018251',
'UKS/98865/58018251',
'UKS/98509/90Q653AN1N0N2UA0',
'UKS/98509/90Q653AN1N0N2UA0',
'UKS/97771/FIBLCSC2',
'UKS/97771/FIBLCSC2',
'UKS/97627/FIBLCLC1',
'UKS/97627/FIBLCLC1'],
'minimum price': ['36.85',
'55.45',
'361.29',
'265.0',
'22.46',
'22.46',
'15.0',
'15.0',
'26.71',
'26.71',
'104.0',
'104.0',
'32.3',
'32.3',
'22.51',
'22.51',
'13.0',
'13.0',
'9.59',
'9.59',
'15.0',
'15.0',
'15.0',
'15.0'],
'cost':['26.11',
'39.23',
'254.99',
'187.09',
'16.0',
'16.0',
'10.7',
'10.7',
'19.0',
'19.0',
'73.46',
'73.46',
'23.0',
'23.0',
'16.0',
'16.0',
'9.29',
'9.29',
'7.0',
'7.0',
'10.7',
'10.7',
'10.7',
'10.7']
}
import pandas as pd

df = pd.DataFrame(data=data)
I generated my DataFrame by appending last week's catalog to the bottom of this week's. I only want to do something with the 'inventory number's that have changed; in other words, I want the delta. I figured I could append the two, make sure they were the same data type, re-index, and drop duplicates, but when I write to CSV to QA, there are still thousands of duplicates.
Here is my code:
# Normalize the key column in both frames (note: this turns the str values into bytes)
_import['inventory number'] = _import['inventory number'].str.encode('utf-8')
ts_data['inventory number'] = ts_data['inventory number'].str.encode('utf-8')
# Stack this week's catalog under last week's
overlap = overlap.append(ts_data, ignore_index=True)
# Keep only rows whose key columns appear exactly once (the delta)
overlap_dedupe = overlap[overlap.duplicated(['inventory number','minimum price','cost'],keep=False)==False]
I have also tried:
overlap_dedupe = overlap.drop_duplicates(keep=False)
So I know I'm running into some kind of encoding issue, because now I'm getting no duplicates at all.
combined.head(50).duplicated()
returns:
42736 False
32567 False
43033 False
33212 False
46592 False
46023 False
32568 False
33520 False
32756 False
26741 False
46133 False
42737 False
42480 False
40227 False
40562 False
49623 False
27712 False
31848 False
49794 False
27296 False
38198 False
35674 False
27907 False
22210 False
40563 False
18025 False
49624 False
18138 False
19357 False
43698 False
24398 False
50566 False
22276 False
38382 False
20507 False
43550 False
18150 False
29968 False
19247 False
47706 False
19248 False
43955 False
20731 False
38199 False
44168 False
17580 False
15944 False
44891 False
28327 False
16027 False
dtype: bool
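For reference, here is a minimal sanity check I can run (frame names as in my code above; the whitespace theory is only a guess):
# Are the values in the key column even the same Python type?  After
# .str.encode('utf-8') they are bytes, not str, so a frame that skipped
# that step will never compare equal to one that didn't.
print(overlap['inventory number'].map(type).value_counts())

# Hidden whitespace is another possible culprit: strip the raw string
# values in both frames before encoding and appending.
_import['inventory number'] = _import['inventory number'].str.strip()
ts_data['inventory number'] = ts_data['inventory number'].str.strip()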
drop_duplicates and duplicated are sister functions that work well together.
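To see the relationship directly: df.drop_duplicates() keeps exactly the rows that df.duplicated() marks False. A minimal sketch on a throwaway frame:
import pandas as pd

toy = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# drop_duplicates() is equivalent to masking out the duplicated() rows
assert toy.drop_duplicates().equals(toy[~toy.duplicated()])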
Using your df:
df = pd.read_json(
''.join(
['[[26.11,"WL-SMART-INWALL",36.85],[39.23,"WL-NMDISH-22",55.45',
'],[73.46,"UKS\\/99727\\/AHD6502TU3CBK",104.0],[73.46,"UKS\\/997',
'27\\/AHD6502TU3CBK",104.0],[23.0,"UKS\\/99725\\/PMK01",32.3],[2',
'3.0,"UKS\\/99725\\/PMK01",32.3],[16.0,"UKS\\/99294\\/A3L791R15MS',
'",22.51],[16.0,"UKS\\/99294\\/A3L791R15MS",22.51],[9.29,"UKS\\/',
'98865\\/58018251",13.0],[9.29,"UKS\\/98865\\/58018251",13.0],[7',
'.0,"UKS\\/98509\\/90Q653AN1N0N2UA0",9.59],[7.0,"UKS\\/98509\\/90',
'Q653AN1N0N2UA0",9.59],[254.99,"WL-MPS546-MESH",361.29],[10.7',
',"UKS\\/97771\\/FIBLCSC2",15.0],[10.7,"UKS\\/97771\\/FIBLCSC2",1',
'5.0],[10.7,"UKS\\/97627\\/FIBLCLC1",15.0],[10.7,"UKS\\/97627\\/F',
'IBLCLC1",15.0],[187.09,"WAS-WG-500P",265.0],[16.0,"UKS\\/9975',
'7\\/69975",22.46],[16.0,"UKS\\/99757\\/69975",22.46],[10.7,"UKS',
'\\/99750\\/S26361F2293L10",15.0],[10.7,"UKS\\/99750\\/S26361F229',
'3L10",15.0],[19.0,"UKS\\/99733\\/69973",26.71],[19.0,"UKS\\/997',
'33\\/69973",26.71]]']
)
)
We can clearly see that there are duplicates with
df.duplicated()
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 True
10 False
11 True
12 False
13 False
14 True
15 False
16 True
17 False
18 False
19 True
20 False
21 True
22 False
23 True
dtype: bool
Because we did not pass the keep parameter, we got the default, which is keep='first'. This implies that each True in this series indicates a row that is a duplicate of another row above it whose duplicated status is False.
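The keep parameter controls which occurrences get flagged; a quick sketch of the three options on a two-row duplicate:
import pandas as pd

pair = pd.DataFrame({'a': [1, 1]})

print(pair.duplicated(keep='first').tolist())  # [False, True]  all but the first flagged
print(pair.duplicated(keep='last').tolist())   # [True, False]  all but the last flagged
print(pair.duplicated(keep=False).tolist())    # [True, True]   every duplicate flagged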
We can shortcut this and get a simple answer to whether any duplicates exist at all with
df.duplicated().any()
True
We can verify that drop_duplicates actually removed the duplicates by chaining our handy test after invoking drop_duplicates:
df.drop_duplicates().duplicated().any()
False
Great! It worked.
The result can be preserved with an assignment:
df = df.drop_duplicates()
df
0 1 2
0 26.11 WL-SMART-INWALL 36.85
1 39.23 WL-NMDISH-22 55.45
2 73.46 UKS/99727/AHD6502TU3CBK 104.00
4 23.00 UKS/99725/PMK01 32.30
6 16.00 UKS/99294/A3L791R15MS 22.51
8 9.29 UKS/98865/58018251 13.00
10 7.00 UKS/98509/90Q653AN1N0N2UA0 9.59
12 254.99 WL-MPS546-MESH 361.29
13 10.70 UKS/97771/FIBLCSC2 15.00
15 10.70 UKS/97627/FIBLCLC1 15.00
17 187.09 WAS-WG-500P 265.00
18 16.00 UKS/99757/69975 22.46
20 10.70 UKS/99750/S26361F2293L10 15.00
22 19.00 UKS/99733/69973 26.71
Just to make sure:
df.duplicated().any()
False
Conclusion
It worked fine for me. Hopefully this demonstration will help you clear up whatever is going wrong for you.
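Since the end goal is the delta between two weekly catalogs, one way to get only the changed rows looks like this (a sketch: last_week and this_week are assumed names for your two frames, and the whitespace stripping is a guess at your encoding issue):
import pandas as pd

# Assumed names: last_week and this_week stand in for the two catalogs.
last_week = pd.DataFrame({'inventory number': ['A', 'B'],
                          'minimum price': ['1.0', '2.0'],
                          'cost': ['0.5', '1.5']})
this_week = pd.DataFrame({'inventory number': ['A', 'B'],
                          'minimum price': ['1.0', '2.5'],  # B's price changed
                          'cost': ['0.5', '1.5']})

combined = pd.concat([last_week, this_week], ignore_index=True)

# Normalize the key columns so visually identical rows compare equal.
for col in ['inventory number', 'minimum price', 'cost']:
    combined[col] = combined[col].astype(str).str.strip()

# keep=False drops every member of each duplicate group, leaving only
# rows that appear in one week but not the other, i.e. the delta.
delta = combined.drop_duplicates(keep=False)
print(delta)  # both versions of item B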