 

Python pandas: df.duplicated and df.drop_duplicates not finding all duplicates

Tags:

python

pandas

I am having trouble using duplicated or drop_duplicates to find and remove all duplicates from a dataframe.

My data looks something like this; however, my actual data is 52k rows long.

import pandas as pd

data = {'inventory number':['WL-SMART-INWALL',
                         'WL-NMDISH-22',
                         'WL-MPS546-MESH',
                         'WAS-WG-500P',
                         'UKS/99757/69975',
                         'UKS/99757/69975',
                         'UKS/99750/S26361F2293L10',
                         'UKS/99750/S26361F2293L10',
                         'UKS/99733/69973',
                         'UKS/99733/69973',
                         'UKS/99727/AHD6502TU3CBK',
                         'UKS/99727/AHD6502TU3CBK',
                         'UKS/99725/PMK01',
                         'UKS/99725/PMK01',
                         'UKS/99294/A3L791R15MS',
                         'UKS/99294/A3L791R15MS',
                         'UKS/98865/58018251',
                         'UKS/98865/58018251',
                         'UKS/98509/90Q653AN1N0N2UA0',
                         'UKS/98509/90Q653AN1N0N2UA0',
                         'UKS/97771/FIBLCSC2',
                         'UKS/97771/FIBLCSC2',
                         'UKS/97627/FIBLCLC1',
                         'UKS/97627/FIBLCLC1'],
        'minimum price': ['36.85',
                         '55.45',
                         '361.29',
                         '265.0',
                         '22.46',
                         '22.46',
                         '15.0',
                         '15.0',
                         '26.71',
                         '26.71',
                         '104.0',
                         '104.0',
                         '32.3',
                         '32.3',
                         '22.51',
                         '22.51',
                         '13.0',
                         '13.0',
                         '9.59',
                         '9.59',
                         '15.0',
                         '15.0',
                         '15.0',
                         '15.0'],
    'cost':['26.11',
                         '39.23',
                         '254.99',
                         '187.09',
                         '16.0',
                         '16.0',
                         '10.7',
                         '10.7',
                         '19.0',
                         '19.0',
                         '73.46',
                         '73.46',
                         '23.0',
                         '23.0',
                         '16.0',
                         '16.0',
                         '9.29',
                         '9.29',
                         '7.0',
                         '7.0',
                         '10.7',
                         '10.7',
                         '10.7',
                         '10.7']
   }
df = pd.DataFrame(data=data)

I generated my dataframe by appending last week's catalog to the bottom of this week's. I only want to act on the 'inventory number's that have changed; in other words, I want the delta. I figured I could append the two, make sure they were the same data type, re-index, and drop duplicates, but when I write to CSV to QA the result, there are still thousands of duplicates.
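The approach described above (append the two weekly catalogs, then keep only rows that appear exactly once) can be sketched like this; the frame contents and variable names are illustrative, not taken from the real data:

```python
import pandas as pd

# Two hypothetical weekly catalogs; only item 'B' changed price.
last_week = pd.DataFrame({'inventory number': ['A', 'B', 'C'],
                          'minimum price': ['1.00', '2.00', '3.00']})
this_week = pd.DataFrame({'inventory number': ['A', 'B', 'C'],
                          'minimum price': ['1.00', '2.50', '3.00']})

combined = pd.concat([last_week, this_week], ignore_index=True)

# keep=False drops every member of a duplicate group, leaving only
# rows that differ between the two weeks, i.e. the delta.
delta = combined.drop_duplicates(keep=False)
print(delta)
```

Here the delta is the two 'B' rows, one with each week's price.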

Here is my code:

_import['inventory number'] = _import['inventory number'].str.encode('utf-8')
ts_data['inventory number'] = ts_data['inventory number'].str.encode('utf-8')
overlap = overlap.append(ts_data, ignore_index=True)
overlap_dedupe = overlap[overlap.duplicated(['inventory number','minimum price','cost'],keep=False)==False]

I have also tried:

overlap_dedupe = overlap.drop_duplicates(keep=False)

So, I know I'm running into some kind of encoding issue, because now I'm getting no duplicates at all.

combined.head(50).duplicated()

returns:

42736    False
32567    False
43033    False
33212    False
46592    False
46023    False
32568    False
33520    False
32756    False
26741    False
46133    False
42737    False
42480    False
40227    False
40562    False
49623    False
27712    False
31848    False
49794    False
27296    False
38198    False
35674    False
27907    False
22210    False
40563    False
18025    False
49624    False
18138    False
19357    False
43698    False
24398    False
50566    False
22276    False
38382    False
20507    False
43550    False
18150    False
29968    False
19247    False
47706    False
19248    False
43955    False
20731    False
38199    False
44168    False
17580    False
15944    False
44891    False
28327    False
16027    False
dtype: bool
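A common reason duplicated misses rows that look identical is an invisible difference, such as trailing whitespace or a mix of str and bytes values (which str.encode can introduce). A defensive normalization step, sketched here on made-up data, might look like:

```python
import pandas as pd

# Two rows that look identical but differ by a trailing space.
df = pd.DataFrame({'inventory number': ['UKS/99757/69975',
                                        'UKS/99757/69975 ']})

print(df.duplicated().any())   # False: the trailing space defeats the match

# Coerce everything to str and strip surrounding whitespace
# before comparing rows.
df['inventory number'] = (df['inventory number']
                          .astype(str)
                          .str.strip())

print(df.duplicated().any())   # True: the rows now compare equal
```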
Yale Newman asked Oct 30 '22 15:10


1 Answer

drop_duplicates and duplicated


These are sister functions that work well together.

Using your df:

df = pd.read_json(
    ''.join(
        ['[[26.11,"WL-SMART-INWALL",36.85],[39.23,"WL-NMDISH-22",55.45',
         '],[73.46,"UKS\\/99727\\/AHD6502TU3CBK",104.0],[73.46,"UKS\\/997',
         '27\\/AHD6502TU3CBK",104.0],[23.0,"UKS\\/99725\\/PMK01",32.3],[2',
         '3.0,"UKS\\/99725\\/PMK01",32.3],[16.0,"UKS\\/99294\\/A3L791R15MS',
         '",22.51],[16.0,"UKS\\/99294\\/A3L791R15MS",22.51],[9.29,"UKS\\/',
         '98865\\/58018251",13.0],[9.29,"UKS\\/98865\\/58018251",13.0],[7',
         '.0,"UKS\\/98509\\/90Q653AN1N0N2UA0",9.59],[7.0,"UKS\\/98509\\/90',
         'Q653AN1N0N2UA0",9.59],[254.99,"WL-MPS546-MESH",361.29],[10.7',
         ',"UKS\\/97771\\/FIBLCSC2",15.0],[10.7,"UKS\\/97771\\/FIBLCSC2",1',
         '5.0],[10.7,"UKS\\/97627\\/FIBLCLC1",15.0],[10.7,"UKS\\/97627\\/F',
         'IBLCLC1",15.0],[187.09,"WAS-WG-500P",265.0],[16.0,"UKS\\/9975',
         '7\\/69975",22.46],[16.0,"UKS\\/99757\\/69975",22.46],[10.7,"UKS',
         '\\/99750\\/S26361F2293L10",15.0],[10.7,"UKS\\/99750\\/S26361F229',
         '3L10",15.0],[19.0,"UKS\\/99733\\/69973",26.71],[19.0,"UKS\\/997',
         '33\\/69973",26.71]]']
    )
)

We can clearly see that there are duplicates with

df.duplicated()

0     False
1     False
2     False
3      True
4     False
5      True
6     False
7      True
8     False
9      True
10    False
11     True
12    False
13    False
14     True
15    False
16     True
17    False
18    False
19     True
20    False
21     True
22    False
23     True
dtype: bool

Because we did not pass the keep parameter, we get the default, which is keep='first'. This means that each True in this series marks a row that is a duplicate of a row above it whose duplicated status is False.
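If you instead want every member of a duplicate group flagged, which is what the keep=False in the question's code asks for, the mask changes accordingly. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({'sku': ['A', 'A', 'B'],
                   'cost': [1.0, 1.0, 2.0]})

# keep='first' (the default) spares the first occurrence...
print(df.duplicated().tolist())            # [False, True, False]

# ...while keep=False marks all members of each duplicate group.
print(df.duplicated(keep=False).tolist())  # [True, True, False]
```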

We could shortcut this and just get back an answer on whether or not any duplicates exist with

df.duplicated().any()

True

We can verify that drop_duplicates actually did something by chaining our handy test for duplicates after invoking drop_duplicates

df.drop_duplicates().duplicated().any()

False

Great! It worked.
The deduplicated result can be preserved with

df = df.drop_duplicates()
df

         0                           1       2
0    26.11             WL-SMART-INWALL   36.85
1    39.23                WL-NMDISH-22   55.45
2    73.46     UKS/99727/AHD6502TU3CBK  104.00
4    23.00             UKS/99725/PMK01   32.30
6    16.00       UKS/99294/A3L791R15MS   22.51
8     9.29          UKS/98865/58018251   13.00
10    7.00  UKS/98509/90Q653AN1N0N2UA0    9.59
12  254.99              WL-MPS546-MESH  361.29
13   10.70          UKS/97771/FIBLCSC2   15.00
15   10.70          UKS/97627/FIBLCLC1   15.00
17  187.09                 WAS-WG-500P  265.00
18   16.00             UKS/99757/69975   22.46
20   10.70    UKS/99750/S26361F2293L10   15.00
22   19.00             UKS/99733/69973   26.71

Just to make sure

df.duplicated().any()

False

Conclusion
It worked fine for me. Hopefully this demonstration will help you clear up whatever is going wrong for you.

piRSquared answered Nov 15 '22 06:11