Efficiently Removing Very-Near-Duplicates From Python List

Background:
My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed, and during one of these processes it is deliberately copied, manipulated, cleaned of duplicates, and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining that accuracy to at least 14dp is vital. However, mathematical operations can of course introduce slight variations in my floats, such that two values are identical to 14dp but vary ever so slightly at 16dp, which means the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out I may well be overlooking something simple! I am just interested to see what others come up with :)

Question:
What is the most efficient way to remove very-near-duplicates from a potentially very large data set?

My Attempts:
I have tried rounding the values themselves to 14dp, but this is not satisfactory, as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp, and then removing one of the matching entries. A simplified sketch of this attempt follows.
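Roughly (a simplified sketch, not my exact code; dedupe_to_dp and dp are just illustrative names), it keys a set on the rounded value but keeps the original full-precision value, so the rounding never feeds back into later calculations:

def dedupe_to_dp(values, dp=14):
    # Two entries count as duplicates if they agree when rounded to
    # dp decimal places; the first (unrounded) occurrence is kept.
    seen = set()
    out = []
    for v in values:
        key = round(v, dp)
        if key not in seen:
            seen.add(key)
            out.append(v)
    return out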

Thank you in advance for any advice! Please let me know if there's anything you wish to be clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).

Clarification on 'Duplicates':
An example of one of my 'duplicate' pairs: 603.73066958946424 and 603.73066958946460; the solution should remove one of these two values.
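These parse to distinct float64 values (they differ by several ulps), which is why both survive a set() pass:

>>> len({603.73066958946424, 603.73066958946460})
2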

Note on decimal.Decimal:
This could work if it were guaranteed that the imported data did not already contain near-duplicates (which it often does).
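To illustrate with the example pair from above: converting floats that already differ to decimal.Decimal preserves their differing low bits, so the near-duplicates remain distinct:

from decimal import Decimal

# Each float converts exactly, differing bits and all
Decimal(603.73066958946424) == Decimal(603.73066958946460)  # False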

Asked Sep 14 '13 by MarkyD43



1 Answer

You really want to use NumPy if you're handling large quantities of data. Here's how I would do it:

Import NumPy:

import numpy as np

Generate 8000 high-precision floats (128 bits will be enough for your purposes, but note that I'm converting the 64-bit output of random to 128 bits just to fake it; use your real data here):

a = np.float128(np.random.random((8000,)))

Find the indexes of the unique elements in the rounded array:

_, unique = np.unique(a.round(decimals=14), return_index=True)

And take those indexes from the original (non-rounded) array:

no_duplicates = a[unique]
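One caveat: np.unique returns the index of the first occurrence of each rounded value, but sorted by value, so no_duplicates above comes out sorted rather than in the original order. If the original ordering matters, sort the indexes before using them:

no_duplicates = a[np.sort(unique)]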
Answered Sep 19 '22 by F.X.