Background:
My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed; during one of these processes it is deliberately copied, manipulated, cleaned of duplicates and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining this accuracy to at least 14 dp is vital. However, mathematical operations can of course introduce slight variations in my floats, such that two values are identical to 14 dp but vary ever so slightly at the 16th, which means the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out that I may well be overlooking something simple! I'm just interested to see what others come up with :)
Question:
What is the most efficient way to remove very-near-duplicates from a potentially very large data set?
My Attempts:
I have tried rounding the values themselves to 14 dp, but this is of course not satisfactory, as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'Pythonic' as it could be. My attempt involves finding the indices of list entries that match to x dp, and then removing one of the matching entries.
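Something along these lines is what I have in mind (a rough sketch only, assuming the data sits in a plain Python list of floats; names are placeholders):
def dedupe_to_dp(values, dp=14):
    # Use the value rounded to dp decimal places as a grouping key,
    # and keep the first original (unrounded) value seen for each key.
    seen = {}
    for v in values:
        key = round(v, dp)
        if key not in seen:
            seen[key] = v
    return list(seen.values())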
Thank you in advance for any advice! Please let me know if there's anything you'd like me to clarify, or of course if I'm overlooking something very simple (I may be at the point where I'm over-thinking it).
Clarification on 'Duplicates':
Example of one of my 'duplicate' pairs: 603.73066958946424 and 603.73066958946460; the solution would remove one of these values.
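To illustrate, these two values are stored as distinct floats, so set() keeps both:
pair = {603.73066958946424, 603.73066958946460}
print(len(pair))  # prints 2: the floats are not exactly equal, so set() does not merge them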
Note on decimal.Decimal:
This could work if it were guaranteed that all imported data did not already contain some near-duplicates (which it often does).
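For reference, a rough sketch of how Decimal-based quantising could look (the quantise helper is just for illustration, not part of the decimal module):
from decimal import Decimal

def quantise(x, dp=14):
    # Snap a float to dp decimal places as a Decimal, suitable as a dictionary/set key.
    return Decimal(repr(x)).quantize(Decimal(f"1e-{dp}"))

key = quantise(603.73066958946424)  # use as a de-duplication key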
If the order of the elements is not critical, duplicates can be removed with a set() or with NumPy's unique() function. To keep the order of the elements, we can use pandas functions, OrderedDict, the reduce() function, set() combined with sort(), or plain iterative approaches; one of these is sketched below.
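As a quick illustration, one of the order-preserving options (OrderedDict.fromkeys, exact duplicates only) looks like this:
from collections import OrderedDict

lst = [1, 1, 2, 3, 2, 2, 4, 5, 6, 2, 1]
# fromkeys keeps the first occurrence of each element and preserves the original order.
print(list(OrderedDict.fromkeys(lst)))  # [1, 2, 3, 4, 5, 6]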
Removing duplicates from a list with set(): the built-in set() returns only distinct elements, so converting a list such as [1, 1, 2, 3, 2, 2, 4, 5, 6, 2, 1] to a set removes the exact duplicates.
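For instance, with that list:
lst = [1, 1, 2, 3, 2, 2, 4, 5, 6, 2, 1]
# set() keeps only distinct elements; note that it does not preserve the original order.
print(set(lst))  # {1, 2, 3, 4, 5, 6}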
Using the groupby function from itertools, consecutive equal elements are grouped together, so keeping only one element per group removes the successive duplicates and leaves a single representative in the list, as sketched below.
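A minimal sketch of that approach (sorting first so that equal elements become adjacent; exact duplicates only):
from itertools import groupby

lst = [1, 1, 2, 3, 2, 2, 4, 5, 6, 2, 1]
# After sorting, equal elements sit next to each other, so taking one key per group
# drops the successive duplicates.
print([key for key, _group in groupby(sorted(lst))])  # [1, 2, 3, 4, 5, 6]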
You really want to use NumPy if you're handling large quantities of data. Here's how I would do it:
Import NumPy:
import numpy as np
Generate 8000 high-precision floats (128 bits will be enough for your purposes, but note that I'm converting the 64-bit output of random to 128 bits just to fake it; use your real data here):
a = np.float128(np.random.random((8000,)))
Find the indices of the unique elements in the rounded array:
_, unique = np.unique(a.round(decimals=14), return_index=True)
Then take those indices from the original (non-rounded) array:
no_duplicates = a[unique]
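As a quick sanity check (illustrative only), you can compare the sizes before and after:
print(a.shape, no_duplicates.shape)  # the de-duplicated array is never longer than the input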