Efficiently Removing Very-Near-Duplicates From Python List

Background:
My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed, and during one of these processes it is deliberately copied, manipulated, cleaned of duplicates, and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining that accuracy to at least 14dp is vital. However, mathematical operations can of course introduce slight variations in my floats, such that two values are identical to 14dp but vary ever so slightly at 16dp, which means the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out I may well be overlooking something simple! I am just interested to see what others come up with :)

Question:
What is the most efficient way to remove very-near-duplicates from a potentially very large data set?

My Attempts:
I have tried rounding the values themselves to 14dp, but this is not satisfactory, as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp, and then removing one of the matching entries. A simplified sketch of this attempt follows.
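Roughly (a simplified sketch, not my exact code; dedupe_to_dp and dp are just illustrative names), it keys a set on the rounded value but keeps the original full-precision value, so the rounding never feeds back into later calculations:

def dedupe_to_dp(values, dp=14):
    # Two entries count as duplicates if they agree when rounded to
    # dp decimal places; the first (unrounded) occurrence is kept.
    seen = set()
    out = []
    for v in values:
        key = round(v, dp)
        if key not in seen:
            seen.add(key)
            out.append(v)
    return out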

Thank you in advance for any advice! Please let me know if there's anything you wish to be clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).

Clarification on 'Duplicates':
An example of one of my 'duplicate' pairs: 603.73066958946424 and 603.73066958946460; the solution should remove one of these two values.
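These parse to distinct float64 values (they differ by several ulps), which is why both survive a set() pass:

>>> len({603.73066958946424, 603.73066958946460})
2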

Note on decimal.Decimal:
This could work if it were guaranteed that the imported data did not already contain near-duplicates (which it often does).
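To illustrate with the example pair from above: converting floats that already differ to decimal.Decimal preserves their differing low bits, so the near-duplicates remain distinct:

from decimal import Decimal

# Each float converts exactly, differing bits and all
Decimal(603.73066958946424) == Decimal(603.73066958946460)  # False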

Asked Sep 14 '13 by MarkyD43



1 Answer

You really want to use NumPy if you're handling large quantities of data. Here's how I would do it:

Import NumPy:

import numpy as np

Generate 8000 high-precision floats (128 bits will be enough for your purposes, but note that I'm converting the 64-bit output of random to 128 bits just to fake it; use your real data here):

a = np.float128(np.random.random((8000,)))

Find the indexes of the unique elements in the rounded array:

_, unique = np.unique(a.round(decimals=14), return_index=True)

And take those indexes from the original (non-rounded) array:

no_duplicates = a[unique]
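One caveat: np.unique returns the index of the first occurrence of each rounded value, but sorted by value, so no_duplicates above comes out sorted rather than in the original order. If the original ordering matters, sort the indexes before using them:

no_duplicates = a[np.sort(unique)]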
Answered Sep 19 '22 by F.X.