Comparing dictionaries based on a combination of keys

I have a list of records like this:

data = [
    {'id':1, 'name': 'A', 'price': 10, 'url': 'foo'},
    {'id':2, 'name': 'A', 'price': 20, 'url': 'bar'},
    {'id':3, 'name': 'A', 'price': 30, 'url': 'baz'},
    {'id':4, 'name': 'A', 'price': 10, 'url': 'baz'},
    {'id':5, 'name': 'A', 'price': 20, 'url': 'bar'},
    {'id':6, 'name': 'A', 'price': 30, 'url': 'foo'},
    {'id':7, 'name': 'A', 'price': 99, 'url': 'quu'},
    {'id':8, 'name': 'B', 'price': 10, 'url': 'foo'},
]

I want to remove records that are "duplicates", where equality is defined by a list of logical conditions. Each element in the list is an OR condition, and all elements are ANDed together. For example:

filters = [  ['name'],   ['price', 'url']  ]

means that two records are considered equal if their name AND (their price OR url) are equal. For the above example:

  • For item 1 the duplicates are 4 (by name and price) and 6 (by name and url)
  • For item 2, the duplicate is 5 (by both name+price and name+url)
  • For item 3, the duplicates are 4 (by name+url) and 6 (by name+price)
  • For item 7 there are no duplicates (neither price nor url match)
  • For item 8 there are no duplicates (name doesn't match)

So the resulting list must contain items 1, 2, 3, 7 and 8.

Please take into account that

  • there might be more AND conditions: ['name'], ['price', 'url'], ['weight'], ['size'], ...
  • the OR groups in the conditions list can be longer than 2 items, e.g. ['name'], ['price', 'url', 'weight']...
  • the source list is very long, so an O(n^2) algorithm is out of the question
asked Dec 16 '13 by georg


2 Answers

The way to avoid doing this in O(n^2) time is to build an index for each query you want to do. Once you've got the machinery to query any value in constant time, your O(n^2) turns into O(n), trivially. And you can build all of the indices in O(n) time, too.

Assuming each of your values has the same fields, it will look like this:

from collections import defaultdict

indices = defaultdict(lambda: defaultdict(set))
for i, row in enumerate(data):
    for field in 'id', 'name', 'price', 'url':
        key = row[field]
        indices[field][key].add(i)

Now, to search for a specific value, it's just this:

def search(field, key):
    return (data[index] for index in indices[field][key])

To search for a group of values ORed together, just search them separately and take the set.union of the results, like this:

from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

def search_disj(factors):
    sets = (indices[field][key] for field, key in factors)
    return (data[index] for index in reduce(set.union, sets))

And to search for a group of disjunctions ANDed together, do the same thing for each one, and then take the set.intersection of all of the results.
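A minimal sketch of that AND step, reusing the indices and data built above (search_conj and the (field, key) pair format are my own naming, not part of the original answer):

from functools import reduce

def search_conj(disjunctions):
    # disjunctions: a list of OR groups, each group a list of (field, key) pairs,
    # e.g. [[('name', 'A')], [('price', 10), ('url', 'foo')]]
    group_sets = (reduce(set.union, (indices[field][key] for field, key in group))
                  for group in disjunctions)
    return (data[index] for index in reduce(set.intersection, group_sets))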

Depending on your data, it can be more efficient to just look up the first index, then linearly search the results for the other factors. You can optimize that further by reordering fields so you search the one with the smallest len(indices[field]) first. (Or, in this case, the one with the smallest sum(len(indices[field]) for field in disj).)
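A hedged sketch of that heuristic (search_conj_ordered is a name made up for this sketch): collect candidates through the OR group whose indices are smallest, then verify the remaining groups by checking the candidate rows directly.

from functools import reduce

def search_conj_ordered(disjunctions):
    # Search the OR group with the smallest combined index size first,
    # then linearly check the other groups against the candidate rows.
    ordered = sorted(disjunctions,
                     key=lambda group: sum(len(indices[field]) for field, _ in group))
    first, rest = ordered[0], ordered[1:]
    candidates = reduce(set.union, (indices[field][key] for field, key in first))
    return (data[i] for i in candidates
            if all(any(data[i][field] == key for field, key in group)
                   for group in rest))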

If you want to allow arbitrary nesting (conjunctions of disjunctions of conjunctions of … until you get down to single elements), you'll just have the functions call each other mutually recursively (with a base case for flat elements). You can even extend this to fully general boolean search, although you'll also need a not operation (universe - indices[field][key], where universe = set(range(len(data)))) for that.
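Here is a hedged sketch of that recursive evaluation, over a tiny nested-expression format invented just for illustration: ('and', [subexprs]), ('or', [subexprs]), ('not', subexpr), or a flat (field, key) pair as the base case.

from functools import reduce

universe = set(range(len(data)))

def evaluate(expr):
    op = expr[0]
    if op == 'and':
        return reduce(set.intersection, (evaluate(e) for e in expr[1]))
    if op == 'or':
        return reduce(set.union, (evaluate(e) for e in expr[1]))
    if op == 'not':
        return universe - evaluate(expr[1])
    field, key = expr                     # base case: a flat (field, key) lookup
    return indices[field][key]

# e.g. name == 'A' AND (price == 10 OR NOT url == 'foo')
matches = [data[i] for i in evaluate(('and', [('name', 'A'),
                                              ('or', [('price', 10),
                                                      ('not', ('url', 'foo'))])]))]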


If the data are very large, you may not be able to store all the indices in memory.

Or, even if you can store all the indices in memory, cache and even page misses may make a hash table less than ideal, in which case you'll probably want to consider something based on a B-tree (e.g., blist.sorteddict) instead of a dict. This also gives you the advantage that you can search for ranges of values, order results, etc. The disadvantage is that all those n times become n log n, but if you need the functionality, or if you get a two-orders-of-magnitude locality benefit in exchange for a log(n, base) cost that turns out to be only 7, it can be worth it.
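For the range-search point, a stdlib-only sketch of the same idea using a sorted list plus bisect instead of a B-tree (price_index and search_range are hypothetical names, and only the price field is indexed here):

import bisect

# one sorted index per field; shown here for 'price' only
price_index = sorted((row['price'], i) for i, row in enumerate(data))

def search_range(lo, hi):
    # all rows whose price falls in the closed interval [lo, hi]
    left = bisect.bisect_left(price_index, (lo, -1))
    right = bisect.bisect_right(price_index, (hi, float('inf')))
    return (data[i] for _, i in price_index[left:right])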

Or, alternatively, use some sort of disk-backed dict-like storage, like an anydbm.


However, really, what you're building is a relational database with only a single relation (table). In many cases, you'll be better off just using an off-the-shelf relational database—like sqlite3, which Python comes with built in. Then the code to build an index just looks like this:

db.execute('CREATE INDEX id_idx ON data (id)')

… and you can just do queries and they magically use the right indices in the best possible way:

curs = db.execute('SELECT * FROM data WHERE name = ? AND (price = ? OR url = ?)',
                  (name, price, url))  # bind the values to match, not the field names
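For completeness, here is a small end-to-end sketch of that approach against an in-memory database (the table layout and the example row are choices made for illustration, not part of the original answer):

import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE data (id INTEGER, name TEXT, price INTEGER, url TEXT)')
db.executemany('INSERT INTO data VALUES (?, ?, ?, ?)',
               [(d['id'], d['name'], d['price'], d['url']) for d in data])
for field in ('name', 'price', 'url'):
    db.execute('CREATE INDEX %s_idx ON data (%s)' % (field, field))

# all rows that duplicate item 1 under name AND (price OR url)
curs = db.execute(
    'SELECT * FROM data WHERE name = ? AND (price = ? OR url = ?) AND id != ?',
    ('A', 10, 'foo', 1))
print(curs.fetchall())   # rows 4 and 6 from the sample data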
answered Oct 30 '22 by abarnert

Based on Tim Pietzcker's idea, the following works for me:

We start by converting the CNF condition, like a&(b|c), into a DNF: (a&b)|(a&c). Using the list notation as in the question, i.e. [ [a], [b, c] ], the DNF will be [ [a, b], [a, c] ]. In Python this is as simple as itertools.product(*filters).
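For the filters from the question, that conversion gives:

import itertools

filters = [['name'], ['price', 'url']]
print(list(itertools.product(*filters)))   # [('name', 'price'), ('name', 'url')]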

Then we iterate over the list and for each conjunct in the DNF create a composite key:

( (a, rec[a]), (b, rec[b]) )

and check if any of the keys have been seen already. If not, we consider the record to be unique and add its keys to the seen set:

The code:

import itertools

seen = set()
dnf = list(itertools.product(*filters))

for item in data:
    keys = set(
        tuple((field, item.get(field, None)) for field in conjunct)
        for conjunct in dnf)
    if keys.isdisjoint(seen):
        seen |= keys
        print(item)  # unique

Kudos to Tim for giving me an idea. If anyone sees any problems with this solution, please let me know.
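For convenience, the same logic wrapped in a function (dedupe is just a name picked for this sketch), with a quick check against the sample data from the question:

import itertools

def dedupe(records, filters):
    dnf = list(itertools.product(*filters))
    seen = set()
    unique = []
    for item in records:
        keys = set(tuple((field, item.get(field, None)) for field in conjunct)
                   for conjunct in dnf)
        if keys.isdisjoint(seen):      # none of the composite keys seen before
            seen |= keys
            unique.append(item)
    return unique

print([item['id'] for item in dedupe(data, [['name'], ['price', 'url']])])   # [1, 2, 3, 7, 8]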

answered Oct 30 '22 by georg