Comparing dictionaries based on a combination of keys

I have a list of records like this:

data = [
    {'id':1, 'name': 'A', 'price': 10, 'url': 'foo'},
    {'id':2, 'name': 'A', 'price': 20, 'url': 'bar'},
    {'id':3, 'name': 'A', 'price': 30, 'url': 'baz'},
    {'id':4, 'name': 'A', 'price': 10, 'url': 'baz'},
    {'id':5, 'name': 'A', 'price': 20, 'url': 'bar'},
    {'id':6, 'name': 'A', 'price': 30, 'url': 'foo'},
    {'id':7, 'name': 'A', 'price': 99, 'url': 'quu'},
    {'id':8, 'name': 'B', 'price': 10, 'url': 'foo'},
]

I want to remove records that are "duplicates", where equality is defined by a list of logical conditions. Each element in the list is an OR condition, and all elements are ANDed together. For example:

filters = [  ['name'],   ['price', 'url']  ]

means that two records are considered equal if their name AND (their price OR url) are equal. For the above example:

  • For item 1 the duplicates are 4 (by name and price) and 6 (by name and url)
  • For item 2, the duplicate is 5 (by both name+price and name+url)
  • For item 3, the duplicates are 4 (by name+url) and 6 (by name+price)
  • For item 7 there are no duplicates (neither price nor url match)
  • For item 8 there are no duplicates (name doesn't match)

So the resulting list must contain items 1, 2, 3, 7 and 8.

Please take into account that

  • there might be more AND conditions: ['name'], ['price', 'url'], ['weight'], ['size'], ...
  • the OR groups in the conditions list can be longer than 2 items, e.g. ['name'], ['price', 'url', 'weight']...
  • the source list is very long, so an O(n^2) algorithm is out of the question
asked Dec 16 '13 by georg


2 Answers

The way to avoid doing this in O(n^2) time is to build an index for each query you want to do. Once you've got the machinery to query any value in constant time, your O(n^2) turns into O(n), trivially. And you can build all of the indices in O(n) time, too.

Assuming each of your values has the same fields, it will look like this:

from collections import defaultdict

indices = defaultdict(lambda: defaultdict(set))
for i, row in enumerate(data):
    for field in 'id', 'name', 'price', 'url':
        key = row[field]
        indices[field][key].add(i)

Now, to search for a specific value, it's just this:

def search(field, key):
    return (data[index] for index in indices[field][key])

To search for a group of values ORed together, just search them separately and take the set.union of the results, like this:

from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

def search_disj(factors):
    sets = (indices[field][key] for field, key in factors)
    return (data[index] for index in reduce(set.union, sets))

And to search for a group of disjunctions ANDed together, do the same thing for each one, and then take the set.intersection of all of the results.
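A minimal sketch of that AND step, reusing the indices and data built above (search_conj and the (field, key) pair format are my own naming, not part of the original answer):

from functools import reduce

def search_conj(disjunctions):
    # disjunctions: a list of OR groups, each group a list of (field, key) pairs,
    # e.g. [[('name', 'A')], [('price', 10), ('url', 'foo')]]
    group_sets = (reduce(set.union, (indices[field][key] for field, key in group))
                  for group in disjunctions)
    return (data[index] for index in reduce(set.intersection, group_sets))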

Depending on your data, it can be more efficient to just look up the first index, then linearly search the results for the other factors. You can optimize that further by reordering fields so you search the one with the smallest len(indices[field]) first. (Or, in this case, the one with the smallest sum(len(indices[field]) for field in disj).)
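A hedged sketch of that heuristic (search_conj_ordered is a name made up for this sketch): collect candidates through the OR group whose indices are smallest, then verify the remaining groups by checking the candidate rows directly.

from functools import reduce

def search_conj_ordered(disjunctions):
    # Search the OR group with the smallest combined index size first,
    # then linearly check the other groups against the candidate rows.
    ordered = sorted(disjunctions,
                     key=lambda group: sum(len(indices[field]) for field, _ in group))
    first, rest = ordered[0], ordered[1:]
    candidates = reduce(set.union, (indices[field][key] for field, key in first))
    return (data[i] for i in candidates
            if all(any(data[i][field] == key for field, key in group)
                   for group in rest))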

If you want to allow arbitrary nesting (conjunctions of disjunctions of conjunctions of … until you get down to single elements), you'll just have the functions call each other mutually recursively (with a base case for flat elements). You can even extend this to fully general boolean search, although you'll also need a not operation (universe - indices[field][key], where universe = set(range(len(data)))) for that.
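Here is a hedged sketch of that recursive evaluation, over a tiny nested-expression format invented just for illustration: ('and', [subexprs]), ('or', [subexprs]), ('not', subexpr), or a flat (field, key) pair as the base case.

from functools import reduce

universe = set(range(len(data)))

def evaluate(expr):
    op = expr[0]
    if op == 'and':
        return reduce(set.intersection, (evaluate(e) for e in expr[1]))
    if op == 'or':
        return reduce(set.union, (evaluate(e) for e in expr[1]))
    if op == 'not':
        return universe - evaluate(expr[1])
    field, key = expr                     # base case: a flat (field, key) lookup
    return indices[field][key]

# e.g. name == 'A' AND (price == 10 OR NOT url == 'foo')
matches = [data[i] for i in evaluate(('and', [('name', 'A'),
                                              ('or', [('price', 10),
                                                      ('not', ('url', 'foo'))])]))]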


If the data are very large, you may not be able to store all the indices in memory.

Or, even if you can store all the indices in memory, cache and even page misses may make a hash table less than ideal, in which case you'll probably want to consider something based on a B-tree (e.g., blist.sorteddict) instead of a dict. This also gives you the advantage that you can search for ranges of values, order results, etc. The disadvantage is that all those n times become n log n, but if you need the functionality, or if you get a two-orders-of-magnitude locality benefit in exchange for a log(n, base) cost that turns out to be only 7, it can be worth it.
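For the range-search point, a stdlib-only sketch of the same idea using a sorted list plus bisect instead of a B-tree (price_index and search_range are hypothetical names, and only the price field is indexed here):

import bisect

# one sorted index per field; shown here for 'price' only
price_index = sorted((row['price'], i) for i, row in enumerate(data))

def search_range(lo, hi):
    # all rows whose price falls in the closed interval [lo, hi]
    left = bisect.bisect_left(price_index, (lo, -1))
    right = bisect.bisect_right(price_index, (hi, float('inf')))
    return (data[i] for _, i in price_index[left:right])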

Or, alternatively, use some sort of disk-backed dict-like storage, like an anydbm.


However, really, what you're building is a relational database with only a single relation (table). In many cases, you'll be better off just using an off-the-shelf relational database—like sqlite3, which Python comes with built in. Then the code to build an index just looks like this:

db.execute('CREATE INDEX id_idx ON data (id)')

… and you can just do queries and they magically use the right indices in the best possible way:

curs = db.execute('SELECT * FROM data WHERE name = ? AND (price = ? OR url = ?)',
                  (name, price, url))  # bind the values to match, not the field names
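For completeness, here is a small end-to-end sketch of that approach against an in-memory database (the table layout and the example row are choices made for illustration, not part of the original answer):

import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE data (id INTEGER, name TEXT, price INTEGER, url TEXT)')
db.executemany('INSERT INTO data VALUES (?, ?, ?, ?)',
               [(d['id'], d['name'], d['price'], d['url']) for d in data])
for field in ('name', 'price', 'url'):
    db.execute('CREATE INDEX %s_idx ON data (%s)' % (field, field))

# all rows that duplicate item 1 under name AND (price OR url)
curs = db.execute(
    'SELECT * FROM data WHERE name = ? AND (price = ? OR url = ?) AND id != ?',
    ('A', 10, 'foo', 1))
print(curs.fetchall())   # rows 4 and 6 from the sample data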
answered Oct 30 '22 by abarnert

Based on Tim Pietzcker's idea, the following works for me:

We start by converting the CNF condition, like a&(b|c), into a DNF: (a&b)|(a&c). Using the list notation as in the question, i.e. [ [a], [b, c] ], the DNF will be [ [a, b], [a, c] ]. In Python this is as simple as itertools.product(*filters).
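For the filters from the question, that conversion gives:

import itertools

filters = [['name'], ['price', 'url']]
print(list(itertools.product(*filters)))   # [('name', 'price'), ('name', 'url')]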

Then we iterate over the list and for each conjunct in the DNF create a composite key:

( (a, rec[a]), (b, rec[b]) )

and check if any of the keys have been seen already. If not, we consider the record to be unique and add its keys to the seen set:

The code:

import itertools

seen = set()
dnf = list(itertools.product(*filters))

for item in data:
    keys = set(
        tuple((field, item.get(field, None)) for field in conjunct)
        for conjunct in dnf)
    if keys.isdisjoint(seen):
        seen |= keys
        print(item)  # unique

Kudos to Tim for giving me an idea. If anyone sees any problems with this solution, please let me know.
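For convenience, the same logic wrapped in a function (dedupe is just a name picked for this sketch), with a quick check against the sample data from the question:

import itertools

def dedupe(records, filters):
    dnf = list(itertools.product(*filters))
    seen = set()
    unique = []
    for item in records:
        keys = set(tuple((field, item.get(field, None)) for field in conjunct)
                   for conjunct in dnf)
        if keys.isdisjoint(seen):      # none of the composite keys seen before
            seen |= keys
            unique.append(item)
    return unique

print([item['id'] for item in dedupe(data, [['name'], ['price', 'url']])])   # [1, 2, 3, 7, 8]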

answered Oct 30 '22 by georg