Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fast data comparison in python

I want to compare a large set of data in the form of 2 dictionaries of varying lengths. (edit)

post = {0: [0.96180319786071777, 0.37529754638671875], 
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881],...}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122],...}

The answer we need to get is 5065: [0.80861318111419678, 0.76381617784500122]. This is based on the fact that we are only comparing the values and not the indices at all.

I am using this key value pair only to remember the sequence of data. The data type can be replaced with a list/set if need be. I need to find out the key:value (index and value) pairs of the elements that are not in common to the dictionaries.

The code that I am using is very simple..

new = {}
found = []

for i in range(0, len(post)): 
    found= []
    for j in range(0, len(pre)): 
        if post[i] not in pre.values():
            if post[i] not in new:
                new[i] = post[i]
                found.append(j)             
                break
    if found:
        for f in found: pre.pop(f)

new{} contains the elements I need. The problem I am facing is that this process is too slow. It takes sometimes over an hour to process. The data can be much larger at times. I need it to be faster.

Is there an efficient way of doing what I am trying to achieve ? I would like it if we dont depend on external packages apart from those bundled with python 2.5 (64 bit) unless absolutely necessary.

Thank you all.

like image 493
sbetharia Avatar asked Nov 27 '25 13:11

sbetharia


1 Answers

This is basically what sets are designed for (computing differences in sets of items). The only gotcha is that the things you put into a set need to be hashable, and lists aren't. However, tuples are, so if you convert to that, you can put those into a set:

post_set = set(tuple(x) for x in post.itervalues())
pre_set = set(tuple(x) for x in pre.itervalues())

items_in_only_one_set = post_set ^ pre_set

For more about sets: http://docs.python.org/library/stdtypes.html#set

To get the original indices after you've computed the differences, what you'd probably want is to generate reverse lookup tables:

post_indices = dict((tuple(v),k) for k,v in post.iteritems())
pre_indices = dict((tuple(v),k) for k,v in pre.iteritems())

Then you can just take a given tuple and look up its index via the dictionaries:

index = post_indices.get(a_tuple, pre_indices.get(a_tuple))
like image 189
Amber Avatar answered Nov 29 '25 02:11

Amber



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!