fast data comparison in python

Question

I want to compare a large set of data held in two dictionaries of different lengths.

post = {0: [0.96180319786071777, 0.37529754638671875], 
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881],...}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122],...}

The expected result is 5065: [0.80861318111419678, 0.76381617784500122], because only the values are compared, not the indices.

I use the key: value pairs only to remember the order of the data, so the dictionaries can be replaced with a list or set if need be. I need the key: value (index and value) pairs of the elements that the two dictionaries do not have in common.

The code that I am using is very simple:

new = {}
found = []

for i in range(0, len(post)): 
    found= []
    for j in range(0, len(pre)): 
        if post[i] not in pre.values():
            if post[i] not in new:
                new[i] = post[i]
                found.append(j)             
                break
    if found:
        for f in found: pre.pop(f)

The new dictionary ends up holding the elements I need. The problem is that this process is too slow: it sometimes takes over an hour, and the data can be much larger at times.

Is there an efficient way of doing what I am trying to achieve? I would prefer not to depend on external packages beyond those bundled with Python 2.5 (64-bit) unless absolutely necessary.

Thank you all.

Amber · Accepted Answer

This is basically what sets are designed for (computing differences in sets of items). The only gotcha is that the things you put into a set need to be hashable, and lists aren't. However, tuples are, so if you convert to that, you can put those into a set:

post_set = set(tuple(x) for x in post.itervalues())
pre_set = set(tuple(x) for x in pre.itervalues())

items_in_only_one_set = post_set ^ pre_set

For more about sets: http://docs.python.org/library/stdtypes.html#set
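As a rough sketch of the above, here is the same idea run on the sample values from the question (the entries the question elides are simply left out). It uses `.values()` instead of `.itervalues()` so it should run on both Python 2 and Python 3; the names `list_is_hashable`, `only_in_pre` and `only_in_post` are just illustrative.

```python
# Sample data copied from the question (elided entries omitted).
post = {0: [0.96180319786071777, 0.37529754638671875],
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881]}

pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122]}

# Lists are unhashable, so putting one into a set raises TypeError.
try:
    set([post[0]])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples of floats are hashable, so convert each value first.
post_set = set(tuple(x) for x in post.values())
pre_set = set(tuple(x) for x in pre.values())

# Symmetric difference: values present in exactly one of the two dicts.
items_in_only_one_set = post_set ^ pre_set

# One-sided differences, in case the direction matters.
only_in_pre = pre_set - post_set
only_in_post = post_set - pre_set
```

Each set is built in a single pass and set membership is a hash lookup on average, whereas the loop in the question rescans `pre.values()` for every element, which is most likely where the time goes.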

To get the original indices after you've computed the differences, what you'd probably want is to generate reverse lookup tables:

post_indices = dict((tuple(v),k) for k,v in post.iteritems())
pre_indices = dict((tuple(v),k) for k,v in pre.iteritems())

Then you can just take a given tuple and look up its index via the dictionaries:

index = post_indices.get(a_tuple, pre_indices.get(a_tuple))
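Putting the pieces together, one possible way to get back the key: value pairs is sketched below. The helper name `uncommon_items` is hypothetical, and the sketch assumes each value occurs at most once per dictionary; if a value repeats, the reverse lookup table keeps only one of its keys. It returns the two sides separately so that equal keys in `post` and `pre` cannot overwrite each other.

```python
def uncommon_items(post, pre):
    """Return (only_post, only_pre): {key: value} dicts for the values
    that appear in just one of the two input dicts.

    Hypothetical helper. Assumes a value occurs at most once per dict;
    for repeated values only one of the keys is reported.
    """
    # Reverse lookup tables: tuple(value) -> original key.
    post_indices = dict((tuple(v), k) for k, v in post.items())
    pre_indices = dict((tuple(v), k) for k, v in pre.items())

    # The keys of the reverse tables are the distinct values.
    post_set = set(post_indices)
    pre_set = set(pre_indices)

    only_post = dict((post_indices[t], list(t)) for t in post_set - pre_set)
    only_pre = dict((pre_indices[t], list(t)) for t in pre_set - post_set)
    return only_post, only_pre


# Usage with the sample data from the question (elided entries omitted).
post = {0: [0.96180319786071777, 0.37529754638671875],
        10: [0.20612385869026184, 0.17849941551685333],
        20: [0.20612400770187378, 0.17510984838008881]}
pre = {0: [0.96180319786071777, 0.37529754638671875],
       1: [0.20612385869026184, 0.17849941551685333],
       2: [0.20612400770187378, 0.17510984838008881],
       5065: [0.80861318111419678, 0.76381617784500122]}

only_post, only_pre = uncommon_items(post, pre)
```

On this sample, `only_pre` holds the 5065 entry the question asks for and `only_post` is empty.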

Tags: python, dictionary, list, comparison

sbetharia

1 Answers

Amber
