Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently find differences in elements of a list

Tags:

python

I'm trying to find the differences between an element of a list and all elements of a list for each element of that list. I think that's how I'd describe it? Let's use an example instead:

idx = [1, 4, 8, 10, 22]

Based on this question, it would be simple enough to just zip the elements together, but that would result in only one pairwise comparison per element, whereas I need to make multiple pairwise comparisons for each element.

My approach was to do the following:

diffs = [abs(i-j) for i in idx for j in idx]

But this is incredibly inefficient. What is the most efficient way to calculate the differences for all of the following tuples?

comparisons = [(1,4), (1,8), (1,10), (1, 22), 
               (4,8), (4,10), (4,22), 
               (8,10, 8,22)]

NOTE:

I would prefer answers using base Python 2.7, but would also like to know how to do this with Numpy, as I'm sure that package makes it easier.

like image 285
tblznbits Avatar asked Dec 18 '25 13:12

tblznbits


2 Answers

You can write a generator:

idx = [1, 4, 8, 10, 22]

def differences(nums):
    n = len(nums)
    for i in xrange(n-1):
        for j in xrange(i+1,n):
            yield abs(nums[i]-nums[j])

for d in differences(idx): print d

Output:

3
7
9
21
4
6
18
2
14
12

This produces the differences one by one, with very little memory overhead. Also, using the idea of @RoryDaulton , this doesn't bother computing differences that are redundant or known to be zero.

like image 166
John Coleman Avatar answered Dec 21 '25 01:12

John Coleman


Note that your result will actually hold n*n elements. So if your length is 25,039 (as stated in the comments) you have to take care not to hit a MemoryError with these 630 million elements.

One way would be to go numpy and choose a small enough dtype:

import numpy as np
arr = np.array(your_list, dtype=np.int16)
diffs = arr[:, None] - arr   # this will be a 25039 x 25039 array containing your desired differences.

Even using np.int16 you'll need a lot of memory (I used an array size of 25000 here):

>>> diffs.nbytes
1_250_000_000

If you're operating on int32 you'll roughly need 2.5GB to hold the result - with longs (pythons default if you're not in the unlimited precision integer range) this goes up to 5GB ignoring any python object overhead. Note that the range of values for int16 is only −32768 to 32767 so you may need to choose a bigger dtype if required.

Besides the memory-issues there is no way to beat numpy in terms of efficiency for such operations. It took roughly 1.2s on my computer to calculate the 25k x 25k array of differences.


If you need the number of occurences where the difference is smaller than some value, then numpy again provides an easy solution:

num = np.sum(np.abs(diffs) < some_value) 
like image 42
MSeifert Avatar answered Dec 21 '25 01:12

MSeifert



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!