Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently get minimum values for each pair of elements from two arrays in a third array

I have two arrays of N floats (which act as (x,y) coordinates and might have duplicates) and and associated z array of N floats (which act as weights for the coordinates).

For each (x,y) pair of floats I need to select the pair with the smallest associated z value. I've defined a selectMinz() function which does this (see code below) but it takes too long.

How could I improve the performance of this function?

import numpy as np
import time


def getData():
    N = 100000
    x = np.arange(0.0005, 0.03, 0.001)
    y = np.arange(6., 10., .05)
    # Select N values for x,y, where values can be repeated
    x = np.random.choice(x, N)
    y = np.random.choice(y, N)
    z = np.random.uniform(10., 15., N)
    return x, y, z


def selectMinz(x, y, z):
    """
    Select the minimum z for each (x,y) pair.
    """
    xy_unq, z_unq = [], []
    # For each (x,y) pair
    for i, xy in enumerate(zip(*[x, y])):
        # If this xy pair was already stored in the xy_unq list
        if xy in xy_unq:
            # If the stored z value associated with this xy pair is
            # larger than this new z[i] value
            if z_unq[xy_unq.index(xy)] > z[i]:
                # Store this smaller value instead
                z_unq[xy_unq.index(xy)] = z[i]
        else:
            # Store the xy pair, and its associated z value
            xy_unq.append(xy)
            z_unq.append(z[i])

    return xy_unq, z_unq


# Define data with the proper format.
x, y, z = getData()

s = time.clock()
xy_unq, z_unq = selectMinz(x, y, z)  # <-- TAKES TOO LONG (~15s in my system)
print(time.clock() - s)
like image 771
Gabriel Avatar asked Aug 25 '17 18:08

Gabriel


2 Answers

Steps :

  1. Use lex-sort to make the x-y pairs come in sequence. Or, we can use a scaling method to scale one of the arrays by the range of values on the other and then sum it with the other array and finally use argsort to get the lex-sorted equivalent indices.
  2. Use np.minimum.reduceat to get the minimum values in intervals, defined by the pair groupings.

Thus, we would have one vectorized solution, like so -

def selectMinz_vectorized(x, y, z):
    # Get grouped lex-sort indices
    sidx = (y + x*(y.max() - y.min() + 1)).argsort()
    # or sidx = np.lexsort([x, y])

    # Lex-sort x, y, z
    x_sorted = x[sidx]
    y_sorted = y[sidx]
    z_sorted = z[sidx]

    # Get equality mask between each sorted X and Y elem against previous ones.
    # The non-zero indices of its inverted mask gives us the indices where the 
    # new groupings start. We are calling those as cut_idx.
    seq_eq_mask = (x_sorted[1:] == x_sorted[:-1]) & (y_sorted[1:] == y_sorted[:-1])
    cut_idx = np.flatnonzero(np.concatenate(( [True], ~seq_eq_mask)))

    # Use those cut_idx to get intervalled minimum values
    minZ = np.minimum.reduceat(z_sorted, cut_idx)

    # Make tuples of the groupings of x,y and the corresponding min Z values
    return (zip(x_sorted[cut_idx], y_sorted[cut_idx]), minZ.tolist())

Sample run -

In [120]: np.c_[x,y,z]
Out[120]: 
array([[  0.,   1.,  69.],
       [  2.,   0.,  47.],
       [  1.,   0.,  62.],
       [  0.,   2.,  33.],
       [  1.,   7.,  32.],
       [  1.,   0.,  50.],
       [  2.,   0.,  55.]])

In [121]: selectMinz(x,y,z) # original method
Out[121]: 
([(0.0, 1.0), (2.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.0, 7.0)],
 [69.0, 47.0, 50.0, 33.0, 32.0])

In [122]: selectMinz_vectorized(x,y,z)
Out[122]: 
([(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 2.0), (1.0, 7.0)],
 [50.0, 47.0, 69.0, 33.0, 32.0])

Here's my initial approach that involved creating a stacked array and then perform those operations. The implementation looked something like this -

def selectMinz_vectorized_v2(x, y, z):
    d = np.column_stack((x,y,z))
    sidx = np.lexsort(d[:,:2].T)
    b = d[sidx]  
    cut_idx = np.r_[0,np.flatnonzero(~(b[1:,:2] == b[:-1,:2]).all(1))+1]
    minZ = np.minimum.reduceat(b[:,-1], cut_idx)
    return ([tuple(i) for i in b[cut_idx,:2]], minZ.tolist())

Benchmarking for the vectorized approaches

Approaches -

# Pruned version of the approach posted earlier
def selectMinz_vectorized_pruned(x, y, z):
    sidx = (y + x*(y.max() - y.min() + 1)).argsort()
    x_sorted = x[sidx]
    y_sorted = y[sidx]
    z_sorted = z[sidx]
    seq_eq_mask = (x_sorted[1:] == x_sorted[:-1]) & (y_sorted[1:] == y_sorted[:-1])
    cut_idx = np.flatnonzero(np.concatenate(( [True], ~seq_eq_mask)))
    minZ = np.minimum.reduceat(z_sorted, cut_idx)
    return x_sorted[cut_idx], y_sorted[cut_idx], minZ

def numpy_indexed_app(x,y,z): # @Eelco Hoogendoorn's soln
    return npi.group_by((x, y)).min(z)

Timings -

In [141]: x,y,z=getData(10000)

In [142]: %timeit selectMinz_vectorized_pruned(x, y, z)
     ...: %timeit numpy_indexed_app(x,y,z)
     ...: 
1000 loops, best of 3: 763 µs per loop
1000 loops, best of 3: 1.09 ms per loop

In [143]: x,y,z=getData(100000)

In [144]: %timeit selectMinz_vectorized_pruned(x, y, z)
     ...: %timeit numpy_indexed_app(x,y,z)
     ...: 
100 loops, best of 3: 8.53 ms per loop
100 loops, best of 3: 12.9 ms per loop
like image 83
Divakar Avatar answered Nov 15 '22 05:11

Divakar


Changing the data structure of xy_unq and z_unq to a dictionary that holds both pieces of information brought the time down from ~7s to ~0.1s on my system.

def selectMinz(x, y, z):
    """
    Select the minimum z for each (x,y) pair.
    """
    xy_unq = {}
    # For each (x,y) pair
    for i, xy in enumerate(zip(*[x, y])):
        # If this xy pair was already stored in the xy_unq list
        if xy in xy_unq:
            # If the stored z value associated with this xy pair is
            # larger than this new z[i] value
            if xy_unq[xy] > z[i]:
                # Store this smaller value instead
                xy_unq[xy] = z[i]
        else:
            # Store the xy pair, and its associated z value
            xy_unq[xy] = z[i]

    return xy_unq.keys(), xy_unq.values()

The time for the above approach for me ranged from ~0.106s to ~0.11s. Here is an alternative approach with fewer lines of code, but takes slightly more time (~0.14):

def selectMinz(x, y, z):
    """
    Select the minimum z for each (x,y) pair.
    """
    xy_unq = {}
    # For each (x,y) pair
    for i, xy in enumerate(zip(*[x, y])):
        # If this xy pair was already stored in the xy_unq list
        if xy in xy_unq:
            # Store the value that is smaller between the current stored value and the new z[i]
            xy_unq[xy] = min(xy_unq[xy], z[i])
        else:
            # Store the xy pair, and its associated z value
            xy_unq[xy] = z[i]

    return xy_unq.keys(), xy_unq.values()
like image 34
Antimony Avatar answered Nov 15 '22 05:11

Antimony