Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Intersection of two arrays, retaining order in larger array

I have a numpy array a of length n, which has the numbers 0 through n-1 shuffled in some way. I also have a numpy array mask of length <= n, containing some subset of the elements of a, in a different order.

The query I want to compute is "give me the elements of a that are also in mask in the order that they appear in a".

I had a similar question here, but the difference was that mask was a boolean mask instead of a mask on the individual elements.

I've outlined and tested 4 methods below:

import timeit
import numpy as np
import matplotlib.pyplot as plt

n_test = 100
n_coverages = 10

np.random.seed(0)


def method1():
    return np.array([x for x in a if x in mask])


def method2():
    s = set(mask)
    return np.array([x for x in a if x in s])


def method3():
    return a[np.in1d(a, mask, assume_unique=True)]


def method4():
    bmask = np.full((n_samples,), False)
    bmask[mask] = True
    return a[bmask[a]]


methods = [
    ('naive membership', method1),
    ('python set', method2),
    ('in1d', method3),
    ('binary mask', method4)
]

p_space = np.linspace(0, 1, n_coverages)
for n_samples in [1000]:
    a = np.arange(n_samples)
    np.random.shuffle(a)

    for label, method in methods:
        if method == method1 and n_samples == 10000:
            continue
        times = []
        for coverage in p_space:
            mask = np.random.choice(a, size=int(n_samples * coverage), replace=False)
            time = timeit.timeit(method, number=n_test)
            times.append(time * 1e3)
        plt.plot(p_space, times, label=label)
    plt.xlabel(r'Coverage ($\frac{|\mathrm{mask}|}{|\mathrm{a}|}$)')
    plt.ylabel('Time (ms)')
    plt.title('Comparison of 1-D Intersection Methods for $n = {}$ samples'.format(n_samples))
    plt.legend()
    plt.show()

Which produced the following results:

enter image description here

So, binary mask, is, without a doubt, the fastest method of these 4 for any size of the mask.

My question is, is there a faster way?

like image 805
michaelsnowden Avatar asked Oct 30 '22 10:10

michaelsnowden


1 Answers

So, binary mask, is, without a doubt, the fastest method of these 4 for any size of the mask.

My question is, is there a faster way?

I totally agree that binary mask method is the fastest one. I also don't think there could be any better ways in terms of computation complexity to do what you need.

Let me analyse your method time results:

  1. Method running time is T = O(|a|*|mask|) time. Every element of a is checked to be present in mask by iterating over every its element. It gives O(|mask|) time per element in the worst case when element is missing in mask. |a| does not change, consider it a constant.
    |mask| = coverage * |a|
    T = O(|a|2 * coverage)
    Hence a linear dependency of coverage in plot. Note that running time has quadratic dependency of |a|. If |mask| ≤ |a| and |a| = n then T = O(n2)

  2. Second method is using set. Set is a data-structure that performs operations of insertion/lookup in O(log(n)), where n is a number of elements in the set. s = set(mask) takes O(|mask|*log(|mask|)) to complete because there are |mask| insertion operations.

    x in s is a lookup operation. So second row runs in O(|a|*log(|mask|))

    Overall time complexity is O(|mask|*log(|mask|) + |a|*log(|mask|)). If |mask| ≤ |a| and |a| = n then T = O(n*log(n)). You probably observe f(x) = log(x) dependency on plot.

  3. in1d runs in O(|mask|*log(|mask|) + |a|*log(|mask|)) as well. Same T = O(n*log(n)) complexity and f(x) = log(x) dependency on plot.

  4. Time complexity is O(|a| + |mask|) which is T = O(n) and its the best. You observe constant dependency on plot. Algorithm simply iterates over a and mask arrays couple of times.

The thing is that if you have to output n items you will already have T = O(n) complexity. So this method 4 algorithm is optimal.

P.S. In order to observe mentioned f(n) dependencies you'd better vary |a| and let |mask| = 0.9*|a|.

EDIT: Looks like python set indeed performs lookup/insert in O(1) using hash table.

like image 150
Ivan Gritsenko Avatar answered Nov 15 '22 06:11

Ivan Gritsenko