Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find Value that Partitions two Numpy Arrays Equally

I have two arrays (x1 and x2) of equal length that have overlapping ranges of values.

I need to find a value q such that l1-l2 is minimized, and

l1 = x1[np.where(x1 > q)].shape[0]
l2 = x2[np.where(x2 < q)].shape[0]

I need this to be reasonably high-performance since the arrays can be large. A solution using native numpy routines would be preferred.

like image 340
xvtk Avatar asked Mar 12 '14 14:03

xvtk


3 Answers

There may be a smarter way to look for a value, but you can do an exhaustive search as follows:

>>> x1 = np.random.rand(10)
>>> x2 = np.random.rand(10)
>>> x1.sort()
>>> x2.sort()
>>> x1
array([ 0.12568451,  0.30256769,  0.33478133,  0.41973331,  0.46493576,
        0.52173197,  0.72289189,  0.72834444,  0.78662283,  0.78796277])
>>> x2
array([ 0.05513774,  0.21567893,  0.29953634,  0.37426842,  0.40000622,
        0.54602497,  0.7225469 ,  0.80116148,  0.82542633,  0.86736597])

We can compute l1 if q is one of the items in x1 as:

>>> l1_x1 = len(x1) - np.arange(len(x1)) - 1
>>> l1_x1
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

And l2 for the same q as:

>>> l2_x1 = np.searchsorted(x1, x2)
>>> l2_x1
array([ 0,  1,  1,  3,  3,  6,  6, 10, 10, 10], dtype=int64)

You can similarly get values for l1 and l2 when q is in x2:

>>> l2_x2 = np.arange(len(x2))
>>> l2_x2
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> l1_x2 = len(x1) - np.searchsorted(x1, x2, side='right')
>>> l1_x2
array([10,  9,  9,  7,  7,  4,  4,  0,  0,  0], dtype=int64)

And you then simply check for the minimum of l1 - l2:

>>> np.concatenate((l1_x1 - l2_x1, l1_x2 - l2_x2))
array([  9,   7,   6,   3,   2,  -2,  -3,  -8,  -9, -10,  10,   8,   7,
         4,   3,  -1,  -2,  -7,  -8,  -9], dtype=int64)
>>> q_idx = np.argmin(np.abs(np.concatenate((l1_x1 - l2_x1, l1_x2 - l2_x2))))
>>> q = x1[q_idx]  if q_idx < len(x1) else x2[q_idx - len(x1)]
>>> q
0.54602497466094291
>>> x1[x1 > q].shape[0]
4L
>>> x2[x2 < q].shape[0]
5L
like image 146
Jaime Avatar answered Nov 17 '22 12:11

Jaime


I think I may have found a fairly simple way to do it.

x1 = (50 - 10) * np.random.random(10000) + 10
x2 = (75 - 25) * np.random.random(10000) + 25

x1.sort()
x2.sort()
x2 = x2[::-1] # reverse the array

# The overlap point should fall where the difference is smallest
diff = np.abs(x1 - x2)

# get the index of where the minimum occurs
loc = np.where(diff == np.min(diff))

q1 = x1[loc]    # 38.79087351
q2 = x2[loc]    # 38.79110941

M4rtini's solution produces q = 38.7867527.

like image 24
xvtk Avatar answered Nov 17 '22 14:11

xvtk


This is fundamentally an interval problem so you might want to do some reading on Interval trees, but you don't need to understand interval trees to solve this problem.

If you think of every (x1[i], x2[i]) as being an interval, you're looking for the value q which splits the intervals into two groups as evenly as possible ignoring intervals that overlap q. Lets take the easy case first:

from numpy import array
x1 = array([19, 32, 47, 13, 56,  1, 87, 48])
x2 = array([44, 38, 50, 39, 85, 26, 92, 64])
x1sort = np.sort(x1)
x2sort = np.sort(x2)[::-1]
diff = abs(x2sort - x1sort)
mindiff = diff.argmin()
print mindiff, x2sort[mindiff], x1sort[mindiff]
# 4 44 47

enter image description here

@xvtk's solution works well in this case and gives us a range of [44, 47]. Because no intervals overlap the range, all values of q in the range are equivalent and yield an optimal result. Here is an example that is a little more tricky:

x1 = array([12, 65, 46, 81, 71, 77, 37])
x2 = array([ 20,  85,  59, 122, 101,  87,  58])
x1sort = np.sort(x1)
x2sort = np.sort(x2)[::-1]
diff = abs(x2sort - x1sort)
mindiff = diff.argmin()
print mindiff, x2sort[mindiff], x1sort[mindiff], x1sort[mindiff-1]
# 59 71 65

enter image description here

Here the solution gives us a range of [59, 71] but notice that not all values in the range are equivalent. Anything to the left of the green line will produce 3 and 4 intervals on the left and right respectively, while anything to the right of the green line will produce 3 intervals on both sides.

I'm pretty sure that the optimal solution is guaranteed to be in the range produced by @xvtk's solution. It's possible that one of the red lines is guaranteed to be an optimal solution, though I'm not sure on this point. Hope that helps.

like image 24
Bi Rico Avatar answered Nov 17 '22 14:11

Bi Rico