Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find subset with K elements that are closest to eachother

Given an array of integers size N, how can you efficiently find a subset of size K with elements that are closest to each other?

Let the closeness for a subset (x1,x2,x3,..xk) be defined as:

enter image description here

2 <= N <= 10^5

2 <= K <= N

constraints: Array may contain duplicates and is not guaranteed to be sorted.

My brute force solution is very slow for large N, and it doesn't check if there's more than 1 solution:

N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = []
for i in xrange(0, N):
    a.append(input())
a.sort()

minimum = sys.maxint
startindex = 0

for i in xrange(0,N-K+1):
    last = i + K
    tmp = 0
    for j in xrange(i, last):
        for l in xrange(j+1, last):
            tmp += abs(a[j]-a[l])
            if(tmp > minimum):
                break

    if(tmp < minimum):
        minimum = tmp
        startindex = i #end index = startindex + K?

Examples:

N = 7
K = 3
array = [10,100,300,200,1000,20,30]
result = [10,20,30]

N = 10
K = 4
array = [1,2,3,4,10,20,30,40,100,200]
result = [1,2,3,4]
like image 370
Nima G Avatar asked Oct 20 '13 20:10

Nima G


1 Answers

Your current solution is O(NK^2) (assuming K > log N). With some analysis, I believe you can reduce this to O(NK).

The closest set of size K will consist of elements that are adjacent in the sorted list. You essentially have to first sort the array, so the subsequent analysis will assume that each sequence of K numbers is sorted, which allows the double sum to be simplified.

Assuming that the array is sorted such that x[j] >= x[i] when j > i, we can rewrite your closeness metric to eliminate the absolute value:

enter image description here

Next we rewrite your notation into a double summation with simple bounds:

enter image description here

Notice that we can rewrite the inner distance between x[i] and x[j] as a third summation:

enter image description here

where I've used d[l] to simplify the notation going forward:

enter image description here

Notice that d[l] is the distance between each adjacent element in the list. Look at the structure of the inner two summations for a fixed i:

j=i+1         d[i]
j=i+2         d[i] + d[i+1]
j=i+3         d[i] + d[i+1] + d[i+2]
...
j=K=i+(K-i)   d[i] + d[i+1] + d[i+2] + ... + d[K-1]

Notice the triangular structure of the inner two summations. This allows us to rewrite the inner two summations as a single summation in terms of the distances of adjacent terms:

total: (K-i)*d[i] + (K-i-1)*d[i+1] + ... + 2*d[K-2] + 1*d[K-1]

which reduces the total sum to:

enter image description here

Now we can look at the structure of this double summation:

i=1     (K-1)*d[1] + (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=2                  (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=3                               (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
...
i=K-2                                                2*d[K-2] + d[K-1]
i=K-1                                                           d[K-1]

Again, notice the triangular pattern. The total sum then becomes:

1*(K-1)*d[1] + 2*(K-2)*d[2] + 3*(K-3)*d[3] + ... + (K-2)*2*d[K-2] 
  + (K-1)*1*d[K-1]

Or, written as a single summation:

enter image description here

This compact single summation of adjacent differences is the basis for a more efficient algorithm:

  1. Sort the array, order O(N log N)
  2. Compute the differences of each adjacent element, order O(N)
  3. Iterate over each N-K sequence of differences and calculate the above sum, order O(NK)

Note that the second and third step could be combined, although with Python your mileage may vary.

The code:

def closeness(diff,K):
  acc = 0.0
  for (i,v) in enumerate(diff):
    acc += (i+1)*(K-(i+1))*v
  return acc

def closest(a,K):
  a.sort()
  N = len(a)
  diff = [ a[i+1] - a[i] for i in xrange(N-1) ]

  min_ind = 0
  min_val = closeness(diff[0:K-1],K)

  for ind in xrange(1,N-K+1):
    cl = closeness(diff[ind:ind+K-1],K)
    if cl < min_val:
      min_ind = ind
      min_val = cl

  return a[min_ind:min_ind+K]
like image 163
sfstewman Avatar answered Oct 29 '22 16:10

sfstewman