Given an array of integers size N, how can you efficiently find a subset of size K with elements that are closest to each other?
Let the closeness for a subset (x1,x2,x3,..xk) be defined as:
2 <= N <= 10^5
2 <= K <= N
constraints: Array may contain duplicates and is not guaranteed to be sorted.
My brute force solution is very slow for large N, and it doesn't check if there's more than 1 solution:
N = input()
K = input()
assert 2 <= N <= 10**5
assert 2 <= K <= N
a = []
for i in xrange(0, N):
a.append(input())
a.sort()
minimum = sys.maxint
startindex = 0
for i in xrange(0,N-K+1):
last = i + K
tmp = 0
for j in xrange(i, last):
for l in xrange(j+1, last):
tmp += abs(a[j]-a[l])
if(tmp > minimum):
break
if(tmp < minimum):
minimum = tmp
startindex = i #end index = startindex + K?
Examples:
N = 7
K = 3
array = [10,100,300,200,1000,20,30]
result = [10,20,30]
N = 10
K = 4
array = [1,2,3,4,10,20,30,40,100,200]
result = [1,2,3,4]
Your current solution is O(NK^2)
(assuming K > log N
). With some analysis, I believe you can reduce this to O(NK)
.
The closest set of size K will consist of elements that are adjacent in the sorted list. You essentially have to first sort the array, so the subsequent analysis will assume that each sequence of K
numbers is sorted, which allows the double sum to be simplified.
Assuming that the array is sorted such that x[j] >= x[i]
when j > i
, we can rewrite your closeness metric to eliminate the absolute value:
Next we rewrite your notation into a double summation with simple bounds:
Notice that we can rewrite the inner distance between x[i]
and x[j]
as a third summation:
where I've used d[l]
to simplify the notation going forward:
Notice that d[l]
is the distance between each adjacent element in the list. Look at the structure of the inner two summations for a fixed i
:
j=i+1 d[i]
j=i+2 d[i] + d[i+1]
j=i+3 d[i] + d[i+1] + d[i+2]
...
j=K=i+(K-i) d[i] + d[i+1] + d[i+2] + ... + d[K-1]
Notice the triangular structure of the inner two summations. This allows us to rewrite the inner two summations as a single summation in terms of the distances of adjacent terms:
total: (K-i)*d[i] + (K-i-1)*d[i+1] + ... + 2*d[K-2] + 1*d[K-1]
which reduces the total sum to:
Now we can look at the structure of this double summation:
i=1 (K-1)*d[1] + (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=2 (K-2)*d[2] + (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
i=3 (K-3)*d[3] + ... + 2*d[K-2] + d[K-1]
...
i=K-2 2*d[K-2] + d[K-1]
i=K-1 d[K-1]
Again, notice the triangular pattern. The total sum then becomes:
1*(K-1)*d[1] + 2*(K-2)*d[2] + 3*(K-3)*d[3] + ... + (K-2)*2*d[K-2]
+ (K-1)*1*d[K-1]
Or, written as a single summation:
This compact single summation of adjacent differences is the basis for a more efficient algorithm:
O(N log N)
O(N)
N-K
sequence of differences and calculate the above sum, order O(NK)
Note that the second and third step could be combined, although with Python your mileage may vary.
The code:
def closeness(diff,K):
acc = 0.0
for (i,v) in enumerate(diff):
acc += (i+1)*(K-(i+1))*v
return acc
def closest(a,K):
a.sort()
N = len(a)
diff = [ a[i+1] - a[i] for i in xrange(N-1) ]
min_ind = 0
min_val = closeness(diff[0:K-1],K)
for ind in xrange(1,N-K+1):
cl = closeness(diff[ind:ind+K-1],K)
if cl < min_val:
min_ind = ind
min_val = cl
return a[min_ind:min_ind+K]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With