Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy duplicates within an index range

The following is an interview question that a friend was asked. We're both stumped. The statement is as follows:

Consider an unsorted list of integers, A. Suppose A has size n. Suppose, further, that the element at index i has value x. We define two terms:

  • A has "a duplicate in index range k" if the value x is at any of A[i-k], A[i-k+1],...,A[i-1],A[i+1],...,A[i+k] (i.e. k is the radius of the search across indicies)

  • A has "a duplicate in value range d" if any element of A lies in the range x-d, x-d+1,...,x,x+1,...,x+d (i.e. d is the radius of the search across values)

In O(n) time and O(k) space, we want to know:

  1. Does A have a duplicate in index range k?

  2. Does A have a duplicate in index range k AND value range d?

The first one is easy: we walk from index 0 to n-1 and keep a set of the past k numbers we've seen. If our current number is already in the set before we add it, we have a duplicate. Otherwise we don't, and we remove the oldest element in the set to make room for the new one.

The second one, I have no idea. I can think of a solution in O(nd) time and O(k) space: keep the same set as we did for the first problem, but, when we hit x, check for every number between x-d and x+d. But I can't imagine what a solution that doesn't have d in either the space or time complexity would look like. What am I missing?

like image 454
Patrick Collins Avatar asked Sep 18 '15 06:09

Patrick Collins


2 Answers

You can do the second part by an extension of your first method.

Instead of storing the raw value in the set, instead use a dictionary and store the value divided by d+1 (mapping to the actual value).

The main point is that we can be sure that no 2 values can lie in the same bin (or they are closer than d apart).

Now when you scan through, you need to check for possible collisions in the bin plus the two neighbouring bins.

So this is O(3n) = O(n) operations and O(k) space.

Example Python Code

def check_for_duplicate(A,d,k):
    D = {} # Stores the previous 2k entries 
    for i,a in enumerate(A):
        b = a // (d+1)
        for delta in [-1,0,1]:
            if b+delta in D and abs(D[b+delta] - a) <= d:
                return True
        old = i - 2 * k
        if old >= 0:
            del D[ A[old] // (d+1) ]
        D[b] = a
    return False

print check_for_duplicate(A=[8,14],d=5,k=2)
like image 112
Peter de Rivaz Avatar answered Oct 10 '22 07:10

Peter de Rivaz


So it occurs to me, you could just do

part1 = false;
part2 = false;
for (index = i - k; index <= i + k, index++) {
  diff = ABS(A[index] - x);
  if (diff == 0) { part1 = true; }
  if (diff < d) { part2 = true; }
}

(not in python, but the algorithm seems sufficient) Let me know if I'm totally missing the point, it's late here.

like image 1
CollinD Avatar answered Oct 10 '22 08:10

CollinD