I have millions of fixed-size (100) int arrays. Each array is sorted and has unique elements. For each array, I want to find all arrays which have 70% common elements. Right now I am getting around 1 million comparisons (using Arrays.binarySearch()) per second, which is too slow for us.
Can anyone recommend a better searching algorithm?
Something like this should do the job (provided that the arrays are sorted and contain unique elements):
public static boolean atLeastNMatchingElements(final int n,
final int[] arr1,
final int[] arr2){
/* check assumptions */
assert (arr1.length == arr2.length);
final int arrLength = arr2.length;
{ /* optimization */
final int maxOffset = Math.max(arrLength - n, 0);
if(arr1[maxOffset] < arr2[0] || arr2[maxOffset] < arr1[0]){
return false;
}
}
int arr2Offset = 0;
int matches = 0;
/* declare variables only once, outside loop */
int compResult; int candidate;
for(int i = 0; i < arrLength; i++){
candidate = arr1[i];
while(arr2Offset < arrLength - 1){
compResult = arr2[arr2Offset] - candidate;
if(compResult > 0){
break;
} else{
arr2Offset++;
if(compResult == 0){
matches++;
break;
}
}
}
if(matches == n){
return true;
}
/* optimization */
else if(matches < n - (arrLength - arr2Offset)){
return false;
}
}
return false;
}
Sample usage:
public static void main(final String[] args){
final int[] arr1 = new int[100];
final int[] arr2 = new int[100];
int x = 0, y = 0;
for(int i = 0; i < 100; i++){
if(i % 10 == 0){
x++;
}
if(i % 12 == 0){
y++;
}
arr1[i] = x;
arr2[i] = y;
x++;
y++;
}
System.out.println(atLeastNMatchingElements(70, arr1, arr2));
System.out.println(atLeastNMatchingElements(95, arr1, arr2));
}
Output:
true
false
I have now tried to optimize the above code. Please check whether the code blocks marked as
/* optimization */
make a noticeable difference. After optimization, I would refactor the code to get it down to one or two return statements.
There are two quick optimisations you can make.
If array A's start element is greater than B's end element, they trivially can't have common elements.
They other one is a triangle inequality-like thing:
f(B,C) <= 100 - |f(A,B)-f(A,C)|
The reason for this is that (assuming f(A,B) > f(A,C)
) there are at least f(A,B) - f(A,C)
elements that are in both A and B but not in C. Which means that they can't be common elements of B and C.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With