Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient algorithm to find number of elements less than a query

I have two unsorted arrays a and b. For every element a[i] I need to find the number of elements b[j] such that b[j] < a[i]. In addition b may contain duplicates which should not be counted. Both arrays may be very large.

I tried (for a single query x)

public static void main(String arg[]) {
    int x = 5;
    int b[] = {2, 4, 3, 4, 11, 13, 17};
    Arrays.sort(b);
    int count = 0;
    for(int i = 0; i < b.length; ++i) {
        if(b[i] < x) {
            if(i == 0)
                ++count;
            else {
                // this check avoids counting duplicates
                if(b[i - 1] != b[i])
                    ++count;
            }
        } else {
            break;
        }
    }
    System.out.println(count);
}

My problem is that this doesn't perform well enough when querying all elements of a iteratively. What can I do to speed this up?

like image 893
vks Avatar asked Nov 12 '16 10:11

vks


2 Answers

EDIT: given the later comments, some updates which I just put right in the beginning; leaving my first text at the bottom.

So, the core aspects here are:

  1. You came here with some problem X, but further asking told us that you actually had some problem Y to solve. That is something that you should try to avoid: when coming here (or when working on problems on your own!) ... then you should be able to clearly describe the problem you have or intend to solve. I am not fingerpointing here; just expressing that you should work hard to ensure that you understand what your real problem is.
  2. This is also visible from the fact that you are asking us what to do about duplicate numbers in your data. Err, sir: it is your problem. We do not know why you want to count those numbers; we do not know where your data is coming from; and how the final solution should deal with duplicate entries. In that sense, I am just rephrasing the first paragraph: you have to clarify your requirements. We can't help with that part at all. And you see: you only mentioned duplicates in the second array. What about those in the first one?!

OK, so about your problem. Thing is: actually, that is just "work". There is no magic there. As you have two very large arrays, working on unsorted data is an absolute no-go.

So, you start by sorting both arrays.

Then you iterate over the first array and while doing that, you also look into the second array:

int indexWithinB = 0;
int counterForCurrentA = 0; // and actually ALL values from a before
for (int i=0; i<a.length; i++) {
  int currentA = a[i];     
  while (b[indexWithinB] < currentA) {
    if (indexWithinB > 0) { // check required to avoid using 0-1
      if (b[indexWithinB-1] != b[indexWithinB] { // avoid counting duplicates!
        counterForCurrentA++;
      }
    }
    indexWithinB++;
  }
  // while loop ended, this means: b[indexWithinB] == or > currentA
  // this also means: counterForCurrentA ... should have the correct value
}

The above is obviously pseudo code. It is meant to keep you going; and it might very well be, that there are subtle errors in there. For example, as Andreas pointed out: the above needs to be enhanced to check for b.length, too. But that is left as exercise to the reader.

That is what I meant with "just work": you simply have to sit down, write testcases and refine my draft algorithm until it does the job for you. If you find it too hard to program this initially, then take a piece of paper, put down two arrays with numbers ... and do that counting manually.

Final hint: I suggest to write plenty of unit tests to test your algorithm (such stuff is perfect for unit tests); and make sure that you have all your corner cases in such tests. You want to be 100% sure that your algorithm is correct before going for your 10^5 element arrays!

And here, as promised, the original answer:

Simply spoken: iterating and counting is the most efficient way to solve this problem. So in your above case, leaving out the sorting might lead to quicker overall execution time.

The logic there is really simple: in order to know the count of numbers smaller than x ... you have to look at all of them. Thus you have to iterate the complete array (when that array is not sorted).

Thus, given your initial statement, there is no other thing than: iterate and count.

Of course, if you have to this counting multiple times ... it might be worth sorting that data initially. Because then you can use binary search, and getting that count you are looking for works without iterating all data.

And: what makes you think that iterating an array with 10^5 elements is a problem? In other words: are you just worried about a potential performance problem, or do you have a real performance problem? You see, at some point you probably had to create and fill that array. That for sure took more time (and resources) than a simple for-loop to count entries. And honestly: unless we are talking some small embedded device ... 10^5 elements ... that is close to nothing, even when using slightly outdated hardware.

Finally: when you are worried about runtime, the simple answer is: slice your input data, and use 2,4, 8, ... threads to count each slice in parallel! But as said: before writing that code, I would do some profiling be sure that you really have to spent precious development time on this. Don't solve hypothetical performance problems; focus on those that really matter to you or your users!

like image 177
GhostCat Avatar answered Sep 21 '22 00:09

GhostCat


Comapring every item in the array with x will take you O(n) time. Sorting the array will take O(n log n), and then you can use binary search, which is O(log n), and you get a total of O(n log n). So the most efficient way is also the trivial one - just loop thru the array and compare every item with x.

public static void main(String arg[] ){
    int b[]={2, 4, 3, 4, 11, 13, 17};
    int x=5;
     int count=0;
     for(int i=0;i<b.length;i++){
         if(b[i]<x){          
             count++;
         }
     }
     System.out.println(count);
}
like image 34
TDG Avatar answered Sep 22 '22 00:09

TDG