Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The fastest way to do a collection subtraction

I have two Sets. Set b is the subset of Set a. they're both very huge Sets. I want to subtract b from a , what's the best practice to do this common operation ? I've written to many codes like this , and I don't think it's efficient. what's your idea ?

pseudo code : (this is not Java API) .

for(int i = 0 ; i < a.size(); i++) {
          for (int j=0 ; j < b.size() ;j++) {
              // do comparison , if found equals ,remove from a
              break;
          }
 }

And I want to find an algorithm , not only applies to Sets, also works for Array.

EDIT: The Set here is not JAVA API , it's a data structure . so I don't care if Java API has a removeAll() method , I want find a common solution for this problem , I've encounter a lot of problems like this when I use Javascript and Actionscript .

like image 724
Sawyer Avatar asked Mar 08 '10 12:03

Sawyer


1 Answers

I don't think you will get it much faster, but your code will look simpler and will not become slower by a.removeAll(b);. removeAll() is part of the Java-API.

For efficiency analysis: Your given code example is O(n^2), that scales not very good, but also isn't the horriblest thing on earth (exponential complexity is the thing you don't want). As long as you don't know the internal organisation of the data in the Collection, you will not get a better performance. removeAll() is implemented by the class itself and knows about the internal organisation. So if the data is organized in a Hash, you may get better results, if the data is organized in an unsorted array, the complexity will be the same. A Set have to efficently lookup if a new item is already in the set, so I suspect some sort of Hash as internal representation, especially if the implementation is called HashSet. :-)

EDIT: The OP changed it's question to mention it is not only for Java. removeAll() is a Java-API, so this (or something similar) may not be available in other languages. As said before, if the collections are unsorted arrays with no other restrictions the two for-loops are already the fastest solution. But if the data is organized different you have faster options. If the two collections are sorted data (in my example comes the smallest element first), you can do the following (reducing the complexity to O(n)):

int bIndex = 0;
for(int i = 0 ; i < a.size(); i++) {
          while (a[i] < b[bIndex]) {bIndex++;}
          if (a[i] == b[bIndex]) {markForRemoval(a[i]);} // I mark this only for removal, as the actual removal would make your index incorrect
}

If the data is organized as a hash in both collections you also need only one for-loop, accessing directly the element in b. Other possible organizations of data are possible.

like image 147
Mnementh Avatar answered Oct 11 '22 03:10

Mnementh