Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare large lists and extract missing data

I have two very large ArrayList, each containing millions of data. I want to filter out data from List1 which is not present in List2 and / or vice-versa.

I've tried Apache CollectionUtils, Java 8 stream API without any success.

Java 8 parallel streaming is consuming all the CPU and CollectionUtils keeps on comparing data set without any output.

POJO Sample

public DataVO {
 private String id;
 private String value;
 ...
 // getters / setters

 @Override
 public int hashCode() {
  final int prime = 31;
  int result = 1;
  result = (prime * result) + ((id == null) ? 0 : id.hashCode());
  return result;
 }

 @Override
 public boolean equals(final Object obj) {
  ...
  ...
  final DataVO other = (DataVO) obj;
  if (id == null) {
   if (other.id != null) {
    return false;
   }
  }
  else if (!id.equals(other.id)) {
   return false;
  }
  return true;
 }
}

hashCode() / equals() can have more fields, for now I've kept it simple.

I also tried spliting List1 into smaller chunks and then tried comparing against List2 without any results. I've looked at other questions but none of them have consider extremely large volume.

Please let me know if you have any pointers.

like image 412
Tech Sawy Avatar asked Sep 28 '18 15:09

Tech Sawy


1 Answers

You could read big chunks of the ArrayList into a HashSet, say by 10k elements. Make sure you set the size on the HashSet constructor. Then for each chunk call HashSet#RemoveAll with the other ArrayList. The remaining entries are your answer. Might even parallelise with a ThreadPoolExecutor.

List missing = new ArrayList(); // answer

for (int i = 0; i < list1.size(); ) {
    int offset = i;
    i += 16 * 1024;
    if (i > list1.size()) i = list1.size();
    Set chunk = new HashSet(list1.subList(offset, i));

    for (int j = list2.size(); --j >= 0; chunk.remove(list2.get(j));
    missing.addAll(chunk);
}
like image 110
Pascal de Kloe Avatar answered Nov 04 '22 22:11

Pascal de Kloe