I have two very large ArrayList
, each containing millions of data. I want to filter out data from List1
which is not present in List2
and / or vice-versa.
I've tried Apache CollectionUtils, Java 8 stream API without any success.
Java 8 parallel streaming is consuming all the CPU and CollectionUtils keeps on comparing data set without any output.
POJO Sample
public DataVO {
private String id;
private String value;
...
// getters / setters
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = (prime * result) + ((id == null) ? 0 : id.hashCode());
return result;
}
@Override
public boolean equals(final Object obj) {
...
...
final DataVO other = (DataVO) obj;
if (id == null) {
if (other.id != null) {
return false;
}
}
else if (!id.equals(other.id)) {
return false;
}
return true;
}
}
hashCode() / equals() can have more fields, for now I've kept it simple.
I also tried spliting List1
into smaller chunks and then tried comparing against List2
without any results. I've looked at other questions but none of them have consider extremely large volume.
Please let me know if you have any pointers.
You could read big chunks of the ArrayList
into a HashSet
, say by 10k elements. Make sure you set the size on the HashSet
constructor. Then for each chunk call HashSet#RemoveAll
with the other ArrayList
. The remaining entries are your answer. Might even parallelise with a ThreadPoolExecutor
.
List missing = new ArrayList(); // answer
for (int i = 0; i < list1.size(); ) {
int offset = i;
i += 16 * 1024;
if (i > list1.size()) i = list1.size();
Set chunk = new HashSet(list1.subList(offset, i));
for (int j = list2.size(); --j >= 0; chunk.remove(list2.get(j));
missing.addAll(chunk);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With