Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).

Currently, I am adding all items of one set to the other with:

setOne ++= setTwo

This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).

Any ideas how to speed things up?

like image 539
Alexandros Avatar asked Aug 03 '11 11:08

Alexandros


2 Answers

You can get slightly better performance with Parallel Collections API in Scala 2.9.0+:

setOne.par ++ setTwo

or

(setOne.par /: setTwo)(_ + _)
like image 182
Vasil Remeniuk Avatar answered Nov 20 '22 18:11

Vasil Remeniuk


There are a few things you might wanna try:

  • Use the sizeHint method to keep your sets at the expected size.
  • Call useSizeMap(true) on it to get better hash table resizing.

It seems to me that the latter option gives better results, though both show improvements on tests here.

like image 45
Daniel C. Sobral Avatar answered Nov 20 '22 18:11

Daniel C. Sobral