Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get random element from C# HashSet quickly

I need to store a set of elements. What I need is functionality to

  1. remove (single) elements and
  2. add (sets of) elements and
  3. each object should only be in the set once and
  4. get a random element from the set

I chose the HashSet (C#) since it sports fast methods for removing elements (hashSet.remove(element)), adding sets (hashSet.UnionWith(anotherHashSet)) and the nature of a HashSet guarantees that there are not duplicates, so requirements 1 to 3 are taken care of.

The only way I found to get a random element is

Object object = hashSet.ElementAt(rnd.Next(hashSet.Count));

But this is very slow, since I call it once for every pixel of my map (creating a random flood fill from multiple starting points; mapsize 500x500 at the moment but I'd like to go bigger) and the hashset holds rather many items. (A quick test shows it blows up to 5752 entries before shrinking again.)

Profiling (CPU sampling) tells me my ElementAt calls take over 50%.

I realize 500x500 operations over a big hashset is no easy task, but other operations (Remove and UnionWith) are called as often as ElementAt, so the main problem seems to be the operation and not the number of calls.

I vaguely understand why getting a certain element from a HashSet is very expensive (when compared to getting it from a list or another ordered data structure, but I just want a random pick. Can it really be so hard and is there no way around it? Is there a better data structure for my purpose?

Changing everything to Lists doesn't help because now other methods become bottlenecks and it takes even longer.

Casting the HashSet to an array and pick my random element from there expectedly doesn't help because while picking a random element from an array is quick, casting the hashset to the array in the first place takes longer than running hashSet.ElementAt by itself.

If you want to understand better what I am trying to do: A link to my question and the answer.

like image 829
Christian Geese Avatar asked May 27 '15 12:05

Christian Geese


2 Answers

I think that OrderedDictionary might suit your purposes:

var dict = new OrderedDictionary();

dict.Add("My String Key", "My String");
dict.Add(12345, 54321);

Console.WriteLine(dict[0]); // Prints "My String"
Console.WriteLine(dict[1]); // Prints 54321

Console.WriteLine(dict["My String Key"]); // Prints "My String"
Console.WriteLine(dict[(object)12345]);   // Prints 54321 (note the need to cast!)

This has fast add and remove, and O(1) indexing. It only works with object keys and values though - there's no generic version.

[EDIT] Many years later: We now have the strongly-typed generic SortedDictionary<TKey, TValue> which might be better.

like image 57
Matthew Watson Avatar answered Sep 18 '22 09:09

Matthew Watson


The basic problem is the indexing.

In an array or a list, the data is indexed by its coördinate - usually just a simple int index. In a HashSet, you pick the index yourself - the key. The side-effect is, though, that there is no "coördinate" - the question "element at index 3" doesn't make sense, really. The way it's actually implemented is that the whole HashSet is enumerated, item after item, and the n-th item is returned. This means that to get the 1000th item, you have to enumerate all the 999 items before that as well. This hurts.

The best way to solve this would be to pick the random based on an actual key of the HashSet. Of course, this only works if it's reasonable to pick random keys just like that.

If you can't pick the key at random in a satisfactory way, you'll probably want to keep two separate lists - whenever you add a new item to a HashSet, add its key to a List<TKey>; you can then easily pick a random key from the List, and follow it. Depending on your requirements, duplicates may not be much of a problem.

And of course, you could save on the ElementAt enumerations if you only do the enumeration once - for example, before searching the HashSet, you could convert it to List. This only makes sense if you're picking multiple random indices at once, of course (e.g. if you pick 5 indices at random at once, you'll save about 1/5th of the time on average) - if you're always picking one, then modifying the HashSet and picking another, it's not going to help.

Depending on your exact use case, it might also be worth having a look at SortedSet. It works in a similar way to HashSet, but it maintains order in the keys. The helpful part is that you can use the GetViewBetween method to get a whole range of keys - you could use this quite effectively if your keys are sparse, but well balanced between arbitrary ranges. You'd just first pick a range at random, then get the items in range with GetViewBetween, and pick a random one out of those as well. In effect, this will allow you to partition the search results, and should save quite a bit of time.

like image 38
Luaan Avatar answered Sep 20 '22 09:09

Luaan