Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast random access to a collection

I'm consuming a stream of semi-random tokens. For each token, I'm maintaining a lot of data (including some sub-collections).

The number of unique tokens is unbounded but in practice tends to be on the order of 100,000-300,000.

I started with a list and identified the appropriate token object to update using a Linq query.

public class Model {
    public List<State> States { get; set; }
    ...
}

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

Over the first ~30k unique tokens, I was able to find and update ~1,100 tokens/sec.

Performance analysis shows that 85% of the total Cpu cycles are being spent on the Where(...).SingleOrDefault() (which makes sense, lists are inefficient way to search).

So, I switched the list over to a HashSet and profiled again, confident that HashSet would be able to random seek faster. This time, I was only processing ~900 tokens/sec. And a near-identical amount of time was spent on the Linq (89%).

So... First up, am I misusing the HashSet? (Is using Linq is forcing a conversion to IEnumerable and then an enumeration / something similar?)

If not, what's the best pattern to implement myself? I was under the impression that HashSet already does a Binary seek so I assume I'd need to build some sort of tree structure and have smaller sub-sets?

To answer some questions form comments... The condition is unique (if I get the same token twice, I want to update the same entry), the HashSet is the stock .Net implementation (System.Collections.Generic.HashSet<T>).

A wider view of the code is...

        var state = new RollingList(model.StateDepth); // Tracks last n items and drops older ones. (Basically an array and an index that wraps around
        var tokens = tokeniser.Tokenise(contents); // Iterator
        foreach (var token in tokens) {
            var stateText = StateToString(ref state);
            var match = model.States.Where(x => x.Condition == stateText).FirstOrDefault();
            // ... update the match as appropriate for the token
        }
like image 327
Basic Avatar asked Mar 14 '23 04:03

Basic


2 Answers

var match = model.States.Where(x => x.Condition == stateText).SingleOrDefault();

If you're doing that exact same thing with a hash set, that's no savings. Hash sets are optimized for quickly answering the question "is this member in the set?" not "is there a member that makes this predicate true in the set?" The latter is linear time whether it is a hash set or a list.

Possible data structures that meet your needs:

  • Make a dictionary mapping from text to state, and then do a search in the dictionary on the text key to get the resulting state. That's O(1) for searching and inserting in theory; in practice it depends on the quality of the hash.

  • Make a sorted dictionary mapping from text to state. Again, search on text. Sorted dictionaries keep the keys sorted in a balanced tree, so that's O(log n) for searching and inserting.

like image 91
Eric Lippert Avatar answered Mar 23 '23 16:03

Eric Lippert


30k is not that much so if state is unique you can do something like this. Dictionary access is much faster.

var statesDic = model.States.ToDictionary(x => x.Condition, x => x);
var match = statesDic.ConstainsKey(stateText) ? statesDic[stateText] : default(State);

Quoting MSDN:

The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.

You can find more info about Dictionaries here. Be also aware that Dictionaries use memory space to improve performance, you can do a quick test for 300k items and see what kind of space I'm talking about like this:

var memoryBeforeDic = GC.GetTotalMemory(true);
var dic = new Dictionary<string,object>(300000);
var memoryAfterDic = GC.GetTotalMemory(true);
Console.WriteLine("Memory: {0}", memoryAfterDic - memoryBeforeDic);
like image 45
Diogo Cunha Avatar answered Mar 23 '23 17:03

Diogo Cunha