
Huge in-memory set of data. Need a fast search by integer Id property

I have a huge in-memory set (~100K records) of plain CLR objects of a defined type. The type has a public property int Id {get; set;}. What is the best .NET structure to hold this data so that any item can be accessed quickly by its Id? More specifically, the data is searched by Id inside a loop, so the search should be as fast as possible. The search might look like this:

// Find by id
var entity = entities.First(e => e.Id == id);

IEnumerable-based structures such as collections and lists go through every element until the sought element is found. What are the alternatives? I believe there should be a way to search an array sorted by Id, like an index search in a database.
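For reference, the sorted-array idea could be sketched as a hand-written binary search, which needs O(log n) comparisons per lookup. This is only an illustration: the Entity class and the FindById helper are hypothetical names, and the array is assumed to be sorted by Id with no duplicate Ids.

```csharp
using System;
using System.Linq;

// Hypothetical entity type; only the Id property comes from the question.
public class Entity
{
    public int Id { get; set; }
}

public static class SortedArraySearch
{
    // Binary search over an array that is already sorted by Id.
    // Returns null when no element has the requested id.
    public static Entity FindById(Entity[] sortedById, int id)
    {
        int lo = 0, hi = sortedById.Length - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;   // avoids overflow of (lo + hi)
            int midId = sortedById[mid].Id;
            if (midId == id) return sortedById[mid];
            if (midId < id) lo = mid + 1;   // target is in the upper half
            else hi = mid - 1;              // target is in the lower half
        }
        return null;
    }

    public static void Main()
    {
        // Even ids only, deliberately built in descending order, then sorted once.
        var entities = Enumerable.Range(0, 1000)
            .Select(i => new Entity { Id = (999 - i) * 2 })
            .ToArray();
        Array.Sort(entities, (a, b) => a.Id.CompareTo(b.Id));

        Console.WriteLine(FindById(entities, 500) != null); // True
        Console.WriteLine(FindById(entities, 501) == null); // True
    }
}
```

The array has to be re-sorted (or kept sorted) whenever items are added, which is one reason a hash-based structure is usually simpler for pure lookups by key.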

Thanks

Results of testing: FYI, Dictionary is not just fast, it's incomparable. My small test showed a performance gain from around 3000+ ms (calling First() on an IEnumerable) to 0 ms ([index] on a Dictionary)!
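The test code itself is not shown above. A comparison along these lines might look like the following sketch; the Entity class, the LookupComparison helpers, and the choice of 1,000 lookups over 100,000 items are all assumptions made for illustration, and the timings will vary by machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical entity type; only the Id property comes from the question.
public class Entity
{
    public int Id { get; set; }
}

public static class LookupComparison
{
    public static List<Entity> MakeEntities(int count) =>
        Enumerable.Range(0, count).Select(i => new Entity { Id = i }).ToList();

    // Linear scan: First() walks the list until the predicate matches.
    public static Entity FindLinear(List<Entity> entities, int id) =>
        entities.First(e => e.Id == id);

    // Hash lookup: the dictionary has to be built once, up front.
    public static Entity FindIndexed(Dictionary<int, Entity> index, int id) =>
        index[id];

    public static void Main()
    {
        var entities = MakeEntities(100_000);
        var index = entities.ToDictionary(e => e.Id);
        // 1,000 ids spread across the whole range (0, 100, 200, ...).
        var ids = Enumerable.Range(0, 1_000).Select(i => i * 100).ToArray();

        var sw = Stopwatch.StartNew();
        foreach (var id in ids) FindLinear(entities, id);
        Console.WriteLine($"First():    {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        foreach (var id in ids) FindIndexed(index, id);
        Console.WriteLine($"Dictionary: {sw.ElapsedMilliseconds} ms");
    }
}
```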

YMC asked Aug 12 '11 16:08


4 Answers

I would go with a Dictionary<TKey, TValue>:

var index = new System.Collections.Generic.Dictionary<int, T>();

where T is the type of objects that you want to access.

This is implemented as a hash table, i.e. looking up an item is done by computing the key's hash value (which is usually a very quick operation) and using that hash value as an index into a table. It's perhaps a bit of an oversimplification, but with a dictionary it almost doesn't matter how many entries you've stored: access time should stay approximately constant.

  • To add an entry, do index.Add(entity.Id, entity);

  • To check whether an item is in the collection, index.ContainsKey(id).

  • To retrieve an item by ID, index[id].
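Put together, the three operations above might be used as in this minimal sketch. The Entity class and the BuildIndex and FindOrNull helpers are hypothetical names introduced for illustration; TryGetValue is shown as well because index[id] throws KeyNotFoundException when the key is missing.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity type standing in for T.
public class Entity
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class IndexDemo
{
    public static Dictionary<int, Entity> BuildIndex(IEnumerable<Entity> entities)
    {
        var index = new Dictionary<int, Entity>();
        foreach (var entity in entities)
        {
            // Add throws ArgumentException if the Id is already present.
            index.Add(entity.Id, entity);
        }
        return index;
    }

    // Returns the entity, or null when the id is absent.
    public static Entity FindOrNull(Dictionary<int, Entity> index, int id)
    {
        // TryGetValue checks and retrieves in a single hash lookup.
        return index.TryGetValue(id, out var entity) ? entity : null;
    }

    public static void Main()
    {
        var index = BuildIndex(new[]
        {
            new Entity { Id = 1, Name = "first" },
            new Entity { Id = 42, Name = "answer" },
        });

        Console.WriteLine(index.ContainsKey(42));        // True
        Console.WriteLine(index[42].Name);               // answer
        Console.WriteLine(FindOrNull(index, 7) == null); // True
    }
}
```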

stakx - no longer contributing answered Nov 15 '22 22:11


Dictionary<TKey, TValue>, where TKey is int and TValue is YourEntity.

Example

var dictionary = new Dictionary<int, YourEntity>();
dictionary.Add(obj1.Id, obj1);
// continue adding the remaining objects

Or, if you have a collection of objects, you can create the dictionary using a query:

var dictionary = list.ToDictionary(obj => obj.Id, obj => obj);

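Note: key values must be unique; ToDictionary throws an ArgumentException on a duplicate key. If you have a non-unique collection, filter duplicates first (perhaps by calling Distinct() before creating the dictionary; note that Distinct() compares whole objects, so it only removes duplicate Ids if the type's equality is based on Id). Alternatively, if you're looping over the collection to create the dictionary manually, check the ContainsKey method before attempting an Add operation.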
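Both ways of handling duplicate Ids could look roughly like this. It is a sketch under assumptions: the Entity class and the helper names are hypothetical, GroupBy is swapped in for Distinct() so that no custom equality is needed, and "keep the first object seen for each Id" is an arbitrary policy chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity type; Name is only there to tell duplicates apart.
public class Entity
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class DedupDemo
{
    // Query version: group by Id, keep the first object of each group.
    public static Dictionary<int, Entity> IndexFirstPerId(IEnumerable<Entity> list) =>
        list.GroupBy(obj => obj.Id)
            .ToDictionary(g => g.Key, g => g.First());

    // Manual version: skip any object whose Id is already indexed.
    public static Dictionary<int, Entity> IndexManually(IEnumerable<Entity> list)
    {
        var dictionary = new Dictionary<int, Entity>();
        foreach (var obj in list)
        {
            if (!dictionary.ContainsKey(obj.Id))
            {
                dictionary.Add(obj.Id, obj);
            }
        }
        return dictionary;
    }

    public static void Main()
    {
        var list = new List<Entity>
        {
            new Entity { Id = 1, Name = "a" },
            new Entity { Id = 1, Name = "b" }, // duplicate Id
            new Entity { Id = 2, Name = "c" },
        };

        var byQuery = IndexFirstPerId(list);
        var byLoop = IndexManually(list);

        Console.WriteLine(byQuery.Count);   // 2
        Console.WriteLine(byQuery[1].Name); // a
        Console.WriteLine(byLoop[1].Name);  // a
    }
}
```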

Anthony Pegram answered Nov 15 '22 23:11


Generally, an in-memory seek is best done with a Dictionary:

System.Collections.Generic.Dictionary<TKey, TValue>

When your data set no longer fits in memory, you would instead use a disk-based B-tree.

csharptest.net answered Nov 15 '22 22:11


Based on the information given, a Hashtable is probably going to be the fastest. The Dictionary<TKey, TValue> class is going to give you the best trade-off between ease of use and performance. If you truly need maximum performance, I would try all of the following classes. They all perform differently in terms of memory usage, insert speed, and search speed:

  • ListDictionary
  • Hashtable
  • Dictionary
  • SortedDictionary
  • ConcurrentDictionary

In addition to performance, you may be concerned with multithreaded access. These two collections provide thread safety:

  • Hashtable (multiple reads, only one thread allowed to write)
  • ConcurrentDictionary
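
If several threads do need to fill or query the index at once, ConcurrentDictionary could be used roughly as follows. This is a sketch: the Entity class and the BuildInParallel helper are hypothetical, and populating the index with Parallel.For is just one way to demonstrate concurrent writers.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical entity type; only the Id property comes from the question.
public class Entity
{
    public int Id { get; set; }
}

public static class ConcurrentIndexDemo
{
    public static ConcurrentDictionary<int, Entity> BuildInParallel(int count)
    {
        var index = new ConcurrentDictionary<int, Entity>();
        // Several threads add entries at the same time. TryAdd returns
        // false for a key that is already present instead of throwing.
        Parallel.For(0, count, i =>
        {
            index.TryAdd(i, new Entity { Id = i });
        });
        return index;
    }

    public static void Main()
    {
        var index = BuildInParallel(100_000);
        Console.WriteLine(index.Count); // 100000

        // Reads are safe while other threads write.
        if (index.TryGetValue(12345, out var entity))
        {
            Console.WriteLine(entity.Id); // 12345
        }
    }
}
```

If the index is built once and only read afterwards, a plain Dictionary may be sufficient, since concurrent reads without any writer are supported by that class.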
Charles Lambert answered Nov 15 '22 21:11