Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimal storage for string/integer pairs with fast lookup of strings?

I need to maintain correspondence between strings and integers, then lookup the string value and return the integer. What's the best structure to store this info that meets the following requirements:

  • Speed and memory size are important, in that order.

  • I don't want to reinvent the wheel and write my own sorting routine. A call to Sort(CompareFunction) is fine of course.

Conditions:

  • The integers are not guaranteed to be sequential, neither is there a 'start value' like 0 or 1

  • Number of data pairs can vary from 100 to 100000

  • The data are all read in at the beginning, there's no subsequent additions/deletions/modifications

  • FWIW the strings are the hex entry ID's that Outlook (MAPI?) uses to identify entries. Example: 00000000FE42AA0A18C71A10E8850B651C24000003000000040000000000000018000000000000001E7FDF4152B0E944BA66DFBF2C6A6416E4F52000487F22

There's so many options (TStringList (with objects or name/value pairs), TObjectList, TDictionary, ...) that I'd better ask for advice first...

I have read How can I search faster for name/value pairs in a Delphi TStringList? which suggest TDictionary for string/string pairs, and Sorting multidimensional array in Delphi 2007 which suggest TStringlist objects for string/integer but where sorting is done on the integers.

like image 586
Jan Doggen Avatar asked Mar 25 '23 05:03

Jan Doggen


1 Answers

The second link that you include in the question is not applicable. That is a question concerning sorting rather than efficient lookup. Although you discuss sorting a number of times in your question, you do not have a requirement to sort. Your requirement is simply a dictionary, also known as an associative array. Of course, you can implement that by sorting an array and using binary search for your lookup, but sorting is not a requirement. You simply need an efficient dictionary.

Out of the box, the most efficient and convenient data structure for your problem is TDictionary<string, Integer>. This has lookup complexity of O(1) and so scales well for large collections. For smaller collections a binary search based lookup with lookup complexity of O(log n) can be competitive and can indeed out-perform a dictionary.

Cosmin Prund wrote an excellent answer here on SO where he compared the performance of dictionary lookup against binary search based lookup. I recommend you have a read. I would say that for small containers, performance is probably not that big a problem for you. So even though binary search may be quicker, it probably does not matter because your performance is good either way. But performance probably becomes an issue for larger containers and that's where the dictionary is always stronger. For large enough containers, the performance of binary search may become unacceptable.

I'm sure that it is possible to produce more efficient implementations of dictionaries than the Embarcadero one, but I'd also say that the Embarcadero implementation is perfectly solid. It uses a decent hash function and does not have any glaring weaknesses.

In terms of memory complexity, there's little to choose between a dictionary and a sorted array. It's not possible to improve on a sorted array for memory use.

I suggest that you start with TDictionary<string, Integer> and only look beyond that if your performance requirements are not met.

like image 72
David Heffernan Avatar answered Mar 29 '23 14:03

David Heffernan