Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimal data structure for a special dictionary

Which data structure is best in terms of computational complexity to implement a dictionary of (key,val) items, which must support only following commands:

  1. Insert(key) - appends an item (key,val) with val=1
  2. Increment(key) - increments val of existed (key,val)
  3. Find(key) - returns a val of (key,val)
  4. Select(part_of_key) - returns a list of all items (key,val) if strstr(key,part_of_key)!=NULL in the form of a new dictionary of the same type (without allocating new memory if possible); for example if dictionary is {(red,3), (blue,4), (green,1)}, then Select(re)={(red,3), (green,1)}
  5. Max(i) - returns an item which has the i-th maximal value among all items; for example if dictionary is {(red,3), (blue,4), (green,1)}, then Max(1)=blue, Max(2)=red, Max(3)=green

The keys are strings and the values are positive integers. The number of items in the dictionary is expected to be very large.

I think it must be a synthesis of two different data structures. But should it be a hash table + a binary tree or a trie + sorted array or something else?

like image 647
psihodelia Avatar asked Sep 23 '11 11:09

psihodelia


1 Answers

A combination of balanced tree (such as red-black tree) and suffix tree (or suffix array).

  • Balanced tree: operation 1, 2 (implemented as remove + insert), 3 and 5.
  • Suffix tree: operation 4.

NOTE: Hash table will not be able to support operation 5 efficiently.

I think you'll have a hard time implementing the suffix tree. You could possibly use Mark Nelson's C++ implementation of Ukkonen's algorithm, but it has memory leaks and is essentially a singleton, so you'll need to clean it up before being ready for production use. Even after you fix it, you'll need to adjust it so it works with your "other" data structure (which is balanced tree in my proposal) instead of one big plain string.

If you do operation 1 more frequently than operation 4 and/or you can live with linear operation 4, I recommend you skip the whole complication with the suffix tree and just traverse your data structure linearly.

like image 65
Branko Dimitrijevic Avatar answered Nov 15 '22 23:11

Branko Dimitrijevic