 

Compact data structure for storing a large set of integral values

I'm working on an application that needs to pass around large sets of Int32 values. The sets are expected to contain ~1,000,000-50,000,000 items, where each item is a database key in the range 0-50,000,000. I expect distribution of ids in any given set to be effectively random over this range. The operations I need on the set are dirt simple:

  • Add a new value
  • Iterate over all of the values.

There is a serious concern about the memory usage of these sets, so I'm looking for a data structure that can store the ids more efficiently than a simple List<int> or HashSet<int>. I've looked at BitArray, but that can be wasteful depending on how sparse the ids are. I've also considered a bitwise trie, but I'm unsure how to calculate the space efficiency of that solution for the expected data. A Bloom filter would be great, if only I could tolerate the false positives.

I would appreciate any suggestions of data structures suitable for this purpose. I'm interested in both out-of-the-box and custom solutions.

EDIT: To answer your questions:

  • No, the items don't need to be sorted
  • By "pass around" I mean both pass between methods and serialize and send over the wire. I clearly should have mentioned this.
  • There could be a decent number of these sets in memory at once (~100).
Odrade asked Mar 08 '11 23:03

1 Answer

Use the BitArray. For the 0-50,000,000 key range it takes only about 6.25 MB (one bit per possible id); the only real problem is that iteration is Theta(N) in the size of the range, i.e. you have to walk all 50 million positions no matter how many are actually set. Locality of reference is good, though, and you can allocate the entire structure in one operation.
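A minimal C# sketch of the BitArray approach (the class name and demo ids are illustrative, not from the answer):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

class BitSetExample
{
    // One bit per possible id in 0..50,000,000: about 6.25 MB total.
    const int MaxId = 50_000_000;

    static void Main()
    {
        var bits = new BitArray(MaxId + 1);

        // "Add a new value" is a single bit set.
        bits[42] = true;
        bits[1_000_000] = true;

        // "Iterate over all of the values" walks the whole range,
        // regardless of how many bits are actually set.
        var ids = new List<int>();
        for (int i = 0; i <= MaxId; i++)
            if (bits[i]) ids.Add(i);

        Console.WriteLine(string.Join(",", ids)); // prints "42,1000000"
    }
}
```

As a side effect, the iteration yields the ids in ascending order.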

As for wasting space: in the worst case you waste the full ~6.25 MB on a nearly empty set.

EDIT: OK, you've got lots of sets and you're serializing. For serializing to disk, I suggest ~6 MB raw bit-dump files :)
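A sketch of that on-disk format, assuming BitArray's CopyTo into a byte[] (BitSetIo is an illustrative name, not from the answer):

```csharp
using System;
using System.Collections;
using System.IO;

static class BitSetIo
{
    // Dump the whole bit array as raw bytes: ~6.25 MB for 50M bits.
    public static void Save(BitArray bits, string path)
    {
        var bytes = new byte[(bits.Length + 7) / 8];
        bits.CopyTo(bytes, 0);
        File.WriteAllBytes(path, bytes);
    }

    // BitArray has a byte[] constructor, so loading is one call.
    // Note the loaded length is rounded up to a whole byte.
    public static BitArray Load(string path)
    {
        return new BitArray(File.ReadAllBytes(path));
    }
}
```

Every file is the same fixed size regardless of how many ids a set holds, which is exactly the trade-off the answer describes.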

For sending over the wire, just iterate and consider sending ranges of consecutive ids instead of individual elements. That does require a structure that yields the ids in sorted order, which BitArray iteration gives you for free.
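One way to sketch the ranges idea (RangeEncoder and ToRanges are hypothetical names, not from the answer): collapse a sorted id stream into (start, count) runs.

```csharp
using System;
using System.Collections.Generic;

static class RangeEncoder
{
    // Collapse a sorted sequence of ids into (start, count) runs.
    // Dense clusters of keys compress well; fully random ids mostly
    // produce runs of length 1, so this only pays off for clustered data.
    public static List<(int Start, int Count)> ToRanges(IEnumerable<int> sortedIds)
    {
        var runs = new List<(int Start, int Count)>();
        int start = 0, count = 0;
        foreach (int id in sortedIds)
        {
            if (count > 0 && id == start + count)
            {
                count++;                       // id extends the current run
            }
            else
            {
                if (count > 0) runs.Add((start, count));
                start = id;                    // start a new run
                count = 1;
            }
        }
        if (count > 0) runs.Add((start, count));
        return runs;
    }
}
```

For example, ids 1, 2, 3, 10, 11, 20 become the three runs (1, 3), (10, 2), (20, 1).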

You need lots of these sets: at ~100 sets of ~6.25 MB each, that's roughly 625 MB. Consider whether you have that much memory to spare. Otherwise, check out:

  • Bytewise tries: O(1) insert, O(n) iteration, much lower constant factors than bitwise tries
  • A custom hash table, perhaps Google sparsehash through C++/CLI
  • BSTs storing ranges/intervals
  • Supernode BSTs
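For the bytewise-trie bullet, a rough sketch under the assumption of 256-way fan-out per byte with a 256-bit bitmap at the leaves (ByteTrie is an illustrative name; iteration is omitted for brevity):

```csharp
using System;

// Bytewise trie over non-negative 32-bit ids: the top three bytes of
// the id index 256-way interior nodes, and the low byte indexes a
// 256-bit (32-byte) bitmap at the leaf. Sparse regions allocate
// nothing; dense clusters of ids share interior nodes.
class ByteTrie
{
    readonly object[] root = new object[256];

    public void Add(int id)
    {
        object[] node = root;
        for (int shift = 24; shift > 8; shift -= 8)
        {
            int b = (id >> shift) & 0xFF;
            node = (object[])(node[b] ??= new object[256]);
        }
        var leaf = (byte[])(node[(id >> 8) & 0xFF] ??= new byte[32]);
        leaf[(id & 0xFF) >> 3] |= (byte)(1 << (id & 7));
    }

    public bool Contains(int id)
    {
        object[] node = root;
        for (int shift = 24; shift > 8; shift -= 8)
        {
            var next = node[(id >> shift) & 0xFF];
            if (next == null) return false;
            node = (object[])next;
        }
        var leaf = (byte[])node[(id >> 8) & 0xFF];
        return leaf != null && (leaf[(id & 0xFF) >> 3] & (1 << (id & 7))) != 0;
    }
}
```

With ids capped at 50,000,000 the top byte is at most 2, so the root is nearly empty; the structure's footprint tracks how clustered the ids are rather than the full key range.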
Fred Foo answered Sep 30 '22 19:09