I am looking for suggestions on what kind of data-structure to use for extremely large structures in OCaml that scale well.
By scales well, I don't want stack overflows, or exponential heap growth, assuming there is enough memory. So this pretty much eliminates the standard lib's List.map function. Speed isn't so much an issue.
But for starters, let's assume I'm operating in the realm of 2^10 - 2^100 items.
There are only three "manipulations" I perform on the structure:
(1) a map function on subsets of the structure, which either increases or decreases the structure
(2) scanning the structure
(3) removal of specific pairs of items in the structure that satisfy a particular criterion
Originally I was using regular lists, which is still highly desirable, because the structure is constantly changing. Usually after all manipulations are performed, the structure has at most either doubled in size (or something thereabouts), or reduced to the empty list []. Perhaps the doubling dooms me from the beginning but it is unavoidable.
In any event, around 2^15 --- 2^40 items start causing severe problems (probably due to the naive list functions I was using as well). The program uses 100% of the cpu, but almost no memory, and generally after a day or two it stack-overflows.
I would prefer to start using more memory, if possible, in order to continue operating in larger spaces.
Anyway, if anyone has any suggestions it would be much appreciated.
If you have enough space, in theory, to contain all items of your data structure, you should look at data structures that have an efficient memory representation, with as few bookeeping as possible. Dynamic arrays (that you resize exponentially when you need more space) are more efficiently stored than list (that pay a full word to store the tail of each cell), so you'd get roughly twice as much elements for the same memory use.
If you cannot hold all elements in memory (this is what your number look like), you should go for a more abstract representation. It's difficult to tell more without more information on what your elements are. But maybe an example of abstract representation would help you devise what you need.
Imagine that I want to record set of integers. I want to make unions, intersections of those sets, and also some more funky operations such as "get all elements that are multiple". I want to be able to do that for really large sets (zillions of distinct integers), and then I want to be able to pick one element, any one, in this set I have built. Instead of trying to store lists of integers, or set of integers, or array of booleans, what I can do is store the logical formulas corresponding to the definition of those sets: a set of integers P is characterized by a formula F such that F(n) ⇔ n∈P. I can therefore define a type of predications (conditions):
type predicate =
| Segment of int * int (* n ∈ [a;b] *)
| Inter of predicate * predicate
| Union of predicate * predicate
| Multiple of int (* n mod a = 0 *)
Storing these formulas requires little memory (proportional to the number of operations I want to apply in total). Building the intersection or the union takes constant time. Then I'll have some work to do to find an element satisfying the formula; basically I'll have to reason about what those formulas mean, get a normal form out of them (they are all of the form "the elements of a finite union of interval satisfying some modulo criterions"), and from there extract some element.
In the general case, when you get a "command" on your data set, such that "add the result of mapping over this subset", you can always, instead of actually evaluating this command, store this as data – the definition of your structure. The more precisely you can describe those commands (eg. you say "map", but storing an (elem -> elem) function will not allow you to reason easily on the result, maybe you can formulate that mapping operation as a concrete combination of operations), the more precisely you will be able to work on them at this abstract level, without actually computing the elements.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With