Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reducing memory usage when storing all subsequences of a set of sequences

Tags:

common-lisp

I have a set of input sequences (represented as lists), for each of which I'm generating a set of all of its subsequences (also lists). These subsequences are stored as keys in an EQUAL hash-table, and never garbage collected or modified (the associated values, however, are modified).

My problem is that the method I'm currently using to accomplish this uses far more heap space than I'd like.

To be clear about what I'm talking about, suppose we have a sequence:

#1=(A B C D)

The subsequences are:

#2=()
#3=(A)
#4=(A B)
#5=(A B C)
#6=(A B C D)
#7=(B)
#8=(B C)
#9=(B C D)
#10=(C)
#11=(C D)
#12=(D)

The following code, which I admit isn't very good, produces this set (except the empty subsequence which I don't actually care about):

(defun subseqs-of-length (length sequence)
  (if (< (length sequence) length)
      nil
      (loop for start from 0 to (- (length sequence) length)
           collect (subseq sequence start (+ start length)))))

(defun subseqs-of-length-< (length sequence)
  (loop for len from 1 to (1- length)
     append (subseqs-of-length len sequence)))

(defun all-subseqs (sequence)
    (subseqs-of-length-< (1+ (length sequence)) sequence))

As you can probably imagine, with a lot of input sequences, this uses up a huge amount of heap.

The most obvious way I can think of to save memory is to share a bunch of list structure; for instance, you can build list 11 by (cons 'c '#12#), list 9 by (cons 'b '#11#), list 8 by (cons 'b '#10# and so forth. It would be even better if the output of multiple calls to all-subseqs could share structure. But I can't think of an elegant way to write this.

My question is two-fold:

  1. Is there an elegant way to write all-subseqs so that its results will share structure that way?
  2. It occurred to me that it might be easier to generate the subsequences naïvely and write a function to compress the results and run it periodically. Isn't this a silly idea?
  3. Is there an even more memory efficient way to store these subsequences and access/update the values associated with them? Perhaps something using arrays, or a different data structure instead of a hash-table?
like image 604
Samuel Edwin Ward Avatar asked Mar 13 '12 22:03

Samuel Edwin Ward


1 Answers

What is the minimal amount of information needed to reconstruct a subsequence?

How about the sequence, a start index, and end index. Store the start/end indexes in a two-dimensional hash table (a hash table of hash tables), or a 2-d array. Then write functions to do work on that data structure (reconstruct the subsequence given the sequence, a start index, and end index, etc).

Another idea: If you really want to store the subsequences in the data structure, you could use an N-dimensional hash table (a hash table of hash tables of hash tables ...), where N is the length of the sequence. Instead of a linked list of elements, you get a linked tree of hash tables. There is a special key in each hash table, storing the actual value of the keyed sequence all the way up to that point (that is, if this keyed sequence actually has a value). And other links in each hash table point to branches further down the tree.

And if each element in the list is a symbol, then an 'a in one list will be eq (share same mem location) as an 'a in another list. Which I think means you can get O(N) storage growth + whatever incremental growth occurs by adding links to the data structure.

like image 129
Clayton Stanley Avatar answered Oct 21 '22 10:10

Clayton Stanley