compressed vector/array class with random data access

I would like to make a "compressed array"/"compressed vector" class (details below) that allows random data access in more or less constant time.

"More or less constant time" means that although element access time isn't constant, it shouldn't keep increasing as I get closer to a certain point of the array. I.e. the container shouldn't do significantly more calculations to get one element than another (like "decompress everything once again to get the last element" versus "do almost nothing to get the first"). This can probably be achieved by splitting the array into chunks of compressed data. Accessing one element should take "averageTime" +- some deviation; in other words, I want best-case and worst-case access time to be relatively close to average access time.

What are my options (suitable algorithms/already available containers - if there are any)?

Container details:

  1. Container acts as a linear array of identical elements (such as std::vector)
  2. Once container is initialized, data is constant and never changes. Container needs to provide read-only access.
  3. Container should behave like array/std::vector - i.e. values accessed via operator[], there is .size(), etc.
  4. It would be nice if I could make it a template class.
  5. Access to data should be more or less constant-time. I don't need same access time for every element, but I shouldn't have to decompress everything to get last element.

Usage example:
Binary search on data.

Data details:
  1. Data is structs mostly consisting of floats and a few ints. There are more floats than ints. No strings.
  2. It is unlikely that there are many identical elements in the array, so simply indexing the data won't be possible.
  3. Size of one element is less than 100 bytes.
  4. Total data size per container is between a few kilobytes and a few megabytes.
  5. Data is not sparse - it is a contiguous block of elements, all of them are assigned, there are no "empty slots".

The goal of compression is to reduce amount of ram the block takes when compared to uncompressed representation as array, while keeping somewhat reasonable read access performance, and allowing to randomly access elements as array. I.e. data should be stored in compressed form internally, and I should be able to access it (read-only) as if it is a std::vector or similar container.

Ideas/Opinions?

SigTerm asked Aug 06 '10 13:08

3 Answers

I take it that you want an array whose elements are not stored vanilla, but compressed, to minimize memory usage.

Concerning compression, you have no exceptional insight about the structure of your data, so you're fine with some kind of standard entropy encoding. Ideally, you would like to run gzip on your whole array and be done with it, but that would lose O(1) access, which is crucial to you.

A solution is to use Huffman coding together with an index table.

Huffman coding works by replacing each input symbol (for instance, an ASCII byte) with another symbol of variable bit length, depending on how frequently it occurs in the whole stream. For instance, the character 'E' appears very often, so it gets a short bit sequence, while 'W' is rare and gets a long bit sequence.

E -> 0b10
W -> 0b11110

Now, compress your whole array with this method. Unfortunately, since the output symbols have variable length, you can no longer index your data as before: item number 15 is no longer at stream[15*sizeof(item)].

Fortunately, this problem can be solved by using an additional index table, index, that stores where each item starts in the compressed stream. In other words, the compressed data for item 15 can be found at stream[index[15]]; the index table accumulates the variable output lengths.

So, to get item 15, you simply start decompressing at stream[index[15]]. This works because Huffman coding doesn't do anything fancy to the output: it just concatenates the code words, so you can start decoding at any item boundary inside the stream without having to decode all previous items.

Of course, the index table adds some overhead; you may want to tweak the granularity (for example, storing an offset only for every k-th item) so that compressed data + index table is still smaller than the original data.
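The scheme above can be sketched roughly as follows. This is only an illustration under some assumptions: the class name HuffmanArray is made up, items are treated as raw bytes (so T must be trivially copyable, default-constructible and preferably free of padding), the bit stream is kept in a std::vector<bool> for simplicity, and one 32-bit bit offset is stored per item.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <queue>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical sketch: a read-only array whose items are Huffman-coded byte by byte.
template <typename T>
class HuffmanArray {
    static_assert(std::is_trivially_copyable<T>::value, "items are copied as raw bytes");

    struct Node { int child[2]; int symbol; };  // symbol >= 0 marks a leaf
    std::vector<Node> nodes_;
    int root_ = -1;
    std::vector<bool> stream_;          // compressed data, one bit per entry
    std::vector<std::uint32_t> index_;  // bit offset at which each item starts

    // Walk the tree and record the bit path leading to every leaf.
    void assign_codes(int n, std::vector<bool>& path,
                      std::vector<std::vector<bool>>& codes) const {
        if (nodes_[n].symbol >= 0) { codes[nodes_[n].symbol] = path; return; }
        for (int bit = 0; bit < 2; ++bit) {
            path.push_back(bit == 1);
            assign_codes(nodes_[n].child[bit], path, codes);
            path.pop_back();
        }
    }

public:
    explicit HuffmanArray(const std::vector<T>& items) {
        if (items.empty()) return;
        const unsigned char* raw = reinterpret_cast<const unsigned char*>(items.data());
        const std::size_t total = items.size() * sizeof(T);

        // Count how often each byte value occurs in the whole array.
        std::vector<std::size_t> freq(256, 0);
        for (std::size_t i = 0; i < total; ++i) ++freq[raw[i]];

        // Build the Huffman tree by repeatedly merging the two rarest nodes.
        using Entry = std::pair<std::size_t, int>;  // (frequency, node id)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (int s = 0; s < 256; ++s) {
            if (freq[s] == 0) continue;
            nodes_.push_back(Node{{-1, -1}, s});
            heap.push(Entry(freq[s], static_cast<int>(nodes_.size()) - 1));
        }
        while (heap.size() > 1) {
            const Entry a = heap.top(); heap.pop();
            const Entry b = heap.top(); heap.pop();
            nodes_.push_back(Node{{a.second, b.second}, -1});
            heap.push(Entry(a.first + b.first, static_cast<int>(nodes_.size()) - 1));
        }
        root_ = heap.top().second;

        std::vector<std::vector<bool>> codes(256);
        std::vector<bool> path;
        assign_codes(root_, path, codes);

        // Concatenate the code words and remember where each item begins.
        for (std::size_t i = 0; i < items.size(); ++i) {
            index_.push_back(static_cast<std::uint32_t>(stream_.size()));
            for (std::size_t b = 0; b < sizeof(T); ++b) {
                const std::vector<bool>& code = codes[raw[i * sizeof(T) + b]];
                stream_.insert(stream_.end(), code.begin(), code.end());
            }
        }
    }

    std::size_t size() const { return index_.size(); }

    // Decode item i only, starting at stream_[index_[i]].
    T operator[](std::size_t i) const {
        assert(i < size());
        unsigned char bytes[sizeof(T)];
        std::size_t pos = index_[i];
        for (std::size_t b = 0; b < sizeof(T); ++b) {
            int n = root_;
            while (nodes_[n].symbol < 0) n = nodes_[n].child[stream_[pos++] ? 1 : 0];
            bytes[b] = static_cast<unsigned char>(nodes_[n].symbol);
        }
        T item;
        std::memcpy(&item, bytes, sizeof(T));
        return item;
    }
};
```

Usage would look like HuffmanArray<float> a(values); float x = a[15];. Note that with one 4-byte offset per item the index can easily eat the savings for small items, which is why a coarser index (one offset per group of items) is probably what you would use in practice.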

Heinrich Apfelmus answered Oct 20 '22 21:10


Are you coding for an embedded system and/or do you have hundreds or thousands of these containers? If not, while I think this is an interesting theoretical question (+1), I suspect that the slowdown caused by the decompression will be non-trivial and that it would be better to use a plain std::vector.

Next, are you sure that the data you're storing is sufficiently redundant that smaller blocks of it will actually be compressible? Have you tried saving off blocks of different sizes (powers of 2 perhaps) and tried running them through gzip as an exercise? It may be that any extra data needed to help the decompression algorithm (depending on approach) would reduce the space benefits of doing this sort of compressed container.

If you decide that it's still reasonable to do the compression, then there are at least a couple of possibilities, none pre-written though. You could compress each individual element, storing a pointer to the compressed data chunk. Then index access is still constant, just needing to decompress the actual data. Possibly using a proxy object would make doing the actual data decompression easier and more transparent (and maybe even allow you to use std::vector as the underlying container).

Alternately, std::deque stores its data in chunks already, so you could use a similar approach here. For example, use std::vector<compressed_data_chunk> as your underlying container, where each chunk holds, say, 10 items compressed together. Then you can still directly index the chunk you need, decompress it, and return the item from the decompressed data. If you want, your containing object (that holds the vector) could even cache the most recently decompressed chunk or two for added performance on consecutive access (although this wouldn't help binary search very much at all).
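A minimal sketch of the chunked approach, under stated assumptions: ChunkedArray, rle_compress and rle_decompress are hypothetical names, and the toy run-length codec is only a self-contained placeholder for a real compressor such as zlib (it will not compress float data well). The one-chunk cache is mutable state, so this version is not thread-safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Toy run-length codec producing (count, value) pairs.
// Placeholder only: swap in a real compression library here.
inline Bytes rle_compress(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && run < 255 && in[i + run] == in[i]) ++run;
        out.push_back(static_cast<unsigned char>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

inline Bytes rle_decompress(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.insert(out.end(), static_cast<std::size_t>(in[i]), in[i + 1]);
    return out;
}

// Read-only array stored as independently compressed chunks of ChunkItems items.
template <typename T, std::size_t ChunkItems = 10>
class ChunkedArray {
    static_assert(std::is_trivially_copyable<T>::value, "items are copied as raw bytes");

    std::vector<Bytes> chunks_;  // one compressed blob per chunk
    std::size_t size_ = 0;
    mutable std::size_t cached_chunk_ = static_cast<std::size_t>(-1);
    mutable Bytes cache_;        // most recently decompressed chunk

public:
    explicit ChunkedArray(const std::vector<T>& items) : size_(items.size()) {
        for (std::size_t first = 0; first < size_; first += ChunkItems) {
            const std::size_t n = std::min(ChunkItems, size_ - first);
            Bytes raw(n * sizeof(T));
            std::memcpy(raw.data(), items.data() + first, raw.size());
            chunks_.push_back(rle_compress(raw));
        }
    }

    std::size_t size() const { return size_; }

    // Cost is bounded by decompressing one chunk, wherever i lies in the array.
    T operator[](std::size_t i) const {
        assert(i < size_);
        const std::size_t chunk = i / ChunkItems;
        if (chunk != cached_chunk_) {
            cache_ = rle_decompress(chunks_[chunk]);
            cached_chunk_ = chunk;
        }
        T item;
        std::memcpy(&item, cache_.data() + (i % ChunkItems) * sizeof(T), sizeof(T));
        return item;
    }
};
```

The chunk size is the knob here: larger chunks usually compress better but make each cache miss more expensive, and a binary search will miss the cache on most probes until the range narrows to a single chunk.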

Mark B answered Oct 20 '22 21:10


I've been thinking about this for a while now. From a theoretical point of view I identified 2 possibilities:

  • Flyweight, because repetition can be lessened with this pattern.
  • Serialization (compression is some form of serialization)

The first is purely object-oriented and, I think, fits well in general; for example, it doesn't have the disadvantage of messing up pointers.

The second seems better adapted here, although it does have a slight disadvantage in general: pointer invalidation, plus issues with pointer encoding/decoding, virtual tables, etc. Notably, it doesn't work if the items refer to each other using pointers instead of indices.

I have seen a few "Huffman coding" solutions, however this means that for each structure one needs to provide a compressing algorithm. It's not easy to generalize.

So I'd rather go the other way and use a general-purpose compression library such as zlib, or a faster algorithm such as LZO, combined with:

  • A B* tree (or a variant) with a large number of items per node (since the data doesn't move), say 1001. Each node contains a compressed representation of the array of items. Indices are not compressed.
  • Possibly: a cache_view to access the container while storing the last 5 (or so) decompressed nodes. Another variant is to implement reference counting and keep the data uncompressed as long as someone holds a handle to one of the items in the node.

Some remarks:

  • if you store a large number of items/keys per node you have near constant access time: for example, with 1001 there are only 2 levels of indirection as long as you store less than a million items, 3 levels of indirection for a billion, etc.
  • you can build a readable/writable container with such a structure. I would make it so that I only recompress once I am done writing the node.
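
The arithmetic behind the first remark can be checked with a few lines. This is just a sketch; levels_needed is a made-up helper name, and it assumes the product fanout^levels does not overflow 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of levels such that fanout^levels >= items.
// Assumes fanout >= 2 and that the running product fits in 64 bits.
inline int levels_needed(std::uint64_t items, std::uint64_t fanout) {
    assert(fanout >= 2);
    int levels = 1;
    std::uint64_t capacity = fanout;  // entries reachable with `levels` levels
    while (capacity < items) {
        capacity *= fanout;
        ++levels;
    }
    return levels;
}
```

With a fanout of 1001, levels_needed(999999, 1001) gives 2 (since 1001^2 = 1,002,001) and levels_needed(1000000000, 1001) gives 3 (since 1001^3 is just over a billion).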
Matthieu M. answered Oct 20 '22 22:10
