 

MARISA trie suffix compression?

Tags: c++, python, trie

I'm using a custom Cython wrapper of the MARISA trie library as a key-value multimap.

My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.
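For concreteness, here is a minimal sketch (plain Python, hypothetical helper names) of how such entries could be packed and unpacked; the byte order for data2 is an assumption, since nothing above specifies one:

import struct

DELIM = b"\xff"

def encode_entry(key, data1, data2):
    # key 0xff data1 0xff data2, with data2 stored as a raw
    # 4-byte unsigned int (little-endian here; an assumption)
    return key + DELIM + data1 + DELIM + struct.pack("<I", data2)

def decode_entry(entry):
    # data2 is the fixed-width 4-byte tail, so peel it off first;
    # this assumes neither key nor data1 contains the 0xff delimiter
    body, data2 = entry[:-5], entry[-4:]
    key, data1 = body.split(DELIM, 1)
    return key, data1, struct.unpack("<I", data2)[0]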

I know a trie is not the most optimal data structure for this from a theoretical point of view, but various practical considerations make it the best available choice.

In this use case, I have about 10-20 million keys, each with on average 10 data points. data2 is redundant across many entries (in some cases, data2 is identical for all data points under a given key), so I had the idea of taking the most frequent data2 value per key and adding a ("", base_data2) data point to each key, as sketched below.
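A sketch of that rewrite (hypothetical names, not the actual Cython code), where None marks tuples whose data2 is implied by the base entry and so needs no 4 bytes in the trie:

from collections import Counter

def factor_out_base(entries):
    # entries: {key: [(data1, data2), ...]}
    out = {}
    for key, points in entries.items():
        # the most frequent data2 for this key becomes the base value
        base = Counter(d2 for _, d2 in points).most_common(1)[0][0]
        rewritten = [("", base)]
        for d1, d2 in points:
            # None means "inherit base_data2", so no 4 bytes are stored
            rewritten.append((d1, None if d2 == base else d2))
        out[key] = rewritten
    return out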

Since a MARISA trie, to my knowledge, does not do suffix compression, and for a given key each data1 is unique, I assumed this would save 4 bytes per data tuple that used the redundant value (at the cost of one extra 4-byte "value" per key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie only went from 566MB to 557MB (with a similar reduction in RAM usage for a loaded trie).

From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.

So removing a single redundant trailing byte produced more than twice the improvement of removing the same number of 4-byte sequences: either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.

My remaining theory is that adding the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should only add about 4 bytes per key while I've removed many more than that from lower down in the trie.

I'm not optimistic about a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.

asked Jul 03 '17 by gmoss

1 Answer

As I suspected, it's caused by padding.

In lib/marisa/grimoire/vector/vector.h, there is the following function:

void write_(Writer &writer) const {
  writer.write((UInt64)total_size());         // 8-byte size header
  writer.write(const_objs_, size_);           // the vector's payload
  writer.seek((8 - (total_size() % 8)) % 8);  // pad to an 8-byte boundary
}

The key line is writer.seek((8 - (total_size() % 8)) % 8);: after writing each chunk, the writer pads the stream out to the next 8-byte boundary.

This explains the behavior you are seeing: part of the data removed by your initial shortening of the entries was simply replaced with padding.

When you then removed the extra trailing byte, it brought the data below the next 8-byte boundary, resulting in the disproportionately large size change.
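A quick back-of-the-envelope check of that arithmetic (plain Python, illustrative sizes only):

def padded(total_size):
    # mirrors writer.seek((8 - (total_size % 8)) % 8)
    return total_size + (8 - (total_size % 8)) % 8

# shaving one byte off a 561-byte vector saves a full 8 bytes on disk,
# while shaving 4 bytes off a 560-byte one saves nothing at all
assert padded(561) - padded(560) == 8
assert padded(560) - padded(556) == 0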

Practically, what this means is that, since the padding code lives in the serialization part of the library, you are probably getting the in-memory savings you expected, even though they did not fully translate into on-disk savings. Monitoring the program's RAM usage should confirm that.

If disk size is your concern, then you might as well simply deflate the serialized data, since MARISA does not appear to apply any general-purpose compression of its own.
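For example, with the standard-library zlib (the file names here are hypothetical):

import zlib

with open("trie.marisa", "rb") as f:
    raw = f.read()
# DEFLATE at maximum compression; the trie's bit-packed structure
# may still leave some redundancy for a general-purpose compressor
with open("trie.marisa.z", "wb") as f:
    f.write(zlib.compress(raw, 9))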

answered Nov 01 '22 by Frank