I have a collection of sets that I'd like to place in a trie.
Normal tries are made of strings of elements - that is, the order of the elements is important. Sets lack a defined order, so there's the possibility of greater compression.
For example, given the strings "abc", "bc", and "c", I'd create the trie:
(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)
But given the sets { 'a', 'b', 'c' }, { 'b', 'c' }, and { 'c' }, I could create the above trie, or any of these eleven:
(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)

(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('b',2) -> ('a',1) -> ('c',1)
                 -> ('c',1)
      -> ('c',1)

(*,3) -> ('b',1) -> ('a',1) -> ('c',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('b',2) -> ('c',2) -> ('a',1)
      -> ('c',1)

(*,3) -> ('b',1) -> ('c',1) -> ('a',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
      -> ('b',1) -> ('c',1)

(*,3) -> ('c',3) -> ('a',1) -> ('b',1)
                 -> ('b',1)

(*,3) -> ('c',2) -> ('b',1) -> ('a',1)
      -> ('b',1) -> ('c',1)

(*,3) -> ('c',3) -> ('b',2) -> ('a',1)
So there's obviously room for compression (7 nodes to 4).
I suspect defining a local order at each node dependent on the relative frequency of its children would do it, but I'm not certain, and it might be overly expensive.
So before I hit the whiteboard, and start cracking away at my own compression algorithm, is there an existing one? How expensive is it? Is it a bulk process, or can it be done per-insert/delete?
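For reference, here's a minimal sketch of the counting trie I'm describing (Python; the Node/insert/size names are just placeholders I made up), showing how the total node count depends on the order chosen for each set's elements:

    class Node:
        """One trie node: children keyed by symbol, count = number of sets passing through."""
        def __init__(self):
            self.children = {}
            self.count = 0

    def insert(root, symbols):
        """Insert one set, using the element order given by `symbols`."""
        node = root
        node.count += 1
        for s in symbols:
            node = node.children.setdefault(s, Node())
            node.count += 1

    def size(node):
        """Total node count, root included."""
        return 1 + sum(size(child) for child in node.children.values())

    # String order, as in the first trie above: 7 nodes.
    r1 = Node()
    for seq in ["abc", "bc", "c"]:
        insert(r1, seq)
    print(size(r1))  # 7

    # Ordering each set with the most common element first: 4 nodes.
    r2 = Node()
    for seq in ["cba", "cb", "c"]:
        insert(r2, seq)
    print(size(r2))  # 4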
I think you should sort each set by item frequency; that should give a good heuristic, as you suspect. The same approach is used in FP-growth (frequent pattern mining), whose FP-tree represents item sets compactly.
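As a rough sketch of that ordering step (Python; the frequency_order name is mine, not from any library), mirroring how FP-growth orders items before building its FP-tree:

    from collections import Counter

    def frequency_order(sets):
        """Order each set by descending global item frequency (ties broken by symbol)."""
        freq = Counter(item for s in sets for item in s)
        return [sorted(s, key=lambda item: (-freq[item], item)) for s in sets]

    sets = [{'a', 'b', 'c'}, {'b', 'c'}, {'c'}]
    print(frequency_order(sets))
    # [['c', 'b', 'a'], ['c', 'b'], ['c']] -- inserting these into the trie yields the 4-node version above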