
Algorithms for compression of set tries

I have a collection of sets that I'd like to place in a trie.

Normal tries are made of strings of elements - that is, the order of the elements is important. Sets lack a defined order, so there's the possibility of greater compression.

For example, given the strings "abc", "bc", and "c", I'd create the trie:

(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)

But given the sets { 'a', 'b', 'c' }, { 'b', 'c' }, { 'c' }, I could create the above trie, or any of these eleven:

(*,3) -> ('a',1) -> ('b',1) -> ('c',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
      -> ('b',1) -> ('c',1)
      -> ('c',1)

(*,3) -> ('a',1) -> ('c',1) -> ('b',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('b',2) -> ('a',1) -> ('c',1)
                 -> ('c',1)
      -> ('c',1)

(*,3) -> ('b',1) -> ('a',1) -> ('c',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('b',2) -> ('c',2) -> ('a',1)
      -> ('c',1)

(*,3) -> ('b',1) -> ('c',1) -> ('a',1)
      -> ('c',2) -> ('b',1)

(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
      -> ('b',1) -> ('c',1)

(*,3) -> ('c',2) -> ('a',1) -> ('b',1)
                 -> ('b',1)

(*,3) -> ('c',2) -> ('b',1) -> ('a',1)
      -> ('b',1) -> ('c',1)

(*,3) -> ('c',3) -> ('b',2) -> ('a',1)

So there's obviously room for compression (7 nodes to 4).

I suspect that defining a local order at each node, based on the relative frequency of its children, would do it, but I'm not certain, and it might be overly expensive.

So before I hit the whiteboard, and start cracking away at my own compression algorithm, is there an existing one? How expensive is it? Is it a bulk process, or can it be done per-insert/delete?
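For concreteness, the idea can be sketched in a few lines of Python (a minimal sketch with nested dicts as trie nodes; the function and variable names are my own, not from any existing library), comparing a fixed alphabetical insertion order against a global-frequency order on the three example sets:

```python
from collections import Counter

def build_trie(sets, key):
    """Insert each set as a path, ordering its elements by `key`.
    The trie is nested dicts; returns (root, node count incl. root)."""
    root = {}
    count = 1  # count the root node, as in the figures above
    for s in sets:
        node = root
        for item in sorted(s, key=key):
            if item not in node:
                node[item] = {}
                count += 1
            node = node[item]
    return root, count

sets = [{'a', 'b', 'c'}, {'b', 'c'}, {'c'}]

# Naive: fixed alphabetical order (same shape as the string trie above).
_, naive = build_trie(sets, key=lambda x: x)

# Heuristic: most frequent element first, so sets that share their
# common elements also share a trie prefix.
freq = Counter(item for s in sets for item in s)
_, packed = build_trie(sets, key=lambda x: (-freq[x], x))

print(naive, packed)  # 7 4
```

Counting the root, the alphabetical order reproduces the 7-node trie above, while the frequency order collapses it to the 4-node trie at the end of the list.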

rampion asked Feb 22 '12



1 Answer

I think you should sort each set by item frequency; as you suspect, this gives a good heuristic. The same approach is used in FP-growth (frequent pattern mining) to represent item sets compactly.
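That preprocessing step (the reordering FP-growth applies to each transaction before building its tree) can be sketched as follows; the variable names are my own, and the transactions are the question's example sets:

```python
from collections import Counter

transactions = [{'a', 'b', 'c'}, {'b', 'c'}, {'c'}]

# 1. Count the support (frequency) of each item across all sets.
support = Counter(item for t in transactions for item in t)

# 2. Rewrite each set as a sequence, most frequent item first
#    (ties broken alphabetically), so sets sharing their most
#    common items also share prefixes when inserted into a trie.
ordered = [sorted(t, key=lambda x: (-support[x], x)) for t in transactions]
print(ordered)  # [['c', 'b', 'a'], ['c', 'b'], ['c']]
```

Inserting the reordered sequences into an ordinary trie yields the maximally shared ('c',3) -> ('b',2) -> ('a',1) shape from the question.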

Alexander Kuznetsov answered Sep 23 '22