
Matches overlapping the look-ahead in LZ77/LZSS with suffix trees

Background: I have an implementation of a generic LZSS backend in C++ (available here). The matching algorithm I use in this version is exceedingly simple, because it was originally meant to compress relatively small files (at most 64kB) for relatively ancient hardware (specifically, the Mega Drive/Sega Genesis, where 64kB is the entirety of main RAM).

Nevertheless, some files take far too long to compress with my implementation, on the order of minutes. The reason is twofold: the naïve matching algorithm takes most of the time, and it is hit especially hard because I construct a compression graph from the file to achieve optimal compression. Looking at the profiler, most of the time is spent looking for matches, dwarfing even the quadratic size of the resulting graph.

For some time, I have been studying several potential replacements; one that drew my attention was dictionary-symbolwise flexible parsing using multilayer suffix trees. The multilayer part is important because one of the LZSS variants I am interested in uses variable-size encodings for (position, length).

My current implementation allows matches in the sliding window to overlap the look-ahead buffer, so that this input:

aaaaaaaaaaaaaaaa

can be directly encoded as

(0,'a')(1,0,15)

instead of

(0,'a')(1,0,1)(1,0,2)(1,0,4)(1,0,8)

Here, (0,'a') means encoding character 'a' as a literal, while (1,n,m) means 'copy m characters from position n'.
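For context, overlap costs nothing on the decoding side: a decoder that copies one byte at a time handles it automatically, because each source byte has already been written by the time it is needed. A minimal sketch of such a copy, purely for illustration (not taken from the actual backend):

#include <cstddef>
#include <string>

// Append `length` bytes starting at absolute position `pos` of `out` to `out`.
// With out = "a", pos = 0 and length = 15, this expands to sixteen 'a's,
// even though the source range overlaps the bytes being written.
void copy_match(std::string &out, std::size_t pos, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        out.push_back(out[pos + i]);  // byte at pos + i already exists when read
    }
}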

The question: Having said all that, here is my problem: every resource I have found on suffix trees seems to imply that they cannot handle the overlapping case, and only lets you find non-overlapping matches. Where suffix trees were involved, research papers, books and even some implementations gave compression examples without the overlap as if that were the best possible encoding (I would link to some of these, but my reputation does not allow it). Some of them even mentioned that overlaps could be useful when describing the base compression schemes, but were strangely silent on the matter when discussing suffix trees.

Since the suffix tree needs to be augmented to store offset information anyway, this seems like a property that could be checked while looking for a match: you would filter out any matches that start in the look-ahead buffer. And given the way the tree is constructed and updated, if an edge takes you to a node whose match starts in the look-ahead buffer, you could return the previous node instead, since any further descendants will also start in the look-ahead buffer.
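To make the constraint concrete, here is the same rule stated as a brute-force check (purely illustrative, not the suffix-tree search itself, and the identifiers are made up): a match may run into the look-ahead buffer, but it must start strictly before the current position.

#include <cstddef>
#include <utility>
#include <vector>

// Returns (position, length) of the longest valid match for data[cur..].
std::pair<std::size_t, std::size_t>
longest_match(const std::vector<unsigned char> &data,
              std::size_t cur, std::size_t window) {
    std::size_t best_pos = 0, best_len = 0;
    std::size_t start = cur > window ? cur - window : 0;
    for (std::size_t pos = start; pos < cur; ++pos) {   // start must lie in the window
        std::size_t len = 0;
        while (cur + len < data.size() && data[pos + len] == data[cur + len]) {
            ++len;                                      // may read past cur: overlap is fine
        }
        if (len > best_len) { best_pos = pos; best_len = len; }
    }
    return {best_pos, best_len};
}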

Is my reasoning flawed somewhere? Is there an implementation or discussion of LZ77/LZSS with suffix trees that handles matches overlapping the look-ahead buffer?

asked Jul 10 '15 by flamewing

1 Answer

As I understand it, given a suffix tree, we are interested (roughly) in computing, for each suffix S, which longer suffix has the longest common prefix with S.

Add a reference from each tree node to the descendant leaf with the longest suffix (linear time with DFS). From each leaf, walk rootward, examining the new references, stopping if a longer suffix is found. The running time of the latter step is linear, because each tree edge is examined exactly once.
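A rough sketch of those two passes, assuming a generic suffix-tree node with parent/children pointers and a per-leaf suffix start index (the node layout and names are assumptions, not tied to any particular construction such as Ukkonen's):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    std::vector<Node*> children;        // empty for leaves
    Node *parent = nullptr;
    std::size_t suffix_start = 0;       // for leaves: start index of the suffix
    std::size_t best_start = SIZE_MAX;  // start of the longest descendant suffix
};

// Pass 1 (DFS, linear time): point each node at its descendant leaf with the
// longest suffix, i.e. the smallest start index.
std::size_t annotate(Node *n) {
    if (n->children.empty()) { n->best_start = n->suffix_start; return n->best_start; }
    for (Node *c : n->children)
        n->best_start = std::min(n->best_start, annotate(c));
    return n->best_start;
}

// Pass 2: from a leaf, walk rootward until an ancestor knows a strictly longer
// suffix (smaller start index). That ancestor's string depth (not stored here)
// is the length of the longest common prefix. Across all leaves, each edge is
// crossed by exactly one leaf, so the total work is linear.
Node *find_longer(Node *leaf) {
    for (Node *n = leaf->parent; n != nullptr; n = n->parent)
        if (n->best_start < leaf->suffix_start)
            return n;
    return nullptr;  // no longer suffix shares a prefix with this one
}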

Life with a bounded window is unfortunately more difficult. Instead of propagating one reference, we propagate several. To compute the set of suffixes referenced from a node, we merge the children's sets in sorted order by length. Then whenever we have suffixes of lengths x > y > z with x - z < 8192 (say), we can drop y: all three are equally good for every suffix whose leafmost common ancestor with them is the current node, and if y is in the window, then either x or z is too. Since the window is a large fraction of the total file, each node will keep at most a handful of references.
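One possible greedy application of that pruning rule, keeping candidates as sorted start positions (a smaller start means a longer suffix); the window constant is just the example figure from above:

#include <cstddef>
#include <vector>

constexpr std::size_t WINDOW = 8192;  // example window size

// positions: candidate suffix start positions, sorted ascending.
// Drops any middle candidate y whenever its neighbours x and z are less than
// one window apart, so at most a handful of candidates survive per node.
void prune(std::vector<std::size_t> &positions) {
    std::vector<std::size_t> kept;
    for (std::size_t p : positions) {
        while (kept.size() >= 2 && p - kept[kept.size() - 2] < WINDOW)
            kept.pop_back();  // the last kept element is a droppable middle
        kept.push_back(p);
    }
    positions.swap(kept);
}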

If you want to look back the shortest distance possible, then there is an O(n log^2 n)-time algorithm (probably improvable to O(n log n) with various hard-to-implement magic). In the course of the algorithm, we construct, for each node, a binary search tree of its descendant suffixes keyed by length, and do next-longest lookups. To construct a node's tree from its children's, start with the largest child tree and insert the elements from the others. By a heavy-path argument, each length is inserted O(log n) times.
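A sketch of that small-to-large merge with next-longest lookups, using std::set as the balanced search tree; the function and the pairing step are illustrative, not a complete implementation:

#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using LenSet = std::set<std::size_t>;

// Merge the children's sets into the largest one. Each time a length is
// re-inserted, the set containing it at least doubles, so every length is
// inserted O(log n) times overall.
LenSet merge_children(std::vector<LenSet> child_sets) {
    if (child_sets.empty()) return {};
    std::size_t largest = 0;
    for (std::size_t i = 1; i < child_sets.size(); ++i)
        if (child_sets[i].size() > child_sets[largest].size()) largest = i;
    LenSet result = std::move(child_sets[largest]);
    for (std::size_t i = 0; i < child_sets.size(); ++i) {
        if (i == largest) continue;
        for (std::size_t len : child_sets[i]) {
            // Next-longest lookup: the smallest length in `result` exceeding
            // `len` is the closest longer suffix meeting it at this node.
            auto it = result.upper_bound(len);
            if (it != result.end()) {
                // record (len, *it) as a candidate pair here; a complete
                // version would also query in the other direction
            }
            result.insert(len);
        }
    }
    return result;
}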

answered Sep 20 '22 by David Eisenstat