Re-order lines in text file for better compression ratio

I have lots of huge text files that need to be compressed with the highest ratio possible. Compression speed may be slow, as long as decompression is reasonably fast.

Each line in these files contains one dataset, and they can be stored in any order.

A similar problem to this one: Sorting a file to optimize for compression efficiency

But for me compression speed is not an issue. Are there ready-to-use tools to group similar lines together? Or maybe just an algorithm I can implement?

Sorting alone gave some improvement, but I suspect a lot more is possible.

Each file is about 600 million lines long, ~40 bytes per line, 24 GB in total; xz compresses it to roughly 10 GB.

asked Nov 08 '22 by Craden


1 Answer

Here's a fairly naïve algorithm:

  • Choose an initial line at random and write it to the compression stream.
  • While remaining lines > 0:
    • Save the state of the compression stream
    • For each remaining line in the text file:
      • write the line to the compression stream and record the resulting compressed length
      • roll back to the saved state of the compression stream
    • Write the line that resulted in the lowest compressed length to the compression stream, and remove it from the remaining lines
    • Free the saved state

This is a greedy algorithm and won't be globally optimal, but it should be pretty good at placing lines next to each other that compress well when they follow one another. It's O(n²), but you said compression speed wasn't an issue. The main advantage is that it's empirical: it doesn't rely on assumptions about which line order will compress well, but actually measures it.

If you use zlib, it provides a function deflateCopy that duplicates the state of the compression stream, although it's apparently pretty expensive.
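
To make that concrete, here is a minimal Python sketch of the greedy loop, assuming the lines are already in memory as bytes. Python's zlib.compressobj exposes the same state duplication through its copy() method (backed by deflateCopy), and the sketch simply starts from the first line instead of a random one:

    import zlib

    def greedy_compression_order(lines):
        """Greedily append whichever remaining line adds the fewest
        compressed bytes to the stream so far.  O(n^2) trial
        compressions, so only practical on small inputs or on
        per-chunk samples of the real file."""
        remaining = list(lines)            # each line as bytes
        comp = zlib.compressobj(level=9)

        ordered = [remaining.pop(0)]       # sketch: first line instead of a random pick
        comp.compress(ordered[0])

        while remaining:
            best_i, best_cost = None, None
            for i, line in enumerate(remaining):
                trial = comp.copy()        # duplicate the compressor state
                cost = len(trial.compress(line)) + len(trial.flush(zlib.Z_SYNC_FLUSH))
                if best_cost is None or cost < best_cost:
                    best_i, best_cost = i, cost
            chosen = remaining.pop(best_i)
            ordered.append(chosen)
            comp.compress(chosen)          # commit the winner to the real stream
        return ordered

The Z_SYNC_FLUSH on the throwaway copy just forces out its pending output so the marginal cost of each candidate can be measured; only the chosen lines are ever fed to the real stream.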

Edit: if you approach this problem as outputting all lines in a sequence while trying to minimize the total edit distance between consecutive lines in the sequence, then it reduces to the Travelling Salesman Problem, with the edit distance as your "distance" and all your lines as the nodes you have to visit. So you could look into the various approaches to that problem and apply them to this one. Even then, the optimal TSP solution in terms of edit distance isn't necessarily going to be the file that compresses the smallest.
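
As a rough illustration of that reduction, a nearest-neighbour TSP heuristic over Levenshtein distances looks something like the following toy sketch. It needs O(n²) distance computations, so at 600 million lines it would only be usable on samples or chunks:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,               # deletion
                                curr[j - 1] + 1,           # insertion
                                prev[j - 1] + (ca != cb))) # substitution
            prev = curr
        return prev[-1]

    def nearest_neighbour_order(lines):
        """Greedy TSP tour: always hop to the remaining line with the
        smallest edit distance from the line just emitted."""
        remaining = list(lines)
        tour = [remaining.pop(0)]                          # arbitrary start node
        while remaining:
            last = tour[-1]
            i = min(range(len(remaining)),
                    key=lambda k: levenshtein(last, remaining[k]))
            tour.append(remaining.pop(i))
        return tour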

answered Nov 15 '22 by samgak