I have lots of huge text files that need to be compressed with the highest ratio possible. Compression speed may be slow, as long as decompression is reasonably fast.
Each line in these files contains one dataset, and they can be stored in any order.
A similar problem to this one: Sorting a file to optimize for compression efficiency
But for me compression speed is not an issue. Are there ready-to-use tools to group similar lines together? Or maybe just an algorithm I can implement?
Sorting alone gave some improvement, but I suspect a lot more is possible.
Each file is about 600 million lines long, ~40 bytes each, 24 GB total, and compresses to ~10 GB with xz.
Here's a fairly naïve algorithm:

1. Start with an arbitrary line and write it to the output.
2. For each remaining line, feed it into a copy of the current compression stream and measure how much the compressed output grows.
3. Emit the line that grew the output the least, feed it into the real stream, and remove it from the pool.
4. Repeat steps 2-3 until no lines remain.
This is a greedy algorithm and won't be globally optimal, but it should be pretty good at putting lines next to each other that compress well when one follows the other. It's O(n²), but you said compression speed wasn't an issue. The main advantage is that it's empirical: it doesn't rely on assumptions about which line order will compress well but actually measures it.
If you use zlib, it provides a function deflateCopy that duplicates the state of the compression stream, although it's apparently pretty expensive.
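Here's a minimal sketch of that idea in Python, whose zlib module exposes the analogous Compress.copy(). It's only meant to illustrate the technique on a small input; the O(n²) inner loop is impractical as written for 600 million lines.

```python
import zlib

def greedy_order(lines):
    """Greedily order byte-string lines by which compresses best next."""
    remaining = list(lines)
    ordered = []
    comp = zlib.compressobj(level=9)
    out = b""

    # Start from an arbitrary line.
    current = remaining.pop()
    ordered.append(current)
    out += comp.compress(current)

    while remaining:
        best_i, best_size = None, None
        for i, candidate in enumerate(remaining):
            # Duplicate the compressor state (zlib's deflateCopy under the
            # hood) and measure how many bytes this candidate would add.
            trial = comp.copy()
            size = len(trial.compress(candidate)) + len(trial.flush(zlib.Z_SYNC_FLUSH))
            if best_size is None or size < best_size:
                best_i, best_size = i, size
        current = remaining.pop(best_i)
        ordered.append(current)
        out += comp.compress(current)

    out += comp.flush()
    return ordered, out

# Usage on a small sample:
# lines = [l + b"\n" for l in open("data.txt", "rb").read().splitlines()]
# ordered, blob = greedy_order(lines)
```

To make this tractable you'd likely restrict the candidate search to a bucket of similar lines (e.g. grouped by prefix) rather than scanning the whole pool each time.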
Edit: if you approach this problem as outputting all lines in a sequence while trying to minimize the total edit distance between consecutive lines in the sequence, then it reduces to the Travelling Salesman Problem, with the edit distance as your "distance" and your lines as the nodes you have to visit. So you could look into the various approaches to that problem and apply them here. Even then, the optimal TSP solution in terms of edit distance isn't necessarily going to be the file that compresses the smallest.
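As a rough illustration of that framing, here is a sketch of the simplest TSP heuristic (nearest neighbour) using Levenshtein edit distance as the pairwise cost. The quadratic-time distance and the quadratic tour construction mean it's only feasible on samples or pre-bucketed subsets, and proper TSP heuristics (2-opt, Lin-Kernighan, etc.) would give better tours.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def nearest_neighbour_order(lines):
    """Order lines with the nearest-neighbour TSP heuristic on edit distance."""
    remaining = list(lines)
    tour = [remaining.pop()]  # start from an arbitrary line
    while remaining:
        last = tour[-1]
        # Greedily pick the remaining line closest to the one just emitted.
        i = min(range(len(remaining)),
                key=lambda k: edit_distance(last, remaining[k]))
        tour.append(remaining.pop(i))
    return tour
```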