Does the order of data in a text file affect its compression ratio?

I have two large text files (CSV, to be precise). Both have exactly the same content, except that the rows in one file are in one order and the rows in the other file are in a different order.

When I compress these two files (programmatically, using DotNetZip), I notice that one of the files is always considerably bigger; for example, one file is ~7 MB bigger than the other.

My questions are:

How does the order of data in a text file affect compression, and what measures can one take to guarantee the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression (a quick experiment illustrating this is sketched below), but I am not familiar with the internals of the different compression algorithms and I'd appreciate a quick explanation on this subject.

Which algorithm handles this sort of scenario better, in the sense that it would achieve the best average compression regardless of the order of the data?
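
A quick way to test the grouping presumption directly is sketched below, using Python's standard library rather than the C#/DotNetZip setup described above; the row contents are invented for the demo:

```python
import io
import random
import string
import zipfile

rng = random.Random(0)

# Toy CSV: 50,000 rows drawn from 2,000 distinct row "templates".  The
# distinct templates total ~170K, well beyond DEFLATE's 32K match window,
# so duplicates are only visible to the compressor when similar rows
# happen to sit close together.
templates = ["{},{}".format(i, "".join(rng.choices(string.ascii_lowercase, k=80)))
             for i in range(2_000)]
rows = [templates[rng.randrange(2_000)] + "\n" for _ in range(50_000)]

def zipped_size(lines):
    """Size of the lines after DEFLATE compression (what ZIP uses)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.csv", "".join(lines))
    return buf.getbuffer().nbytes

print("random order:", zipped_size(rows))
print("sorted:      ", zipped_size(sorted(rows)))
```

Sorting packs identical rows next to each other, well inside the 32K window, so the sorted version should come out noticeably smaller.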

asked Feb 14 '13 by Icarus

2 Answers

"How" has already been answered. To answer your "which" question:

The larger the window for matching, the less sensitive the algorithm will be to the order. However, all compression algorithms will be sensitive to some degree.

gzip has a 32K window, bzip2 a 900K window, and xz an 8 MB window by default (xz can go up to a 64 MB window), so xz would be the least sensitive to the order. Matches that are further away take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
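
To see the window-size effect concretely, here is a rough sketch using Python's standard-library bindings for the same algorithm families (zlib for the DEFLATE/gzip family, bz2, and lzma for xz); the record layout is invented for the demo:

```python
import bz2
import lzma
import random
import string
import zlib

rng = random.Random(0)

# 500 distinct 200-byte records, repeated 40 times each: ~4 MB of data in
# which every record has many exact duplicates.  Shuffling scatters the
# duplicates far apart; sorting brings them back together.
templates = ["".join(rng.choices(string.ascii_letters, k=200)) + "\n"
             for _ in range(500)]
shuffled = templates * 40
rng.shuffle(shuffled)

sorted_data = "".join(sorted(shuffled)).encode()
shuffled_data = "".join(shuffled).encode()

for name, compress in [("zlib  (32K window)     ", zlib.compress),
                       ("bzip2 (900K blocks)    ", bz2.compress),
                       ("xz    (MB-scale dict.) ", lzma.compress)]:
    s, r = len(compress(sorted_data)), len(compress(shuffled_data))
    print(f"{name} sorted={s:>9,}  shuffled={r:>9,}  penalty={r / s:.1f}x")
```

With a 32K window, most duplicates in the shuffled data fall outside matching range; xz's dictionary spans the whole 4 MB and finds them all, so the shuffling penalty should shrink as the window grows.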

answered Nov 08 '22 by Mark Adler


In some sense, the entropy of the file is what defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then it contains the exact same data (although with a different "meaning"). It is the same data in a different order, and it will not compress nearly as well.

In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
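
That experiment is easy to reproduce; here is a rough equivalent using Python's zlib (the exact byte counts will differ from the figures above depending on the compressor and its settings):

```python
import random
import zlib

# 100,000 bytes cycling a-z, then the same bytes shuffled in place.
alphabet = b"abcdefghijklmnopqrstuvwxyz"
ordered = (alphabet * (100_000 // len(alphabet) + 1))[:100_000]
shuffled = bytearray(ordered)
random.shuffle(shuffled)

print("ordered: ", len(zlib.compress(ordered)))           # a few hundred bytes
print("shuffled:", len(zlib.compress(bytes(shuffled))))   # tens of thousands
```

Both buffers contain exactly the same letter counts, so a pure symbol-frequency coder would treat them identically; the large difference comes entirely from the repeated-substring structure that shuffling destroys.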

answered Nov 08 '22 by Mark Wilkins