I've been doing some research on compression-based text classification and I'm trying to figure out a way of storing a dictionary built by the encoder (on a training file) for use to run 'statically' on a test file? Is this at all possible using UNIX's gzip utility?
For example I have been using 2 'class' files of sport.txt and atheism.txt, hence I want to run compression on both of these files and store their dictionaries used. Next I want to take a test file (which is unlabelled, could be either atheism or sport) and by using the prebuilt dictionaries on this test.txt I can analyse how well it compresses under that dictionary/model.
Thanks
deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source for potential matches to the string of bytes starting at the current position. The last 32K bytes is called the "dictionary", but the name is perhaps misleading.
You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary()
and inflateSetDictionary()
functions. In that case, zlib compression is primed with a "dictionary" of 32K bytes that effectively precede the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.
gzip provides no support for preset dictionaries.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With