I need to be able to search for text in a large number of zipped text files (.txt). The compression may change later, possibly to something proprietary. I want to avoid unpacking all the files; instead I'd like to compress (encode) the search string and search directly in the compressed files. This should be possible using Huffman compression with the same codebook for all files. I don't want to reinvent the wheel, so: does anyone know of a library that does something like this, a Huffman algorithm that is implemented and tested, or maybe a better idea?
thanks in advance
Most text files are compressed with one of the LZ-family of algorithms, which combine a Dictionary Coder together with an Entropy Coder such as Huffman.
Because the Dictionary Coder relies on a continuously updated "dictionary" derived from all of the input data seen so far, its output for the current symbol depends on the entire preceding history. It is therefore not possible to jump to an arbitrary location and start decoding without first decoding all of the previous data.
In my opinion, you can just use a zlib stream decoder, which returns decompressed data as it goes without waiting for the entire file to be decompressed. This will not save execution time, but it will save memory.
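As a minimal sketch of that idea, assuming your files sit in a standard ZIP archive, Python's `zipfile` decompresses each entry incrementally as you read it, so you can scan for the search string chunk by chunk without holding a whole file in memory (the archive name and chunk size here are just placeholders):

```python
import zipfile

def search_zip(archive_path, needle):
    """Stream-decompress each .txt entry in a ZIP archive and report matches."""
    needle_bytes = needle.encode("utf-8")
    matches = []
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".txt"):
                continue
            with zf.open(name) as member:  # decompresses incrementally as we read
                carry = b""
                while True:
                    chunk = member.read(64 * 1024)
                    if not chunk:
                        break
                    block = carry + chunk
                    if needle_bytes in block:
                        matches.append(name)
                        break
                    # keep a tail so a match spanning two chunks is not missed
                    carry = block[-(len(needle_bytes) - 1):] if len(needle_bytes) > 1 else b""
    return matches

print(search_zip("texts.zip", "search term"))
```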
A second suggestion is to do Huffman coding on English words, and forget about the Dictionary Coder part. Each English word is mapped to a unique prefix-free code.
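Here is a minimal sketch of word-level Huffman coding, assuming a shared word-frequency table is built over your whole corpus so that every file uses the same codebook (the function name and the sample sentence are only for illustration):

```python
import heapq
from collections import Counter
from itertools import count

def build_codebook(word_counts):
    """Build a prefix-free Huffman code (word -> bit string) from word frequencies."""
    tie = count()  # unique tie-breaker so heapq never has to compare tree nodes
    heap = [(freq, next(tie), word) for word, freq in word_counts.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single distinct word
        _, _, word = heap[0]
        return {word: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    _, _, root = heap[0]
    codebook = {}
    def assign(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse left/right
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: an actual word
            codebook[node] = prefix
    assign(root, "")
    return codebook

counts = Counter("the quick brown fox jumps over the lazy dog the end".split())
book = build_codebook(counts)
encoded = "".join(book[w] for w in "the lazy dog".split())
print(book["the"], encoded)  # encode the query word, then scan the coded stream for it
```

One design caveat: because the codes are not byte-aligned, naively scanning the packed bit stream for the query's bit pattern can match at positions that are not code boundaries, so in practice you either decode word by word from a known boundary or store per-file offsets.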
Finally, @SHODAN gave the most sensible suggestion, which is to index the files, compress the index, and bundle it with the compressed text files. To do a search, decompress just the index file and look up the words. This is in fact an improvement over doing the Huffman coding on words: once you have found the word frequencies (in order to assign the prefix codes optimally), you have effectively already built the index, so you can keep that index for searching.
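A minimal sketch of that approach, assuming a simple inverted index (word -> list of files) serialized as JSON and compressed with zlib; the helper names and sample documents are hypothetical:

```python
import json
import zlib

def build_index(documents):
    """documents: {filename: text}. Returns a zlib-compressed inverted index."""
    index = {}
    for name, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(name)
    return zlib.compress(json.dumps(index).encode("utf-8"))

def search(compressed_index, word):
    """Decompress only the index and return the files containing the word."""
    index = json.loads(zlib.decompress(compressed_index).decode("utf-8"))
    return index.get(word.lower(), [])

docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
packed = build_index(docs)
print(search(packed, "lazy"))  # -> ['b.txt']
```

Only the matching files then need to be decompressed, and the index stays small even for a large corpus.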