Lets say I have many objects containing strings of non-trivial length (around ~3-4kb). The strings are all different from each other yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained withing the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?
Ideally the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).
"String Compression Algorithm” or “Run Length Encoding” happens when you compress a string, and the consecutive duplicates of each string are replaced with the character, followed by the consecutive, repeated character count. For example: After string compression, the string “aaaabbcddddd” would return “a4b2c1d5”.
Compact String – Java 9 Java 9 has brought the concept of compact Strings back. This means that whenever we create a String if all the characters of the String can be represented using a byte — LATIN-1 representation, a byte array will be used internally, such that one byte is given for one character.
Deflator is one of most used class for string compression in java. It uses the popular ZLIB compression library. It provides the function called deflator() to compress string. The function first takes the input string data, performs the compression and then fills the given buffer with the compressed data.
Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With