
compressed string storage

Let's say I have many objects containing strings of non-trivial length (around 3-4 KB each). The strings are all different from each other, yet they share many common parts/subsequences. On average, maybe 80-90% of any individual string is also contained within the others. Is there an easy way to automatically exploit this huge redundancy to compress the data?
Ideally the solution would be in C++ and transparent to the user (i.e. I could use it as if I were accessing a regular read-only const std::string, while actually reading from compressed storage).

BuschnicK asked Dec 03 '10 09:12



1 Answer

Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.
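To make the shared-dictionary idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the dictionary is built in one training pass over all strings and then frozen, so each string can be decoded on its own by plain table lookup. The names SharedLzwDictionary and CompressedString are hypothetical and chosen only for illustration; they are not from the answer or from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: an LZW-style phrase dictionary shared by all strings.
class SharedLzwDictionary {
public:
    using Code = std::uint32_t;

    SharedLzwDictionary() {
        // Seed with all 256 single-byte phrases so any input can be encoded.
        for (int b = 0; b < 256; ++b) {
            addPhrase(std::string(1, static_cast<char>(b)));
        }
    }

    // Training pass: usual LZW phrase growth, except that the dictionary
    // persists across calls, so later strings reuse earlier phrases.
    void train(const std::string& text) {
        std::string current;
        for (char c : text) {
            std::string extended = current + c;
            if (codes_.count(extended)) {
                current = std::move(extended);
            } else {
                addPhrase(extended);
                current.assign(1, c);
            }
        }
    }

    // Greedy longest-match encoding against the (now frozen) dictionary.
    // This works because the dictionary is prefix-closed by construction.
    std::vector<Code> compress(const std::string& text) const {
        std::vector<Code> out;
        std::string current;
        for (char c : text) {
            std::string extended = current + c;
            if (codes_.count(extended)) {
                current = std::move(extended);
            } else {
                out.push_back(codes_.at(current));
                current.assign(1, c);
            }
        }
        if (!current.empty()) out.push_back(codes_.at(current));
        return out;
    }

    // Decoding is a lookup per code; no per-string dictionary rebuild.
    std::string decompress(const std::vector<Code>& codes) const {
        std::string out;
        for (Code code : codes) {
            assert(code < phrases_.size());
            out += phrases_[code];
        }
        return out;
    }

private:
    void addPhrase(const std::string& phrase) {
        codes_.emplace(phrase, static_cast<Code>(phrases_.size()));
        phrases_.push_back(phrase);
    }

    std::unordered_map<std::string, Code> codes_;  // phrase -> code
    std::vector<std::string> phrases_;             // code -> phrase
};

// Hypothetical read-only wrapper: stores only the codes and expands on demand.
// The dictionary must outlive every CompressedString that refers to it.
class CompressedString {
public:
    CompressedString(const SharedLzwDictionary& dict, const std::string& text)
        : dict_(&dict), codes_(dict.compress(text)) {}

    std::string str() const { return dict_->decompress(codes_); }
    std::size_t codeCount() const { return codes_.size(); }

private:
    const SharedLzwDictionary* dict_;
    std::vector<SharedLzwDictionary::Code> codes_;
};
```

Usage would be: call train() once on every string, then construct a CompressedString per object and call str() when the text is needed. Note that this sketch stores each code as a 32-bit integer and each phrase as a full std::string, so real savings would need packed code widths and a more compact phrase table (for example prefix code plus one byte per entry).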

NPE answered Sep 20 '22 20:09