
Compressing small strings: how to create an external dictionary?

I want to compress many small strings (C# strings of about 75-100 characters). At the time the dictionary is created I already know all of the short strings (nearly a trillion). There will be no additional short strings in the future. I need to extract exactly one string without decompressing the other strings.

Now I am looking for a library or the best way to do the following:

  1. Create a dictionary using all the strings I have
  2. Use this dictionary to compress each string
  3. Decompress a single string using the dictionary from step 1

I found a good related question, but it is not C# specific. Maybe there is something for C# that I do not know about, such as a suitable library, or someone has already done this. That is why I am asking this question.

EDIT:

By dictionary I mean something like this: http://en.wikipedia.org/wiki/Dictionary_coder But anything that makes the strings shorter helps. The strings are short text messages in various languages and URLs (30%/70%). The compressed strings do not need to be human readable. They will be stored in binary files.

Chris asked Jun 04 '12 22:06




2 Answers

If there are a trillion strings and no more, then each can be represented in 40 bits (5 bytes), since 2^40 is roughly 1.1 trillion. All you need is a way to use the 5 bytes as an index into the trillion strings.

How do you know all trillion strings? If the compressor and decompressor both have access to all trillion strings, or if there is a way to order and recreate the strings, then all you need is the index.
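The arithmetic behind the 5-byte figure can be checked with a short sketch. It is written in Python purely for brevity, and it assumes the strings can be placed in one fixed order that both sides can reproduce; `TOTAL_STRINGS`, `encode_index` and `decode_index` are illustrative names, not part of any library.

```python
# Assumption: writer and reader share one fixed, reproducible ordering
# of all strings, so a position alone identifies a string.
TOTAL_STRINGS = 10**12  # "a trillion", as stated in the question

# Bits needed to address every index in 0 .. TOTAL_STRINGS - 1.
bits_needed = (TOTAL_STRINGS - 1).bit_length()
bytes_needed = (bits_needed + 7) // 8  # round up to whole bytes


def encode_index(index: int) -> bytes:
    """Pack a string's position into a fixed-width big-endian record."""
    if not 0 <= index < TOTAL_STRINGS:
        raise ValueError("index out of range")
    return index.to_bytes(bytes_needed, "big")


def decode_index(record: bytes) -> int:
    """Recover the position from a fixed-width record."""
    return int.from_bytes(record, "big")


print(bits_needed, bytes_needed)
```

Because every record has the same width, the n-th record in a binary file sits at byte offset `n * bytes_needed`, so a single entry can be read without touching the others.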

If you can't find a way to index the strings, then you can take a subset of the strings and use them as a dictionary for a compressor. Take the most representative sample (you need to figure out what makes some strings more common than others, or more representative of the rest) and concatenate them into a 32K dictionary. That is about 400 of your trillion strings. Then use zlib's deflateSetDictionary on the compress end and inflateSetDictionary on the decompress end, both with exactly the same 32K dictionary. That will provide good compression on the short strings.
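A minimal sketch of the preset-dictionary approach, using Python's standard `zlib` module, whose `zdict` argument maps onto deflateSetDictionary and inflateSetDictionary. The sample strings are invented for illustration; a real dictionary would be built from about 32 KB of representative strings from the actual data. For C#, a zlib binding that exposes the dictionary calls would be needed, which is an assumption to verify for whichever library is chosen.

```python
import zlib

# Hypothetical sample strings standing in for a representative subset.
samples = [
    "http://www.example.com/articles/2012/06/",
    "http://www.example.com/users/profile?id=",
    "Thanks for your message, see you tomorrow!",
]
# zlib only uses the last 32 KB of a dictionary, and material near the
# end is cheaper to reference, so common strings are best placed last.
ZDICT = "".join(samples).encode("utf-8")[-32768:]


def compress(text: str) -> bytes:
    """Compress one string on its own, primed with the shared dictionary."""
    c = zlib.compressobj(level=9, zdict=ZDICT)
    return c.compress(text.encode("utf-8")) + c.flush()


def decompress(blob: bytes) -> str:
    """Decompress one string; needs only the blob and the same dictionary."""
    d = zlib.decompressobj(zdict=ZDICT)
    return (d.decompress(blob) + d.flush()).decode("utf-8")


msg = "http://www.example.com/articles/2012/06/compression"
blob = compress(msg)
print(len(msg), len(blob), decompress(blob) == msg)
```

Each string is compressed as an independent stream, so any one of them can be extracted without decompressing the rest. The zlib wrapper adds a header and checksum to every stream; raw deflate avoids that overhead if each byte matters.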

Mark Adler answered Oct 03 '22 03:10


I haven't used it, but Smaz sounds promising for this...

Smaz is a simple compression library suitable for compressing very short strings. General purpose compression libraries will build the state needed for compressing data dynamically, in order to be able to compress every kind of data. This is a very good idea, but not for a specific problem: compressing small strings will not work.

Smaz instead is not good for compressing general purpose data, but can compress text by 40-50% in the average case (works better with English), and is able to perform a bit of compression for HTML and urls as well. The important point is that Smaz is able to compress even strings of two or three bytes!

For example the string "the" is compressed into a single byte.

Since it's written in C, check out Bart De Smet's example of interoperating with C from C#.
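The general idea described in the quoted text, a fixed codebook of common fragments that each map to one byte, can be illustrated with a toy sketch. This is not Smaz's real codebook or byte format; `CODEBOOK`, `ESCAPE`, `encode` and `decode` are hypothetical names chosen for the example.

```python
# Toy codebook, NOT Smaz's real table or wire format.
CODEBOOK = ["the", "http://", "www.", ".com", " ", "e", "t", "a"]
ESCAPE = 255  # marks that the next byte is a literal not in the codebook

# Try longer fragments first so "the" wins over "t".
_ENTRIES = sorted(
    ((frag.encode("utf-8"), code) for code, frag in enumerate(CODEBOOK)),
    key=lambda entry: -len(entry[0]),
)


def encode(text: str) -> bytes:
    """Replace known fragments with one-byte codes, escape everything else."""
    data = text.encode("utf-8")
    out = bytearray()
    pos = 0
    while pos < len(data):
        for frag, code in _ENTRIES:
            if data.startswith(frag, pos):
                out.append(code)
                pos += len(frag)
                break
        else:
            # No fragment matched: emit the byte verbatim after an escape.
            out += bytes([ESCAPE, data[pos]])
            pos += 1
    return bytes(out)


def decode(blob: bytes) -> str:
    """Expand codes back into fragments and copy escaped literals."""
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out += CODEBOOK[blob[i]].encode("utf-8")
            i += 1
    return out.decode("utf-8")


print(len(encode("the")), decode(encode("http://www.example.com/the")))
```

Nothing is learned at run time, so even a three-byte input can shrink, but text that misses the codebook grows because every unmatched byte costs two bytes here.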

Steve Wortham answered Oct 03 '22 04:10