 

Python - Compress Ascii String

I'm looking for a way to compress an ASCII string. Any help?

I also need to decompress it. I tried zlib, but it didn't help.

What can I do to compress the string to a shorter length?

code:

def compress(request):
    if request.POST:
        data = request.POST.get('input')
        if is_ascii(data):
            result = zlib.compress(data)
            return render_to_response('index.html', {'result': result, 'input':data}, context_instance = RequestContext(request))
        else:
            result = "Error, the string is not ascii-based"
            return render_to_response('index.html', {'result':result}, context_instance = RequestContext(request))
    else:
        return render_to_response('index.html', {}, context_instance = RequestContext(request))
min asked Oct 13 '12 09:10



2 Answers

Using compression will not always reduce the length of a string!

Consider the following code:

import zlib
import bz2

def comptest(s):
    print 'original length:', len(s)
    print 'zlib compressed length:', len(zlib.compress(s))
    print 'bz2 compressed length:', len(bz2.compress(s))

Let's try this on an empty string:

In [15]: comptest('')
original length: 0
zlib compressed length: 8
bz2 compressed length: 14

So zlib adds 8 bytes and bz2 adds 14. Compression formats usually put a header in front of the compressed data (and often a checksum after it) for use by the decompression program. This overhead increases the length of the output.
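
To see how much of the output is wrapper rather than compressed data, here is a minimal sketch, assuming Python 3 (where zlib works on bytes rather than on the Python 2 strings used above). The helper names raw_deflate and raw_inflate are made up for this illustration. A negative wbits value asks zlib for a raw DEFLATE stream, without the 2-byte zlib header and the 4-byte Adler-32 checksum:

```python
import zlib

def raw_deflate(data: bytes) -> bytes:
    """Compress to a raw DEFLATE stream (hypothetical helper name)."""
    # wbits=-15: same window size as the default, but no zlib header
    # and no trailing Adler-32 checksum.
    compressor = zlib.compressobj(-1, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

def raw_inflate(blob: bytes) -> bytes:
    """Decompress a stream produced by raw_deflate (hypothetical helper name)."""
    return zlib.decompress(blob, -15)

wrapped = zlib.compress(b'')   # zlib format: header + data + checksum
bare = raw_deflate(b'')        # compressed data only
print('wrapper overhead in bytes:', len(wrapped) - len(bare))
```

Dropping the wrapper saves only a few bytes and gives up the integrity check, so it does not change the conclusion: very short inputs still do not shrink.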

Let's test a single word:

In [16]: comptest('test')
original length: 4
zlib compressed length: 12
bz2 compressed length: 40

Even if you subtract the length of the header, the compression hasn't made the word any shorter. That is because in this case there is little to compress: most of the characters in the string occur only once. Now for a short sentence:

In [17]: comptest('This is a compression test of a short sentence.')
original length: 47
zlib compressed length: 52
bz2 compressed length: 73

Again the compression output is larger than the input text. Due to the limited length of the text, there is little repetition in it, so it won't compress well.

You need a fairly long block of text for compression to actually work:

In [22]: rings = '''
   ....:     Three Rings for the Elven-kings under the sky, 
   ....:     Seven for the Dwarf-lords in their halls of stone, 
   ....:     Nine for Mortal Men doomed to die, 
   ....:     One for the Dark Lord on his dark throne 
   ....:     In the Land of Mordor where the Shadows lie. 
   ....:     One Ring to rule them all, One Ring to find them, 
   ....:     One Ring to bring them all and in the darkness bind them 
   ....:     In the Land of Mordor where the Shadows lie.'''

In [23]: comptest(rings)                       
original length: 410
zlib compressed length: 205
bz2 compressed length: 248
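
Since the question also asks about decompression, here is a rough Python 3 sketch of the same comparison that checks the round trip as well. It assumes the input is a str containing only ASCII characters; the function name roundtrip_sizes is invented for this example and is not part of the code above.

```python
import bz2
import zlib

def roundtrip_sizes(text: str) -> dict:
    """Compress text with zlib and bz2, verify decompression, report sizes.

    Hypothetical helper; raises UnicodeEncodeError if text is not ASCII.
    """
    data = text.encode('ascii')  # Python 3: compressors need bytes, not str
    z = zlib.compress(data)
    b = bz2.compress(data)
    # Decompressing must give back exactly the original bytes.
    assert zlib.decompress(z) == data
    assert bz2.decompress(b) == data
    return {'original': len(data), 'zlib': len(z), 'bz2': len(b)}

print(roundtrip_sizes('test'))
print(roundtrip_sizes('One Ring to rule them all, ' * 20))
```

The second call uses a highly repetitive input, which is the kind of text where the compressed form ends up shorter than the original.
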
Roland Smith answered Sep 24 '22 13:09



You don't even need your data to be ASCII; you can feed zlib anything:

>>> import zlib
>>> a='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' # + any binary data you want
>>> print zlib.compress(a)
x�KL$
�
>>>

You probably want the compressed data to be an ASCII string as well, right?
If so, keep in mind that ASCII gives you a much smaller alphabet to encode the compressed data with, so you will need more symbols to represent it.

For example, you can encode the binary data as base64 to get an ASCII string, but that takes about 33% more space (4 output characters for every 3 input bytes).
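
A minimal sketch of that approach, assuming Python 3 and ASCII-only input. The names compress_to_ascii and decompress_from_ascii are hypothetical helpers for this illustration, not library functions:

```python
import base64
import zlib

def compress_to_ascii(text: str) -> str:
    """Compress ASCII text and return the result as a base64 ASCII string."""
    packed = zlib.compress(text.encode('ascii'))
    return base64.b64encode(packed).decode('ascii')

def decompress_from_ascii(token: str) -> str:
    """Reverse compress_to_ascii: base64-decode, then decompress."""
    packed = base64.b64decode(token)
    return zlib.decompress(packed).decode('ascii')

original = 'a' * 37
token = compress_to_ascii(original)
print(len(original), len(token))
print(decompress_from_ascii(token) == original)
```

Whether the base64 form ends up shorter than the original depends on how well the input compresses, because the base64 step always adds back about a third on top of the compressed size.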

Sergey answered Sep 25 '22 13:09
