I have a large text file that mainly consists of numbers and some delimiters like ,|{}[]: etc. I used Lempel-Ziv encoding for compression. The code I used is not mine; it is the one from Rosetta Code. I ran it once for line-by-line compression and once for chunk-by-chunk compression:
# 'compress' is the LZW compressor from Rosetta Code; LARGE_FILE and
# comp_file (the output file object) are defined elsewhere in the script.

def readChunk(file_object, size=1024):
    # Lazily yield fixed-size chunks of the file until EOF.
    while True:
        data = file_object.read(size)
        if not data:
            break
        yield data

def readByChunk():
    with open(LARGE_FILE, 'r') as f:
        for data in readChunk(f, 2048):
            # compress() returns a list of integer codes; serialize
            # them as space-separated decimal text.
            compressed_chunk = compress(data)
            comp_file.write(" ".join(map(str, compressed_chunk)))
def readLineByLine():
    with open(LARGE_FILE, 'r') as f:
        for data in f:  # iterate lazily instead of readlines()
            compressed_line = compress(data)
            comp_file.write(" ".join(map(str, compressed_line)))
Both functions output a file that is much bigger than the original file! Decompression works fine, i.e. I am able to get the original text back, so I think the compression itself is correct.
Am I doing something wrong in saving the file?
The compressor you are using is terrible for this purpose: compress() returns a list of integer codes, and you then write every code as space-separated decimal text, so each code costs several bytes on disk even when it stands for only a character or two of input. Try zlib.compress instead, writing its output in binary mode.
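A minimal sketch of that approach, assuming the whole file fits in memory; LARGE_FILE is the input path from the question, and compressed.bin is a hypothetical output name:

import zlib

# Read the file as raw bytes, compress with zlib, and write the
# result in binary mode.
with open(LARGE_FILE, 'rb') as f:
    data = f.read()

compressed = zlib.compress(data, 9)  # level 9 = best compression
with open('compressed.bin', 'wb') as out:
    out.write(compressed)

# Round trip: zlib.decompress restores the original bytes exactly.
assert zlib.decompress(compressed) == data

The important difference is that zlib produces bytes you store as-is in a binary file, rather than integer codes re-encoded as text.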
The general answer to "when does compression not help?" is "when the data is random bits", or already compressed; 99% of other ordinary data compresses just fine. For ASCII data like yours, really trivial compressors suffice: plain Huffman coding alone gets you a decent boost (see the sketch below), and you say you only use about a dozen unique characters.
Which means that either you have a bunch of random data you're not telling us about, or there's a bug in the compressor.
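As a rough illustration of that Huffman claim (a sketch, not the asker's code): the following computes Huffman code lengths over a made-up sample drawn from a digits-plus-delimiters alphabet, using only the standard heapq and collections modules, and compares the raw size to the coded size (ignoring the cost of storing the code table).

import heapq
from collections import Counter

def huffman_code_lengths(text):
    # Build a Huffman tree over character frequencies and return the
    # code length (in bits) assigned to each symbol.
    freq = Counter(text)
    # Heap entries: (weight, tiebreaker, {symbol: code length so far});
    # the integer tiebreaker keeps tuple comparison from reaching the dict.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them.
        merged = {s: n + 1 for s, n in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return freq, heap[0][2]

sample = "12,34|56{78}[90]:12,34"  # made-up data in the question's alphabet
freq, lengths = huffman_code_lengths(sample)
bits = sum(freq[s] * lengths[s] for s in freq)
print(f"raw: {len(sample) * 8} bits, Huffman-coded: {bits} bits")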