I created a Python script to compress text by using the Huffman algorithm. Say I have the following string:
string = 'The quick brown fox jumps over the lazy dog'
Running my algorithm returns the following 'bits':
result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
By comparing the amount of bits of the result with the input string, the algorithm seems to work:
>>> print len(result), len(string) * 8
194 344
But now comes the question: how do I write this to a file, while still being able to decode it. You can only write to a file per byte, not per bit. By writing the 'codes' as bytes, there is no compression at all!
I am new at computer science, and the online resources just don't cut it for me. All help is much appreciated!
Edit: note that I had my codes something like this (in case of another input string 'xxxxxxxyzz'
):
{'y': '00', 'x': '1', 'z': '10'}
The way I create the resulting string is by concatenating these codes in order of the input string:
result = '1111111001010'
How to get back to the original string from this result? Or am I getting this completely wrong? Thank you!
Huffman coding is a lossless data compression algorithm. In this algorithm, a variable-length code is assigned to input different characters. The code length is related to how frequently characters are used. Most frequent characters have the smallest codes and longer codes for least frequent characters.
Huffman coding is a form of lossless compression which makes files smaller using the frequency with which characters appear in a message. This works particularly well when characters appear multiple times in a string as these can then be represented using fewer bits . This reduces the overall size of a file.
First you need to convert your input string to bytes:
def _to_Bytes(data):
b = bytearray()
for i in range(0, len(data), 8):
b.append(int(data[i:i+8], 2))
return bytes(b)
Then, open a file to write in binary mode:
result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
with open('test.bin', 'wb') as f:
f.write(_to_Bytes(result))
Now, writing the original string to a file, a comparison of bytes can take place:
import os
with open('test_compare.txt', 'a') as f:
f.write('The quick brown fox jumps over the lazy dog')
_o = os.path.getsize('test_compare.txt')
_c = os.path.getsize('test.bin')
print(f'Original file: {_o} bytes')
print(f'Compressed file: {_c} bytes')
print('Compressed file to about {}% of original'.format(round((((_o-_c)/_o)*100), 0)))
Output:
Original file: 43 bytes
Compressed file: 25 bytes
Compressed file to about 42.0% of original
To get back to the original, you can write a function that determines the possible ordering of characters:
d = {'y': '00', 'x': '1', 'z': '10'}
result = '1111111001010'
from typing import Generator
def reverse_encoding(content:str, _lookup) -> Generator[str, None, None]:
while content:
_options = [i for i in _lookup if content.startswith(i) and (any(content[len(i):].startswith(b) for b in _lookup) or not content[len(i):])]
if not _options:
raise Exception("Decoding error")
yield _lookup[_options[0]]
content = content[len(_options[0]):]
print(''.join(reverse_encoding(result, {b:a for a, b in d.items()})))
Output:
'xxxxxxxyzz'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With