Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to write Huffman coding to a file using Python?

I created a Python script to compress text by using the Huffman algorithm. Say I have the following string:

string = 'The quick brown fox jumps over the lazy dog'

Running my algorithm returns the following 'bits':

result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'

By comparing the amount of bits of the result with the input string, the algorithm seems to work:

>>> print len(result), len(string) * 8
194 344

But now comes the question: how do I write this to a file, while still being able to decode it. You can only write to a file per byte, not per bit. By writing the 'codes' as bytes, there is no compression at all!

I am new at computer science, and the online resources just don't cut it for me. All help is much appreciated!

Edit: note that I had my codes something like this (in case of another input string 'xxxxxxxyzz'):

{'y': '00', 'x': '1', 'z': '10'}

The way I create the resulting string is by concatenating these codes in order of the input string:

result = '1111111001010'

How to get back to the original string from this result? Or am I getting this completely wrong? Thank you!

like image 354
Pim Klaassen Avatar asked Jul 19 '18 14:07

Pim Klaassen


People also ask

What is Huffman coding explain with example?

Huffman coding is a lossless data compression algorithm. In this algorithm, a variable-length code is assigned to input different characters. The code length is related to how frequently characters are used. Most frequent characters have the smallest codes and longer codes for least frequent characters.

How is Huffman coding used to compress data?

Huffman coding is a form of lossless compression which makes files smaller using the frequency with which characters appear in a message. This works particularly well when characters appear multiple times in a string as these can then be represented using fewer bits . This reduces the overall size of a file.


1 Answers

First you need to convert your input string to bytes:

def _to_Bytes(data):
  b = bytearray()
  for i in range(0, len(data), 8):
    b.append(int(data[i:i+8], 2))
  return bytes(b)

Then, open a file to write in binary mode:

result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
with open('test.bin', 'wb') as f:
  f.write(_to_Bytes(result))

Now, writing the original string to a file, a comparison of bytes can take place:

import os
with open('test_compare.txt', 'a') as f:
  f.write('The quick brown fox jumps over the lazy dog')

_o = os.path.getsize('test_compare.txt')
_c = os.path.getsize('test.bin')
print(f'Original file: {_o} bytes')
print(f'Compressed file: {_c} bytes')
print('Compressed file to about {}% of original'.format(round((((_o-_c)/_o)*100), 0)))

Output:

Original file: 43 bytes
Compressed file: 25 bytes
Compressed file to about 42.0% of original

To get back to the original, you can write a function that determines the possible ordering of characters:

d = {'y': '00', 'x': '1', 'z': '10'}
result = '1111111001010'
from typing import Generator
def reverse_encoding(content:str, _lookup) -> Generator[str, None, None]:
  while content:
    _options = [i for i in _lookup if content.startswith(i) and (any(content[len(i):].startswith(b) for b in _lookup) or not content[len(i):])]
    if not _options:
      raise Exception("Decoding error")
    yield _lookup[_options[0]]
    content = content[len(_options[0]):]

print(''.join(reverse_encoding(result, {b:a for a, b in d.items()})))

Output:

'xxxxxxxyzz'
like image 130
Ajax1234 Avatar answered Sep 29 '22 00:09

Ajax1234