Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I analyze or improve my niece's simple compression algorithm that is based on Morse code?

My 8 year old niece had a lesson on Morse code in school yesterday, and her assignment was to convert various phrases to Morse code. One of the phrases included her age, and instead of writing ---.., she wrote 3-2. because (in her words), "it's less writing that way." This rudimentary "compression algorithm" sparked my curiosity, so I wrote a bit of code to implement it.

However, we made a few changes along the way. I pointed out to her that if you wrote just .....-----, there isn't any way to tell if the writer meant 50 or eeeeettttt. In reality, there is a pause in between each letter of each word and every word so this isn't a problem, but our scheme didn't have that. I pulled out some graph paper and suggested padding the Morse code for each symbol with another symbol to facilitate coding and remove ambiguity from the scheme. My nice suggested using + because "no one ever writes those in sentences." (Ouch, I recently graduated with a math degree, but fair enough.)

Since some of us do write with +, and we all use hyphens and periods/dots, which would conflict with our standard definition of Morse code, these symbols are replaced with p, h, and d, respectively. Of course, this brings us to the problem of what to do with symbols that aren't defined in our extended Morse code. My niece wanted to simply ignore them, so that's what we did. For the sake of case sensitive preservation of the textual message, upper case letters aren't lower cased in the code; they're just carried through as is and padded with +.

Summary of the algorithm:

  1. Morse codes are right-padded to 5 characters with +
  2. We extended Morse code to substitute p for +, d for ., and h for -.
  3. Symbols that aren't defined in our "extended" Morse code are passed through intact.
  4. Runs of symbols are replaced with unless only one consecutive character occurs, in which case the number is omitted.

Potential pitfalls:

  1. My padding scheme probably reduces the effectiveness of the compression.
  2. Compression might be improved by using blocks larger than 5 characters
  3. If either my niece or I knew anything about compression algorithms, we could probably use that makes them successful.
  4. This obviously isn't suitable for production, but since there are numerous efficient compression algorithms for such purposes, I'm ignoring that problem for the time being.
  5. ???

Example:

In our algorithm, "Hello, World" translates to

H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++

and compresses to

H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+

Here is the Python code I threw together:

#!/usr/bin/python3

import itertools
import string

class MorseString(str):
    def __init__(self, string):
        # or, pad values during iteration but this seems neater
        self._char_morse_map = {"a":".-+++", "b":"-...+", "c":"-.-.+", "d":"-..++", \
                                "e":".++++", "f":"..-.+", "g":"--.++", "h":"....+", \
                                "i":"..+++", "j":".---+", "k":"-.-++", "l":".-..+", \
                                "m":"--+++", "n":"-.+++", "o":"---++", "p":".--.+", \
                                "q":"--.-+", "r":".-.++", "s":"...++", "t":"-++++", \
                                "u":"..-++", "v":"...-+", "w":".--++", "x":"-..-+", \
                                "y":"-.--+", "z":"--..+", "1":".----", "2":"..---", \
                                "3":"...--", "4":"....-", "5":".....", "6":"-....", \
                                "7":"--...", "8":"---..", "9":"----.", "0":"-----",
                                " ":"+++++", ".":"d++++", "+":"p++++", "-":"h++++"}
        
        self._morse_char_map = dict()
        for letter, code in self._char_morse_map.items():
            self._morse_char_map[code] = letter

        self._string = string

        # convert the string to "Morse code". Could also use c.lower()
        self._morse_string = "".join([self._char_morse_map.get(c, c.ljust(5, "+")) for c in self._string])

    def compress(self):
        def grouper(n, k):
            return str(n) + k if n > 1 else k

        # could just use lambda
        return "".join([grouper(len(list(g)), k) for k, g in itertools.groupby(self._morse_string)])

    def decompress(self):
        i = 0
        start = 0
        chars = list()
        sc = self.compress()
        while i < len(sc):
            curr = sc[i]
            i += 1
            if not(curr in string.digits):
                num = 1 if start + 1 == i else int(sc[start:i-1])
                chars.append("".join(curr * num))
                start = i

        code = "".join(chars)
        chars = list()

        for i in range(0, len(code), 5):
            piece = "".join(code[i:i+5])
            chars.append(self._morse_char_map.get(piece, piece[0]))
            
        return "".join(chars)


def main():
    s0 = "Hello, World"
    ms0 = MorseString(s0)
    print(ms0._morse_string)
    print(ms0.compress())
    assert(s0 == ms0.decompress())
    
    s1 = "Hello  2 world."
    ms1 = MorseString(s1)
    assert(s1 == ms1.decompress())

    s2 = "The quick brown fox jumped over the lazy dog."
    ms2 = MorseString(s2)
    assert(s2 == ms2.decompress())

    s3 = "abc  -.+"
    ms3 = MorseString(s3)
    ms3
    assert(s3 == ms3.decompress())

if __name__ == "__main__":
    main()

What are some simple methods that would a) improve our algorithm, and b) be relatively easy to explain to my 8-year old niece? While the last point is clearly subjective, I'm nevertheless trying to indulge her curiosity as much as possible.

I welcome any improvements to the code as well since it's not structured terribly well (I'm fairly certain it's structured quite poorly, actually, but it's quick and dirty), although that's strictly for my benefit since I haven't gotten my niece to use Python (YET).

Update

Here is an updated version of the code, which attempts to incorporate both user1884905's modifications to the algorithm and Karl's improvements to the code itself.

import itertools
import string

_char_morse_map = {"a":".-", "b":"-...", "c":"-.-.", "d":"-..", \
                   "e":".", "f":"..-.", "g":"--.", "h":"....", \
                   "i":"..", "j":".---", "k":"-.-", "l":".-..", \
                   "m":"--", "n":"-.", "o":"---", "p":".--.", \
                   "q":"--.-", "r":".-.", "s":"...", "t":"-", \
                   "u":"..-", "v":"...-", "w":".--", "x":"-..-", \
                   "y":"-.--", "z":"--..", "1":".----", "2":"..---", \
                   "3":"...--", "4":"....-", "5":".....", "6":"-....", \
                   "7":"--...", "8":"---..", "9":"----.", "0":"-----",
                   " ":"", ".":"d", "+":"p", "-":"h"}

_morse_char_map = {
    code: letter
    for letter, code in _char_morse_map.items()
}


def encode(s):
    return "".join(_char_morse_map.get(c, c) + "+" for c in s)

def decode(encoded):
    return "".join(decode_gen(encoded))

def decode_gen(encoded):
    word = ""
    for c in encoded:
        if c != "+":
            word += c
        else:
            yield _morse_char_map.get(word, word) if word != "" else " "
            word = ""

def compress(s):
    def grouper(n, k):
        return str(n) + k if n > 1 else k

    return "".join(grouper(len(list(g)), k) for k, g in itertools.groupby(s))

def decompress(compressed):
    return "".join(decompress_gen(compressed))

def decompress_gen(compressed):
    digits = ""
    for c in compressed:
        if c in string.digits:
            digits += c
        else:
            number = int(digits) if digits else 1
            yield "".join(c * number)
            digits = ""
like image 690
John Bensin Avatar asked Mar 28 '13 03:03

John Bensin


People also ask

Is there a perfect compression algorithm?

No. It can be proven that there is not even an algorithm to determine how well a perfect compressor will do. See Kolmogorov Complexity. Huffman coding (or arithmetic coding) by itself does not get close to the best compression.

What is compression decompression algorithm?

Compression algorithms are normally used to reduce the size of a file without removing information. This can increase their entropy and make the files appear more random because all of the possible bytes become more common.

What are compression algorithms written in?

C is a great choice for writing a compression program. You can use plenty of other languages too, though.


1 Answers

A simple change that can be made to compress your output (at least in most cases) is to keep the idea of a pause between two letters and to let your +-signs denote this pause (i.e. the pluses are used as a new-character-character instead of using them as padding).

This will make all numbers 0-9 one character longer but make quite a few commonly occurring letters shorter.

For example a, e, i and t will become .-+, .+, ..+ and -+ instead of .-+++, .++++, ..+++ and -++++. (And space can be denoted + instead of +++++)

So your Hello, World example would become:

H+.+.-..+.-..+---+,++W+---+.-.+.-..+-..+

instead of

H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++

and compress to

H+.+.-2.+.-2.+3-+,2+W+3-+.-.+.-2.+-2.+

instead of

H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+

Understanding this reasoning, Huffman encoding will only seem like the natural next step, where the basic idea is to make the most common characters take up as little amount of space as possible.

Edit:

See also this wikipedia-image of a dichotomatic search table showing the most commonly occurring characters close to the top of the tree.

Morse code tree

like image 105
user1884905 Avatar answered Oct 27 '22 14:10

user1884905