My 8 year old niece had a lesson on Morse code in school yesterday, and her assignment was to convert various phrases to Morse code. One of the phrases included her age, and instead of writing ---..
, she wrote 3-2.
because (in her words), "it's less writing that way." This rudimentary "compression algorithm" sparked my curiosity, so I wrote a bit of code to implement it.
However, we made a few changes along the way. I pointed out to her that if you wrote just .....-----
, there isn't any way to tell if the writer meant 50
or eeeeettttt
. In reality, there is a pause in between each letter of each word and every word so this isn't a problem, but our scheme didn't have that. I pulled out some graph paper and suggested padding the Morse code for each symbol with another symbol to facilitate coding and remove ambiguity from the scheme. My nice suggested using +
because "no one ever writes those in sentences." (Ouch, I recently graduated with a math degree, but fair enough.)
Since some of us do write with +
, and we all use hyphens and periods/dots, which would conflict with our standard definition of Morse code, these symbols are replaced with p
, h
, and d
, respectively. Of course, this brings us to the problem of what to do with symbols that aren't defined in our extended Morse code. My niece wanted to simply ignore them, so that's what we did. For the sake of case sensitive preservation of the textual message, upper case letters aren't lower cased in the code; they're just carried through as is and padded with +
.
+
p
for +
, d
for .
, and h
for -
.In our algorithm, "Hello, World" translates to
H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++
and compresses to
H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+
Here is the Python code I threw together:
#!/usr/bin/python3
import itertools
import string
class MorseString(str):
def __init__(self, string):
# or, pad values during iteration but this seems neater
self._char_morse_map = {"a":".-+++", "b":"-...+", "c":"-.-.+", "d":"-..++", \
"e":".++++", "f":"..-.+", "g":"--.++", "h":"....+", \
"i":"..+++", "j":".---+", "k":"-.-++", "l":".-..+", \
"m":"--+++", "n":"-.+++", "o":"---++", "p":".--.+", \
"q":"--.-+", "r":".-.++", "s":"...++", "t":"-++++", \
"u":"..-++", "v":"...-+", "w":".--++", "x":"-..-+", \
"y":"-.--+", "z":"--..+", "1":".----", "2":"..---", \
"3":"...--", "4":"....-", "5":".....", "6":"-....", \
"7":"--...", "8":"---..", "9":"----.", "0":"-----",
" ":"+++++", ".":"d++++", "+":"p++++", "-":"h++++"}
self._morse_char_map = dict()
for letter, code in self._char_morse_map.items():
self._morse_char_map[code] = letter
self._string = string
# convert the string to "Morse code". Could also use c.lower()
self._morse_string = "".join([self._char_morse_map.get(c, c.ljust(5, "+")) for c in self._string])
def compress(self):
def grouper(n, k):
return str(n) + k if n > 1 else k
# could just use lambda
return "".join([grouper(len(list(g)), k) for k, g in itertools.groupby(self._morse_string)])
def decompress(self):
i = 0
start = 0
chars = list()
sc = self.compress()
while i < len(sc):
curr = sc[i]
i += 1
if not(curr in string.digits):
num = 1 if start + 1 == i else int(sc[start:i-1])
chars.append("".join(curr * num))
start = i
code = "".join(chars)
chars = list()
for i in range(0, len(code), 5):
piece = "".join(code[i:i+5])
chars.append(self._morse_char_map.get(piece, piece[0]))
return "".join(chars)
def main():
s0 = "Hello, World"
ms0 = MorseString(s0)
print(ms0._morse_string)
print(ms0.compress())
assert(s0 == ms0.decompress())
s1 = "Hello 2 world."
ms1 = MorseString(s1)
assert(s1 == ms1.decompress())
s2 = "The quick brown fox jumped over the lazy dog."
ms2 = MorseString(s2)
assert(s2 == ms2.decompress())
s3 = "abc -.+"
ms3 = MorseString(s3)
ms3
assert(s3 == ms3.decompress())
if __name__ == "__main__":
main()
What are some simple methods that would a) improve our algorithm, and b) be relatively easy to explain to my 8-year old niece? While the last point is clearly subjective, I'm nevertheless trying to indulge her curiosity as much as possible.
I welcome any improvements to the code as well since it's not structured terribly well (I'm fairly certain it's structured quite poorly, actually, but it's quick and dirty), although that's strictly for my benefit since I haven't gotten my niece to use Python (YET).
Here is an updated version of the code, which attempts to incorporate both user1884905's modifications to the algorithm and Karl's improvements to the code itself.
import itertools
import string
_char_morse_map = {"a":".-", "b":"-...", "c":"-.-.", "d":"-..", \
"e":".", "f":"..-.", "g":"--.", "h":"....", \
"i":"..", "j":".---", "k":"-.-", "l":".-..", \
"m":"--", "n":"-.", "o":"---", "p":".--.", \
"q":"--.-", "r":".-.", "s":"...", "t":"-", \
"u":"..-", "v":"...-", "w":".--", "x":"-..-", \
"y":"-.--", "z":"--..", "1":".----", "2":"..---", \
"3":"...--", "4":"....-", "5":".....", "6":"-....", \
"7":"--...", "8":"---..", "9":"----.", "0":"-----",
" ":"", ".":"d", "+":"p", "-":"h"}
_morse_char_map = {
code: letter
for letter, code in _char_morse_map.items()
}
def encode(s):
return "".join(_char_morse_map.get(c, c) + "+" for c in s)
def decode(encoded):
return "".join(decode_gen(encoded))
def decode_gen(encoded):
word = ""
for c in encoded:
if c != "+":
word += c
else:
yield _morse_char_map.get(word, word) if word != "" else " "
word = ""
def compress(s):
def grouper(n, k):
return str(n) + k if n > 1 else k
return "".join(grouper(len(list(g)), k) for k, g in itertools.groupby(s))
def decompress(compressed):
return "".join(decompress_gen(compressed))
def decompress_gen(compressed):
digits = ""
for c in compressed:
if c in string.digits:
digits += c
else:
number = int(digits) if digits else 1
yield "".join(c * number)
digits = ""
No. It can be proven that there is not even an algorithm to determine how well a perfect compressor will do. See Kolmogorov Complexity. Huffman coding (or arithmetic coding) by itself does not get close to the best compression.
Compression algorithms are normally used to reduce the size of a file without removing information. This can increase their entropy and make the files appear more random because all of the possible bytes become more common.
C is a great choice for writing a compression program. You can use plenty of other languages too, though.
A simple change that can be made to compress your output (at least in most cases) is to keep the idea of a pause between two letters and to let your +
-signs denote this pause (i.e. the pluses are used as a new-character-character instead of using them as padding).
This will make all numbers 0-9
one character longer but make quite a few commonly occurring letters shorter.
For example a
, e
, i
and t
will become .-+
, .+
, ..+
and -+
instead of .-+++
, .++++
, ..+++
and -++++
. (And space
can be denoted +
instead of +++++
)
So your Hello, World
example would become:
H+.+.-..+.-..+---+,++W+---+.-.+.-..+-..+
instead of
H++++.++++.-..+.-..+---++,+++++++++W++++---++.-.++.-..+-..++
and compress to
H+.+.-2.+.-2.+3-+,2+W+3-+.-.+.-2.+-2.+
instead of
H4+.4+.-2.+.-2.+3-2+,9+W4+3-2+.-.2+.-2.+-2.2+
Understanding this reasoning, Huffman encoding will only seem like the natural next step, where the basic idea is to make the most common characters take up as little amount of space as possible.
Edit:
See also this wikipedia-image of a dichotomatic search table showing the most commonly occurring characters close to the top of the tree.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With