Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PHP Huffman Decode Algorithm

I applied for a job recently and got sent a hackerrank exam with a couple of questions.One of them was a huffman decoding algorithm. There is a similar problem available here which explains the formatting alot better then I can.

The actual task was to take two arguments and return the decoded string.

The first argument is the codes, which is a string array like:

[
    "a      00",
    "b      101",
    "c      0111",
    "[newline]      1001"
]

Which is like: single character, two tabs, huffman code.

The newline was specified as being in this format due to the way that hacker rank is set up.

The second argument is a string to decode using the codes. For example:

101000111 = bac

This is my solution:

function decode($codes, $encoded) {
    $returnString = '';
    $codeArray = array();

    foreach($codes as $code) {
        sscanf($code, "%s\t\t%s", $letter, $code);
        if ($letter == "[newline]")
            $letter = "\n";
        $codeArray[$code] = $letter;
    }
    print_r($codeArray);

    $numbers = str_split($encoded);
    $searchCode = '';
    foreach ($numbers as $number) {
        $searchCode .= $number;
        if (isset($codeArray[$searchCode])) {
            $returnString .= $codeArray[$searchCode];
            $searchCode = '';
        }
    }

    return $returnString;
}

It passed the two initial tests but there were another five hidden tests which it did not pass and gave no feedback on.

I realize that this solution would not pass if the character was a white space so I tried a less optimal solution that used substr to get the first character and regex matching to get the number but this still passed the first two and failed the hidden five. I tried function in the hacker rank platform with white-space as input and the sandboxed environment could not handle it anyway so I reverted to the above solution as it was more elegant.

I tried the code with special characters, characters from other languages, codes of various sizes and it always returned the desired solution.

I am just frustrated that I could not find the cases that caused this to fail as I found this to be an elegant solution. I would love some feedback both on why this could fail given that there is no white-space and also any feedback on performance increases.

like image 699
Kieran Avatar asked Nov 07 '22 16:11

Kieran


1 Answers

Your basic approach is sound. Since a Huffman code is a prefix code, i.e. no code is a prefix of another, then if your search finds a match, then that must be the code. The second half of your code would work with any proper Huffman code and any message encoded using it.

Some comments. First, the example you provide is not a Huffman code, since the prefixes 010, 0110, 1000, and 11 are not present. Huffman codes are complete, whereas this prefix code is not.

This brings up a second issue, which is that you do not detect this error. You should be checking to see if $searchCode is empty after the end of your loop. If it is not, then the code was not complete, or a code ended in the middle. Either way, the message is corrupt with respect to the provided prefix code. Did the question specify what to do with errors?

The only real issue I would expect with this code is that you did not decode the code description generally enough. Did the question say there were always two tabs, or did you conclude that? Perhaps it was just any amount of space and tabs. Where there other character encodings you neeed to convert like [newline]? I presume you in fact did need to convert them, if one of the examples that worked contained one. Did it? Otherwise, maybe you weren't supposed to convert.

like image 192
Mark Adler Avatar answered Nov 14 '22 22:11

Mark Adler