Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lempel-Ziv-Welch decompression non-existent index

I have an implementation off the LZW compression/decompression algorithm and for the most part have it squared away. However, I have encountered a problem with one of the files that I am testing. Below is the text for said file

#include "bits.h"

int check_endianness(){
    int i = 1;

The area that my implementation is getting stuck on is the group of spaces right before int i = 1; Below I have included my compression and decompression loop respectively along with their relative debug outputs.

compression loop

i=0;
while(i < input_len && ret == LZW_ERR_OK){
    //get next byte
    char curChar = input_char(f_input, &io_flags);
    i++;

    //not necessary to check for stream end here since the while condition does that
    if(io_flags == STREAM_ERR_READ){
        ret = LZW_ERR_READ;
        break;
    }

    seqset(&temp, &curChar, 1);

    //add bytes to temp until a sequence is found that is not in lookup
    while(i < input_len && dictionary_has_entry(lookup, temp)){
        char curChar = input_char(f_input, &io_flags);
        i++;

        //check for errors / end of file
        if(io_flags != STREAM_ERR_OK){
            if(io_flags == STREAM_ERR_READ)
                ret = LZW_ERR_READ;
            break;
        }

        seqadd(&temp, &curChar, 1);
    }

    if(temp.length > 1){
        #ifdef DEBUG
        printf("Adding entry %d: ", lookup->next_value);
        for(int j = 0; j < temp.length; j++)
            printf("%.4x ", temp.data[j]);
        printf("\n");
        #endif // DEBUG
        dictionary_set_entry(lookup, temp, DICT_SET_FOREVER);
        temp.length--; //if temp.length == 1, then the EOF was probably reached
        i--; //This is important so that the entire file is read
    }

    fseek(f_input, -1, SEEK_CUR);
    output_short(f_output, dictionary_get_entry_byKey(lookup, temp)->value, STREAM_USE_ENDIAN);
    #ifdef DEBUG
    printf("Writing code: %d\n", dictionary_get_entry_byKey(lookup, temp)->value);
    #endif // DEBUG
}

compression output

Adding entry 297: 007b 000d
Writing code: 123
Adding entry 298: 000d 000a 0020
Writing code: 273
Adding entry 299: 0020 0020
Writing code: 32
Adding entry 300: 0020 0020 0020
Writing code: 299
Adding entry 301: 0020 0069
Writing code: 32
Adding entry 302: 0069 006e 0074 0020

decompression loop

i=0;
while(i < input_len && ret == LZW_ERR_OK){
    short code;
    entry *e;
    code = input_short(f_input, &io_flags, STREAM_USE_ENDIAN);
    if(io_flags == STREAM_ERR_READ){
        ret = LZW_ERR_READ;
        break;
    }

    i += 2;

    //e should never be NULL
    printf("Reading code: %d\n", code);
    e = dictionary_get_entry_byValue(lookup, code);
    assert(e != NULL);

    seqset(&temp, e->key.data, e->key.length);

    //requires a slightly different approach to the lookup loop in lzw_encode
    while(i < input_len && e != NULL){
        code = input_short(f_input, &io_flags, STREAM_USE_ENDIAN);
        //check for errors / end of file
        if(io_flags != STREAM_ERR_OK){
            if(io_flags == STREAM_ERR_READ)
                ret = LZW_ERR_READ;
            break;
        }
        i += 2;
        printf("Reading code: %d\n", code);

        //e should never be NULL
        e = dictionary_get_entry_byValue(lookup, code);
        assert(e != NULL); <------------ This is where it is failing

        //start adding bytes to temp
        for(size_t j = 0; j < e->key.length; j++){
            seqadd(&temp, &e->key.data[j], 1);
            if(dictionary_get_entry_byKey(lookup, temp) == NULL){

                //sequence not found, add it to dictionary
                printf("Adding entry %d: ", lookup->next_value);
                dictionary_set_entry(lookup, temp, DICT_SET_FOREVER);
                for(int k = 0; k < temp.length; k++)
                    printf("%.4x ", temp.data[k]);
                printf("\n");
                e = NULL; //to escape from while
                break;
            }
        }
    }

    //if more than one code was read go back by two bytes
    if(e == NULL){
        i -= 2;
        fseek(f_input, -2, SEEK_CUR);
    }

    //only write up to the characters that made a known sequence
    temp.length--;
    for(size_t j = 0; j < temp.length; j++){
        output_char(f_output, temp.data[j]);
        #ifdef DEBUG
        //printf("%c", temp.data[j]);
        #endif // DEBUG
    }
}

decompression output

Reading code: 123
Reading code: 273
Adding entry 297: 007b 000d
Reading code: 273
Reading code: 32
Adding entry 298: 000d 000a 0020
Reading code: 32
Reading code: 299
299, 299 <----error output from dictionary (code asked > next value)
Assertion failed: e != NULL, file lzw.c, line 212

Any help would be much appreciated.

like image 740
Marcos Avatar asked Feb 09 '17 07:02

Marcos


People also ask

What is Lempel-Ziv code?

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978.

How does Lempel-Ziv coding accomplished?

The Lempel-Ziv algorithm consists of a rule for parsing strings of symbols from a finite alphabet into substrings, or words, whose lengths do not exceed a prescribed integer L (1); and a coding scheme which maps these substrings sequentially into uniquely decipherable codewords of fixed length L (2) [Ziv and Lempel ...

How does LZW algorithm work?

A particular LZW compression algorithm takes each input sequence of bits of a given length (for example, 12 bits) and creates an entry in a table (sometimes called a "dictionary" or "codebook") for that particular bit pattern, consisting of the pattern itself and a shorter code.

What is the full form of LZW?

Full name. LZW (Lempel-Ziv-Welch) Image Compression Encoding. Description. A lossless compression algorithm for digital data of many kinds, named for the creators Abraham Lempel and Jacob Ziv, and a later contributor, Terry Welch. LZW is based on a translation table that maps strings of input characters into codes.


1 Answers

You hit the infamous KwKwK problem in the Lempel Ziv Welch decompression algorithm.

From the original article, A Technique for High-Performance Data Compression, Terry A. Welch, IEEE Computer, June 1984, pp. 8-19:

The abnormal case occurs whenever an input character string contains the sequence KwKwK, where Kw already appears in the compressor string table. The compressor will parse out Kw, send CODE(Kw), and add KwK to its string table. Next it will parse out KwK and send the just-generated CODE(KwK). The decompressor, on receiving CODE(KwK), will not yet have added that code to its string table because it does not yet know an extension character for the prior string.

The article explains how to work around this issue.

like image 136
chqrlie Avatar answered Sep 30 '22 02:09

chqrlie