I have an implementation off the LZW compression/decompression algorithm and for the most part have it squared away. However, I have encountered a problem with one of the files that I am testing. Below is the text for said file
#include "bits.h"
int check_endianness(){
int i = 1;
The area that my implementation is getting stuck on is the group of spaces right before int i = 1;
Below I have included my compression and decompression loop respectively along with their relative debug outputs.
compression loop
i=0;
while(i < input_len && ret == LZW_ERR_OK){
//get next byte
char curChar = input_char(f_input, &io_flags);
i++;
//not necessary to check for stream end here since the while condition does that
if(io_flags == STREAM_ERR_READ){
ret = LZW_ERR_READ;
break;
}
seqset(&temp, &curChar, 1);
//add bytes to temp until a sequence is found that is not in lookup
while(i < input_len && dictionary_has_entry(lookup, temp)){
char curChar = input_char(f_input, &io_flags);
i++;
//check for errors / end of file
if(io_flags != STREAM_ERR_OK){
if(io_flags == STREAM_ERR_READ)
ret = LZW_ERR_READ;
break;
}
seqadd(&temp, &curChar, 1);
}
if(temp.length > 1){
#ifdef DEBUG
printf("Adding entry %d: ", lookup->next_value);
for(int j = 0; j < temp.length; j++)
printf("%.4x ", temp.data[j]);
printf("\n");
#endif // DEBUG
dictionary_set_entry(lookup, temp, DICT_SET_FOREVER);
temp.length--; //if temp.length == 1, then the EOF was probably reached
i--; //This is important so that the entire file is read
}
fseek(f_input, -1, SEEK_CUR);
output_short(f_output, dictionary_get_entry_byKey(lookup, temp)->value, STREAM_USE_ENDIAN);
#ifdef DEBUG
printf("Writing code: %d\n", dictionary_get_entry_byKey(lookup, temp)->value);
#endif // DEBUG
}
compression output
Adding entry 297: 007b 000d
Writing code: 123
Adding entry 298: 000d 000a 0020
Writing code: 273
Adding entry 299: 0020 0020
Writing code: 32
Adding entry 300: 0020 0020 0020
Writing code: 299
Adding entry 301: 0020 0069
Writing code: 32
Adding entry 302: 0069 006e 0074 0020
decompression loop
i=0;
while(i < input_len && ret == LZW_ERR_OK){
short code;
entry *e;
code = input_short(f_input, &io_flags, STREAM_USE_ENDIAN);
if(io_flags == STREAM_ERR_READ){
ret = LZW_ERR_READ;
break;
}
i += 2;
//e should never be NULL
printf("Reading code: %d\n", code);
e = dictionary_get_entry_byValue(lookup, code);
assert(e != NULL);
seqset(&temp, e->key.data, e->key.length);
//requires a slightly different approach to the lookup loop in lzw_encode
while(i < input_len && e != NULL){
code = input_short(f_input, &io_flags, STREAM_USE_ENDIAN);
//check for errors / end of file
if(io_flags != STREAM_ERR_OK){
if(io_flags == STREAM_ERR_READ)
ret = LZW_ERR_READ;
break;
}
i += 2;
printf("Reading code: %d\n", code);
//e should never be NULL
e = dictionary_get_entry_byValue(lookup, code);
assert(e != NULL); <------------ This is where it is failing
//start adding bytes to temp
for(size_t j = 0; j < e->key.length; j++){
seqadd(&temp, &e->key.data[j], 1);
if(dictionary_get_entry_byKey(lookup, temp) == NULL){
//sequence not found, add it to dictionary
printf("Adding entry %d: ", lookup->next_value);
dictionary_set_entry(lookup, temp, DICT_SET_FOREVER);
for(int k = 0; k < temp.length; k++)
printf("%.4x ", temp.data[k]);
printf("\n");
e = NULL; //to escape from while
break;
}
}
}
//if more than one code was read go back by two bytes
if(e == NULL){
i -= 2;
fseek(f_input, -2, SEEK_CUR);
}
//only write up to the characters that made a known sequence
temp.length--;
for(size_t j = 0; j < temp.length; j++){
output_char(f_output, temp.data[j]);
#ifdef DEBUG
//printf("%c", temp.data[j]);
#endif // DEBUG
}
}
decompression output
Reading code: 123
Reading code: 273
Adding entry 297: 007b 000d
Reading code: 273
Reading code: 32
Adding entry 298: 000d 000a 0020
Reading code: 32
Reading code: 299
299, 299 <----error output from dictionary (code asked > next value)
Assertion failed: e != NULL, file lzw.c, line 212
Any help would be much appreciated.
Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978.
The Lempel-Ziv algorithm consists of a rule for parsing strings of symbols from a finite alphabet into substrings, or words, whose lengths do not exceed a prescribed integer L (1); and a coding scheme which maps these substrings sequentially into uniquely decipherable codewords of fixed length L (2) [Ziv and Lempel ...
A particular LZW compression algorithm takes each input sequence of bits of a given length (for example, 12 bits) and creates an entry in a table (sometimes called a "dictionary" or "codebook") for that particular bit pattern, consisting of the pattern itself and a shorter code.
Full name. LZW (Lempel-Ziv-Welch) Image Compression Encoding. Description. A lossless compression algorithm for digital data of many kinds, named for the creators Abraham Lempel and Jacob Ziv, and a later contributor, Terry Welch. LZW is based on a translation table that maps strings of input characters into codes.
You hit the infamous KwKwK problem in the Lempel Ziv Welch decompression algorithm.
From the original article, A Technique for High-Performance Data Compression, Terry A. Welch, IEEE Computer, June 1984, pp. 8-19:
The abnormal case occurs whenever an input character string contains the sequence KwKwK, where Kw already appears in the compressor string table. The compressor will parse out Kw, send CODE(Kw), and add KwK to its string table. Next it will parse out KwK and send the just-generated CODE(KwK). The decompressor, on receiving CODE(KwK), will not yet have added that code to its string table because it does not yet know an extension character for the prior string.
The article explains how to work around this issue.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With