Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ASCII compressor works for short test file, not on long

The current project in Systems Programming is to come up with an ASCII compressor that removes the top zero bit and writes the contents to the file.

In order to facilitate decompression, the original file size is written to file, then the compressed char bytes. There are two files to run tests on- one that is 63 bytes long, and the other is 5344213 bytes. My code below works as expected for the first test file, as it writes 56 bytes of compressed text plus 4 bytes of file header.

However, when I try it on the long test file, the compressed version is 3 bytes shorter than the original, when it should be roughly 749KiB smaller, or 14% of original size. I've worked out the binary bit shift values for the first two write loops of the long test file, and they match up what is being recorded on my test printout.

while ( (characters= read(openReadFile, unpacked, BUFFER)) >0 ){
   unsigned char packed[7]; //compression storage
   int i, j, k, writeCount, endLength, endLoop;

    //loop through the buffer array
    for (i=0; i< characters-1; i++){
        j= i%7; 

        //fill up the compressed array
        packed[j]= packer(unpacked[i], unpacked[i+1], j);

        if (j == 6){
            writeCalls++; //track how many calls made

            writeCount= write(openWriteFile, packed, sizeof (packed));
            int packedSize= writeCount;
            for (k=0; k<7 && writeCalls < 10; k++)
                printf("%X ", (int)packed[k]);      

            totalWrittenBytes+= packedSize;
            printf(" %d\n", packedSize);
            memset(&packed[0], 0, sizeof(packed)); //clear array

            if (writeCount < 0)
                printOpenErrors(writeCount);
        }
        //end of buffer array loop
        endLength= characters-i;
        if (endLength < 7){

            for (endLoop=0; endLoop < endLength-1; endLoop++){
                packed[endLoop]= packer(unpacked[endLoop], unpacked[endLoop+1], endLoop);
            }

            packed[endLength]= calcEndBits(endLength, unpacked[endLength]);
        }
    } //end buffer array loop
} //end file read loop

The packer function:

//calculates the compressed byte value for the array
char packer(char i, char j, int k){
    char packStyle;

    switch(k){
        //shift bits based on mod value with 8
        case 0:
                packStyle= ((i & 0x7F) << 1) | ((j & 0x40) >> 6);
            break;
        case 1:
            packStyle= ((i & 0x3F) << 2) | ((j & 0x60) >> 5);
            break;
        case 2:
            packStyle= ((i & 0x1F) << 3) | ((j & 0x70) >> 4);
            break;
        case 3:
            packStyle= ((i & 0x0F) << 4) | ((j & 0x78) >> 3);
            break;
        case 4:
            packStyle= ((i & 0x07) << 5) | ((j & 0x7C) >> 2);
            break;
        case 5:
            packStyle= ((i & 0x03) << 6) | ((j & 0x7E) >> 1);
            break;
        case 6:
            packStyle= ( (i & 0x01 << 7) | (j & 0x7F));
            break;
    }

    return packStyle;
}

I've verified that there are 7 bytes written out every time the packed buffer is flushed, and there are 763458 write calls made for the long file, which match up to 5344206 bytes written.

I'm getting the same hex codes from the printout that I worked out in binary beforehand, and I can see the top bit of every byte removed. So why aren't the bit shifts being reflected in the results?

like image 703
Jason Avatar asked Mar 05 '11 04:03

Jason


People also ask

How does compression work on text files?

Text compression typically works by finding similar strings within a text file, and replacing those strings with a temporary binary representation to make the overall file size smaller.

What type of compression is used for text?

Text compression algorithms aim at statistical reductions in the volume of data. One commonly used compression algorithm is Huffman coding [Huf52], which makes use of information on the frequency of characters to assign variable-length codes to characters.

Can text files be compressed?

Computers can compress text in a similar way, by finding repeated sequences and replacing them with shorter representations. They don't need to worry about the end result sounding the same, like people do, so they can compress even further.

Can Unicode be compressed?

By switching to Unicode mode, non-alphabetic scripts can be encoded with two bytes per character. The compression scheme is capable of compressing strings containing any Unicode character.


1 Answers

Ok, since this is homework I'll just give you a few hints without giving out a solution.

First are you sure that the 56 bytes you get on the first file are the right bytes? Sure the count looks good, but you got lucky on count (proof is the second test file). I can immediately see at least two key mistakes in the code.

To make sure you have the right output, the byte count is not enough. You need to dig deeper. How about checking the bytes themselves one by one. 63 characters is not that much to go heh? There are many ways you can do this. You could use od (a pretty good Linux/Unix tool to look at the binary contents of files, if you're on Windows use some Hex editor). Or you could print out debug information from within your program.

Good luck.

like image 147
asoundmove Avatar answered Oct 24 '22 18:10

asoundmove