Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Programming novice: How to program my own data compression algorithm?

It is summer, and so I have decided to take it upon myself to write a data-compression program, preferably in C code. I have a decent beginners understanding of how compression works. I just have a few questions:

1) Would c be a suitable programming language to accomplish this task?
2) Should I be working in byte's with the input file? Or at a binary level somehow?

If someone could just give me a nudge in the correct direction, I'd really appreciate it. I would like to code this myself however, and not use a pre-existing compression library or anything like that.

like image 650
araisbec Avatar asked May 24 '11 17:05

araisbec


People also ask

What kind of coding is used for compressing data?

Most forms of lossy compression are based on transform coding, especially the discrete cosine transform (DCT).

Which algorithm is best for data compression?

The fastest algorithm, lz4, results in lower compression ratios; xz, which has the highest compression ratio, suffers from a slow compression speed. However, Zstandard, at the default setting, shows substantial improvements in both compression speed and decompression speed, while compressing at the same ratio as zlib.


2 Answers

You could start by looking at Huffman Encoding. A lot of computer science classes implement that as a project so it should be manageable. C would be appropriate for Huffman encoding, but it might be easier to do it first in a higher-level language so that you understand the concepts.There are slides, hints, and an example project available in Java for a masters-level project at the University of Pennsylvania (search for "huff" on that page).

like image 145
Brian Lyttle Avatar answered Sep 28 '22 20:09

Brian Lyttle


To answer your questions:

  1. C is suitable.
  2. It depends on the algorithm, or the way you are thinking about `compression'.

My opinion will be, first decide whether you want to do a lossless compression or a lossy compression, then pick an algorithm to implement. Here are a few pointers:

For the lossless one, some are very intuitive, such as the run-length encoding, e.g., if there is 11 as and 5 bs, you just encode them as 11a5b. Some algorithms use a dictionary, please refer to LZW encoding. Finally, I do recommend Huffman encoding since it is very straight-forward, simple and helpful to gain experience in learning algorithm (for your educational purpose).

For lossy ones, Discrete Fourier Transform (DFT), or wavelet, is used in JPEG compression. This is useful to understand multimedia compression.

Wikipedia page is a good starting point.

like image 35
Ivan Xiao Avatar answered Sep 28 '22 18:09

Ivan Xiao