
Data Compression: Arithmetic coding unclear

Can anyone please explain arithmetic encoding for data compression, with implementation details? I have searched the internet and found Mark Nelson's post, but after many hours of trying, the implementation technique is still unclear to me.

Mark Nelson's explanation of arithmetic coding can be found at

http://marknelson.us/1991/02/01/arithmetic-coding-statistical-modeling-data-compression/

asked Apr 13 '12 by Abhishek


2 Answers

Maybe this script could be useful to build a better mental model of an arithmetic coder: gen_map.py. Originally it was created to facilitate debugging of an arithmetic coder library and to simplify generation of unit tests for it. However, it creates nice ASCII visualizations that could also be useful in understanding arithmetic coding.

A small example. Imagine we have an alphabet of 3 symbols: 0, 1 and 2, with probabilities 1/10, 2/10 and 7/10 respectively. And we want to encode the sequence [1, 2]. The script gives the following output (ignore the -b N option for now):

$ ./gen_map.py -b 6 -m "1,2,7" -e "1,2"
000000111111|1111|111222222222222222222222222222222222222222222222
------011222|2222|222000011111111122222222222222222222222222222222
---------011|2222|222-------------00011111122222222222222222222222
------------|----|-------------------------00111122222222222222222
------------|----|-------------------------------01111222222222222
------------|----|------------------------------------011222222222
==================================================================
000000000000|0000|000000000000000011111111111111111111111111111111
000000000000|0000|111111111111111100000000000000001111111111111111
000000001111|1111|000000001111111100000000111111110000000011111111
000011110000|1111|000011110000111100001111000011110000111100001111
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
001100110011|0011|001100110011001100110011001100110011001100110011
010101010101|0101|010101010101010101010101010101010101010101010101

The first 6 lines (before the ==== line) represent the range from 0.0 to 1.0, recursively subdivided into intervals proportional to the symbol probabilities. The first line, annotated:

[1/10][     2/10    ][                 7/10                      ]
000000111111|1111|111222222222222222222222222222222222222222222222

Then we subdivide each interval again:

[ 0.1][     0.2     ][                 0.7                       ]
000000111111|1111|111222222222222222222222222222222222222222222222
         [   0.7    ][.1][   0.2 ][          0.7                 ]
------011222|2222|222000011111111122222222222222222222222222222222
                                  [.1][ .2][   0.7               ]  
---------011|2222|222-------------00011111122222222222222222222222

Note that some intervals are not subdivided. That happens when there is not enough space to represent every subinterval at the given precision (which is specified by the -b option).

Each line corresponds to a symbol from the input (in our case, the sequence [1, 2]). By following the subintervals for each input symbol we get the final interval that we want to encode with a minimal number of bits. In our case it's the 2 subinterval on the second line:

         [ This one ]
------011222|2222|222000011111111122222222222222222222222222222222

The following 7 lines (after ====) represent the same interval from 0.0 to 1.0, but subdivided according to binary notation. Each line is one bit of output, and by choosing between 0 and 1 you choose the left or right half-subinterval. For example, the bits 01 correspond to the subinterval [0.25, 0.5) on the second line:

                  [   This one   ]
000000000000|0000|111111111111111100000000000000001111111111111111
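
As a quick illustration, here is a tiny Python helper (hypothetical, not part of gen_map.py) that computes the subinterval a given bit string selects:

def bit_interval(bits):
    # Each bit halves the current interval: 0 keeps the left half,
    # 1 keeps the right half.
    lo, width = 0.0, 1.0
    for b in bits:
        width /= 2
        if b == "1":
            lo += width
    return lo, lo + width

print(bit_interval("01"))    # (0.25, 0.5)
print(bit_interval("0011"))  # (0.1875, 0.25)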

The idea of the arithmetic coder is to output bits (0 or 1) until the corresponding interval is entirely inside (or equal to) the interval determined by the input sequence. In our case it's 0011. The ~~~~ line shows where we have enough bits to unambiguously identify the interval we want.

The vertical lines formed by the | symbol show the range of bit sequences (rows) that could be used to encode the input sequence.
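
To tie this together, here is a minimal sketch in Python that reproduces the 0011 output for this example. It uses exact fractions for clarity, whereas gen_map.py works with the fixed precision given by -b:

from fractions import Fraction

def encode(seq, probs):
    # Cumulative probability boundaries, e.g. [0, 1/10, 3/10, 1].
    cum = [Fraction(0)]
    for p in probs:
        cum.append(cum[-1] + p)
    # Narrow [lo, hi) according to each input symbol.
    lo, hi = Fraction(0), Fraction(1)
    for s in seq:
        width = hi - lo
        lo, hi = lo + width * cum[s], lo + width * cum[s + 1]
    # Emit the bits of a point inside [lo, hi) until the binary
    # subinterval they pin down lies entirely inside [lo, hi).
    point = (lo + hi) / 2
    bits, blo, bhi = "", Fraction(0), Fraction(1)
    while not (lo <= blo and bhi <= hi):
        mid = (blo + bhi) / 2
        if point < mid:
            bits, bhi = bits + "0", mid
        else:
            bits, blo = bits + "1", mid
    return bits

print(encode([1, 2], [Fraction(1, 10), Fraction(2, 10), Fraction(7, 10)]))  # 0011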

answered Sep 21 '22 by wonder.mice


The main idea of arithmetic compression is its capability to code a probability using exactly the number of bits it requires.

This amount of data is known, as proven by Shannon, and can be calculated simply with the following formula: -log2(p) bits.

For example, if p=50%, then you need 1 bit. And if p=25%, you need 2 bits.

That's simple enough for probabilities that are powers of 1/2 (and in this special case, Huffman coding could be enough). But what if the probability is 63%? Then you need -log2(0.63) = 0.67 bits. Sounds tricky...

This property is especially important if your probability is high. If you can predict something with 95% accuracy, then you only need 0.074 bits to represent a good guess. Which means you are going to compress a lot.
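
You can check these numbers in a couple of lines of Python:

import math

for p in (0.50, 0.25, 0.63, 0.95):
    print(f"p={p}: {-math.log2(p):.3f} bits")
# p=0.5: 1.000 bits
# p=0.25: 2.000 bits
# p=0.63: 0.667 bits
# p=0.95: 0.074 bits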

Now, how to do that?

Well, it's simpler than it sounds. You divide your range in proportion to the probabilities. For example, if you have a range of 100 values, 2 possible events, and a probability of 95% for the 1st one, then the first 95 values will say "Event 1", and the remaining 5 values will say "Event 2".

OK, but on computers, we are accustomed to using powers of 2. For example, with 16 bits you have a range of 65536 possible values. Just do the same: take the first 95% of the range (which is 62259 values) to say "Event 1", and the rest to say "Event 2". There is obviously a "rounding" (precision) problem, but as long as you have enough values to distribute, it does not matter too much. Furthermore, you are not constrained to 2 events; you could have a myriad of events. All that matters is that values are allocated in proportion to the probabilities of each event.
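
In code, the split is just this (a sketch of the rounding described above):

RANGE = 1 << 16               # 65536 possible values
split = int(RANGE * 0.95)     # the first 62259 values mean "Event 1"
print(split, RANGE - split)   # 62259 3277 (the remaining values mean "Event 2")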

OK, but now I have 62259 possible values to say "Event 1", and 3277 to say "Event 2". Which one should I choose? Well, any of them will do. Whether it is 1, 30, 5500 or 62256, it still means "Event 1".

In fact, deciding which value to select will not depend on the current guess, but on the next ones.

Suppose I get "Event 1". So now I have to choose a value between 0 and 62258. On the next guess, I have the same distribution (95% Event 1, 5% Event 2). I simply allocate the distribution map with these probabilities, except that this time it is distributed over 62259 values. And we continue like this, reducing the range of values with each guess.

So in fact, we are defining "ranges", which narrow with each guess. At some point, however, there is a problem of accuracy, because very few values remain.

The idea is simply to "inflate" the range again. For example, each time the range goes below 32768 (2^15), you output the highest bit and multiply the rest by 2 (effectively shifting the values one bit to the left). By continuously doing this, you output bits one by one, as they are settled by the series of guesses.

Now the relation with compression becomes obvious: when the range narrows swiftly (ex: a 5% event), you output a lot of bits to get the range back above the limit. On the other hand, when the probability is very high, the range narrows very slowly, and you can even have a lot of guesses before outputting your first bits. That's how it is possible to compress an event to "a fraction of a bit".
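
Here is a toy Python sketch of that mechanism for a fixed 2-event model. It follows the classic low/high scheme, including the "straddling the middle" case that the description above glosses over; a real coder would also need an adaptive model and a matching decoder:

def encode(events, p1=0.95):
    # events: 0 means "Event 1" (probability p1), 1 means "Event 2".
    HALF, QUARTER = 1 << 15, 1 << 14
    low, high = 0, (1 << 16) - 1
    out, pending = [], 0

    def emit(bit):
        nonlocal pending
        out.append(bit)
        out.extend([1 - bit] * pending)   # flush deferred opposite bits
        pending = 0

    for e in events:
        span = high - low + 1
        split = low + int(span * p1) - 1  # last value meaning "Event 1"
        if e == 0:
            high = split                  # keep the first 95% of the range
        else:
            low = split + 1               # keep the remaining 5%
        while True:                       # "inflate" the range again
            if high < HALF:               # settled in lower half: bit is 0
                emit(0)
            elif low >= HALF:             # settled in upper half: bit is 1
                emit(1)
                low -= HALF; high -= HALF
            elif low >= QUARTER and high < HALF + QUARTER:
                pending += 1              # straddles the middle: defer the bit
                low -= QUARTER; high -= QUARTER
            else:
                break
            low, high = 2 * low, 2 * high + 1
    pending += 1                          # final bits to pin down the interval
    emit(0 if low < QUARTER else 1)
    return out

print(encode([0] * 20 + [1]))  # many cheap "Event 1"s, one expensive "Event 2"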

I've intentionally used the terms "probability", "guess" and "events" to keep this article generic. But for data compression, you just have to replace them with the way you want to model your data. For example, the next event could be the next byte; in this case, you have 256 of them.

answered Sep 17 '22 by Cyan