I have many unique numbers, all positive and the order doesn't matter, 0 < num < 2^32
.
Example: 23 56 24 26
The biggest, 56
, needs 6 bits
space. So, I need: 4*6 = 24 bits
in total.
I do the following to save space:
I sort them first: 23 24 26 56
(because the order doesn't matter)
Now I get the difference of each from the previous: 23 1 2 30
The biggest, 30
, needs 5 bits
space.
After this I store all the numbers in 4*5 bits = 20 bits
space.
Question: how to further improve this algorithm?
More information: Since requested, the numbers are mostly on the range of 2.000-4.000
. Numbers less than 300
are pretty rare. Numbers more than 16.000
are pretty rare also. Generally speaking, all the numbers will be close. For example, they may be all in the 1.000-2.000
range or they may all be in the 16.000-20.000
range. The total number of numbers will be something in the range of 500-5.000
.
Your first step is good one to take because sorting reduces the differences to least. Here is a way to improve your algorithm:
Use of Huffman coding is more important then your step; I'll show you why:
consider the following data:
1 2 3 4 5 6 7 4294967295
where 4294967295 = 2^32-1. Using your algorithm:
1 1 1 1 1 1 1 4294967288
total bits needed is still 32*8
Using Huffman coding, the frequencies are:
1 => 7
4294967288 => 1
Huffman codes are 1 => 0
and 4294967288 => 1
Total bits needed = 7*1 + 1 = 8 bits
Huffman coding reduces size by 32*8/8 = 32 times
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With