Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compress 21 Alphanumeric Characters in to 16 Bytes

Tags:

c++

algorithm

I'm trying to take 21 bytes of data which uniquely identifies a trade and store it in a 16 byte char array. I'm having trouble coming up with the right algorithm for this.

The trade ID which I'm trying to compress consists of 2 fields:

  1. 18 alphanumeric characters consisting of the ASCII characters 0x20 to 0x7E, Inclusive. (32-126)
  2. A 3-character numeric string "000" to "999"

So a C++ class that would encompass this data looks like this:

class ID
{
public:
    char trade_num_[18];
    char broker_[3];
};

This data needs to be stored in a 16-char data structure, which looks like this:

class Compressed
{
public:
    char sku_[16];    
};

I tried to take advantage of the fact that since the characters in trade_num_ are only 0-127 there was 1 unused bit in each character. Similarly, 999 in binary is 1111100111, which is only 10 bits -- 6 bits short of a 2-byte word. But when I work out how much I can squeeze this down, the smallest I can make it is 17 bytes; one byte too big.

Any ideas?

By the way, trade_num_ is a misnomer. It can contain letters and other characters. That's what the spec says.

EDIT: Sorry for the confusion. The trade_num_ field is indeed 18 bytes and not 16. After I posted this thread my internet connection died and I could not get back to this thread until just now.

EDIT2: I think it is safe to make an assumption about the dataset. For the trade_num_ field, we can assume that the non-printable ASCII characters 0-31 will not be present. Nor will ASCII codes 127 or 126 (~). All the others might be present, including upper and lower case letters, numbers and punctuations. This leaves a total of 94 characters in the set that trade_num_ will be comprised of, ASCII codes 32 through 125, inclusive.

like image 313
John Dibling Avatar asked Aug 05 '10 22:08

John Dibling


2 Answers

If you have 18 characters in the range 0 - 127 and a number in the range 0 - 999 and compact this as much as possible then it will require 17 bytes.

>>> math.log(128**18 * 1000, 256)
16.995723035582763

You may be able to take advantage of the fact that some characters are most likely not used. In particular it is unlikely that there are any characters below value 32, and 127 is also probably not used. If you can find one more unused character so you can first convert the characters into base 94 and then pack them into the bytes as closely as possible.

>>> math.log(94**18 * 1000, 256)
15.993547951857446

This just fits into 16 bytes!


Example code

Here is some example code written in Python (but written in a very imperative style so that it can easily be understood by non-Python programmers). I'm assuming that there are no tildes (~) in the input. If there are you should substitute them with another character before encoding the string.

def encodeChar(c):
    return ord(c) - 32

def encode(s, n):
    t = 0
    for c in s:
        t = t * 94 + encodeChar(c)
    t = t * 1000 + n

    r = []
    for i in range(16):
        r.append(int(t % 256))
        t /= 256

    return r

print encode('                  ', 0)    # smallest possible value
print encode('abcdefghijklmnopqr', 123)
print encode('}}}}}}}}}}}}}}}}}}', 999)  # largest possible value

Output:

[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0]
[ 59, 118, 192, 166, 108,  50, 131, 135, 174,  93,  87, 215, 177,  56, 170, 172]
[255, 255, 159, 243, 182, 100,  36, 102, 214, 109, 171,  77, 211, 183,   0, 247]

This algorithm uses Python's ability to handle very large numbers. To convert this code to C++ you could use a big integer library.

You will of course need an equivalent decoding function, the principle is the same - the operations are performed in reverse order.

like image 158
Mark Byers Avatar answered Sep 27 '22 02:09

Mark Byers


That makes (18*7+10)=136 bits, or 17 bytes. You wrote trade_num is alphanumeric? If that means the usual [a-zA-Z0-9_] set of characters, then you'd have only 6 bits per character, needing (18*6+10)=118 bit = 15 bytes for the whole thing.

Assuming 8 bit = 1 byte

Or, coming from another direction: You have 128 bits for storage, you need ~10 bits for the number part, so there are 118 left for the trade_num. 18 characters means 118/18=6.555 bits per characters, this means you can have only the space to encode 26.555 = 94 different characters **unless there is a hidden structure in trade_num that we could exploit to save more bits.

like image 44
Nordic Mainframe Avatar answered Sep 23 '22 02:09

Nordic Mainframe