Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient mapping for a particular finite integer set

I'm looking for a small, fast (in both directions) bijective mapping between the following list of integers and a subset of the range 0-127:

0x200C, 0x200D, 0x200E, 0x200F,
0x2013, 0x2014, 0x2015, 0x2017,
0x2018, 0x2019, 0x201A, 0x201C,
0x201D, 0x201E, 0x2020, 0x2021,
0x2022, 0x2026, 0x2030, 0x2039,
0x203A, 0x20AA, 0x20AB, 0x20AC,
0x20AF, 0x2116, 0x2122

One obvious solution is:

y = x>>2 & 0x40 | x & 0x3f;
x = 0x2000 | y<<2 & 0x100 | y & 0x3f;

Edit: I was missing some of the values, particularly 0x20Ax, which don't work with the above.

Another obvious solution is a lookup table, but without making it unnecessarily large, a lookup table would require some bit rearrangement anyway and I suspect the whole task can be better accomplished with simple bit rearrangement.

For the curious, those magic numbers are the only "large" Unicode codepoints that appear in legacy ISO-8859 and Windows codepages.

like image 937
R.. GitHub STOP HELPING ICE Avatar asked Feb 06 '11 06:02

R.. GitHub STOP HELPING ICE


1 Answers

This method uses multiplication in a finite field:

#define PRIME 0x119
#define OFFSET1 0x00f
#define OFFSET2 0x200c
#define OFFSET3 (OFFSET2 - OFFSET1)
#define MULTIPLIER 2
#define INVERSE 0x8d

unsigned map(unsigned n)
{
    return ((n - OFFSET3) * MULTIPLIER) % PRIME;
}

unsigned unmap(unsigned m)
{
    return ((m * INVERSE) + PRIME - OFFSET1) % PRIME + OFFSET2;
}

map() converts the unicode points to the unique 7 bit numbers, and unmap() does the reverse. Note that gcc at least is able to compile this to x86 code which does not use any division operations, since the modulus is a constant.

like image 136
caf Avatar answered Sep 24 '22 15:09

caf