
"char*" with an unusual memory word size (Knuth's MIX architecture)

The original MIX architecture features 6-bit bytes, with memory addressed in 31-bit words (five bytes plus a sign bit). As a thought exercise I'm wondering how the C language can function in this environment, given:

  • char has at least 8 bits (CHAR_BIT >= 8; C99 §5.2.4.2.1, echoed in Annex E)
  • C99 spec section 6.3.2.3 ("Pointers") paragraph 8 says "When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object." My interpretation of this requirement is that it underpins "memcpy(&dst_obj, &src_obj, sizeof(src_obj))".

Approaches that I can think of:

  1. Make char 31 bits, so indirection through "char*" is simple memory access. But this makes strings wasteful (and rules out POSIX compliance, since POSIX apparently requires 8-bit chars)
  2. Pack three 8-bit chars into one word, with 7 ignored bits: "char*" might be composed of word address and char index within it. However this seems to violate 6.3.2.3, i.e. memcpy() would necessarily skip the ignored bits (which are probably meaningful for the real object type)
  3. Fully pack chars into words, e.g. the fourth 8-bit char would have 7 bits in word 0 and one bit in word 1. However this seems to require that all objects are sized in 8-bit chars, e.g. a "uint31_t" couldn't be declared to match the word length since this again has the memcpy() problem.

So that seems to leave the first (wasteful) option of using 31-bit chars with all objects sized as multiples of char - am I correct in reading it this way?

asked Mar 04 '15 by Software Tyro


2 Answers

I agree that C on a MIX architecture could be a pain to implement, and, though not a language lawyer myself, it seems to me that you are right to identify your approach 1 as the only conforming one.

Anyway, the space waste for strings is the least of your problems: you can work around it with a solution older than C itself: making each char represent multiple letters. For the MIX architecture you could, for example, devise a 7-bit encoding and pack four letters into each 31-bit char:

/* Multicharacter constants like 'hell' have an implementation-defined
 * value (C99 6.4.4.4), so this relies on the implementation's choice. */
char hi[4];
hi[0] = 'hell';
hi[1] = 'o, w';
hi[2] = 'orld';
hi[3] = '\0';

printf("%s", hi);

// Whoops, we forgot the exclamation mark
putchar('!\n');

This implementation seems strange, but according to Wikipedia a similar trick was used in the first "Hello world" program ever. I took a look at the Standard and found nothing forbidding it, even in C11: § 6.4.4.4 in particular allows encoding character constants and string literals in implementation-defined ways.

EDIT:

This does not help with the other difficulties, the main one being that you cannot address single machine bytes with the native C types, which leaves the wide majority of your machine's instructions unusable. You can, however, make use of bit-fields in this way:

typedef struct mix_bytes {      // tags starting with '_' are reserved,
    unsigned int sign  : 1;     // so avoid names like _bytes
    unsigned int byte1 : 6;     // bit-fields must be declared with
    unsigned int byte2 : 6;     // (unsigned) int type in standard C
    unsigned int byte3 : 6;
    unsigned int byte4 : 6;
    unsigned int byte5 : 6;
} bytes;

typedef union mix_native_type {
    char as_word;
    int as_int; // int has the same size as char here; useful for
                // standard library functions, etc.
    bytes as_bytes;
} native_type;

Note that in C++, unlike C, the strict aliasing rules make it undefined behavior to read a union member other than the one most recently written, so this snippet:

native_type a, b;
a.as_int = 0xC11BABE;
b.as_bytes.byte4 = a.as_bytes.byte4; // Whoops

would yield undefined behavior in C++ (in C, type punning through a union is permitted).

answered Sep 30 '22 by Alberto M


The most practical approach would probably be to make int 30 bits and char either 10 or 15 bits. Using ten bits for char would allow ASCII text to be packed more tightly, but would increase the cost of indexing into char arrays because of the required divide-by-three. Storage of Unicode text could be fairly efficient with either a 10- or a 15-bit char. With a 15-bit char, about 30720 code points would take 15 bits, and the remainder would take 30. With a 10-bit char, 128 code points would take 10 bits, 65408 would take 20, and the remainder would take 30.

To mitigate the cost of dividing by 3, it may be helpful to have each char* contain two words; one would identify the word containing the character, and the other would identify the offset, in characters, from the start of that word. Adding a constant offset to a pointer which was known to be normalized could use code like:

p += 5; // Becomes (with three chars per word)...
if (p.offset) { p.offset--; p.base+=2; }
else { p.offset=2; p.base+=1; }

Not great, but it would avoid any need for a "divide" step.

answered Sep 30 '22 by supercat