The original MIX architecture has 6-bit bytes, and memory is addressed in 31-bit words (5 bytes plus a sign bit). As a thought exercise I'm wondering how the C language could function in this environment, given that the standard requires CHAR_BIT to be at least 8 and every object to occupy a whole number of chars, while the machine's natural units are 6-bit bytes and 31-bit words.
Of the approaches I can think of, that seems to leave only the first (wasteful) option: making char span a full 31-bit word, with all objects sized as multiples of char. Am I correct in reading it this way?
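(For illustration, here is my guess at what such a 31-bit-char implementation would report; the expected values in the comments are assumptions about the hypothetical compiler, and the program itself just prints whatever your host provides.)
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* On the hypothetical MIX implementation one would expect
       CHAR_BIT == 31 and sizeof(char) == sizeof(int) == 1,
       with every object's size a whole number of words. */
    printf("CHAR_BIT       = %d\n", CHAR_BIT);
    printf("sizeof(char)   = %zu\n", sizeof(char));
    printf("sizeof(int)    = %zu\n", sizeof(int));
    printf("sizeof(int[3]) = %zu\n", sizeof(int[3]));
    return 0;
}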
I agree that C on a MIX architecture could be a pain to implement and, though I'm not a language lawyer myself, it seems to me that you are right to single out your approach 1 as the only conforming one.
Anyway, the space waste for strings is the least of your problems: you can circumvent it by resorting to a solution older than C itself: making each char represent multiple letters. For the MIX architecture you could, for example, devise a 7-bit encoding and pack 4 letters into each 31-bit char:
#include <stdio.h>

int main(void) {
    // Hypothetical MIX compiler: each 31-bit char holds four 7-bit letters,
    // so a multi-character constant such as 'hell' fits in a single char.
    char hi[4];
    hi[0] = 'hell';
    hi[1] = 'o, w';
    hi[2] = 'orld';
    hi[3] = '\0';
    printf("%s", hi);
    // Whoops, we forgot the exclamation mark
    putchar('!\n');
    return 0;
}
This approach seems strange, but according to Wikipedia it was used in the first "Hello world" program ever written. I took a look at the Standard and found nothing preventing it, even in C11; in particular, § 6.4.4.4 allows character constants and string literals to be encoded in implementation-defined ways.
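Just to make the packing arithmetic concrete, here is a minimal sketch that computes such a packed constant by hand on an ordinary host; the pack4 helper and the most-significant-letter-first ordering are my own assumptions, not anything the hypothetical compiler would be required to use:
#include <stdio.h>

/* Pack four 7-bit ASCII letters into one value, first letter in the most
   significant position (an assumed ordering, purely for illustration). */
static unsigned long pack4(char a, char b, char c, char d)
{
    return ((unsigned long)(a & 0x7F) << 21) |
           ((unsigned long)(b & 0x7F) << 14) |
           ((unsigned long)(c & 0x7F) <<  7) |
            (unsigned long)(d & 0x7F);
}

int main(void)
{
    /* On the hypothetical MIX compiler this would simply be the constant 'hell'. */
    printf("'hell' packs to 0x%08lX\n", pack4('h', 'e', 'l', 'l'));
    return 0;
}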
EDIT:
This does not help against the other difficulties, though, the main one being your inability to use the vast majority of your machine's instructions, since you cannot address single MIX bytes with the native C types. You can, however, make use of bit-fields in this way:
typedef struct _bytes {
    unsigned int sign  : 1;
    unsigned int byte1 : 6;  // EDIT: bit-fields must be declared
    unsigned int byte2 : 6;  // as (unsigned) int in standard C
    unsigned int byte3 : 6;
    unsigned int byte4 : 6;
    unsigned int byte5 : 6;
} bytes;

typedef union _native_type {
    char as_word;
    int as_int;        // int = char; useful for standard library functions, etc.
    bytes as_bytes;
} native_type;
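As a quick usage sketch (my addition, assuming the bytes and native_type definitions above), a word can then be built up and inspected one MIX byte at a time; keep in mind that on a real host the bit-field layout is implementation-defined:
#include <stdio.h>

/* Assumes the bytes / native_type definitions from the previous snippet. */
static void dump_mix_word(native_type w)
{
    printf("sign=%u bytes=%u %u %u %u %u\n",
           w.as_bytes.sign,
           w.as_bytes.byte1, w.as_bytes.byte2, w.as_bytes.byte3,
           w.as_bytes.byte4, w.as_bytes.byte5);
}

int main(void)
{
    native_type w;
    w.as_bytes.sign  = 0;       /* e.g. the MIX word +[1, 2, 3, 4, 5] */
    w.as_bytes.byte1 = 1;
    w.as_bytes.byte2 = 2;
    w.as_bytes.byte3 = 3;
    w.as_bytes.byte4 = 4;
    w.as_bytes.byte5 = 5;
    dump_mix_word(w);
    return 0;
}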
Note that in C++, because of a clause in the strict aliasing rules, you must be careful to always access the char member between an access to int and one to bytes, since this snippet:
native_type a, b;
a.as_int = 0xC11BABE;
b.as_bytes.byte4 = a.as_bytes.byte4; // Whoops
would yield undefined behavior: see here for details.
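For completeness, the usual portable workaround is to copy through memcpy instead of punning through the union; this little sketch is my addition and only shows the standard alternative, with the caveat that the bit-field value you get back on a real host depends on its implementation-defined layout:
#include <stdio.h>
#include <string.h>

/* Assumes the bytes / native_type definitions from above. */
int main(void)
{
    native_type a;
    bytes b;

    a.as_int = 0xC11BABE;
    memcpy(&b, &a, sizeof b);   /* well-defined in both C and C++ */
    printf("byte4 = %u\n", (unsigned)b.byte4);
    return 0;
}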
The most practical approach would probably be to make int be 30 bits and have char be either 10 or 15 bits. Using ten bits for char would allow ASCII text to be packed more tightly, but would increase the cost of indexing into char arrays because of the required divide-by-three. Storage of Unicode text could be fairly efficient with either a 10- or a 15-bit char. With a 15-bit char, about 30720 code points would take 15 bits, and the remainder would take 30. With a 10-bit char, 128 code points would take 10 bits, 65408 would take 20, and the remainder would take 30.
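To see where the divide-by-three comes from, here is a small sketch of the address arithmetic a compiler would have to emit for every indexed access into an array of 10-bit chars packed three to a word; the type and function names are purely illustrative:
#include <stdint.h>

/* Illustrative only: a 10-bit char, packed three to a 30-bit word. */
typedef struct {
    uint32_t word;    /* which word the character lives in               */
    uint32_t offset;  /* which of the three 10-bit slots inside it: 0..2 */
} packed_char_addr;

/* Locating element i of a char array starting at word `base`
   costs a divide and a remainder by 3 on every access. */
packed_char_addr locate(uint32_t base, uint32_t i)
{
    packed_char_addr a;
    a.word   = base + i / 3;
    a.offset = i % 3;
    return a;
}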
To mitigate the cost of dividing by 3, it may be helpful to have each char* contain two words; one would identify the word containing the character, and the other would identify the offset, in characters, from the start of that word. Adding a constant offset to a pointer which was known to be normalized (offset always 0..2) could then use code like:
p += 5; // Becomes...
if (p.offset == 0) { p.offset = 2; p.base += 1; }
else               { p.offset--;  p.base += 2; }
Not great, but it would avoid any need for a "divide" step.
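Here is a slightly fuller sketch of that representation with a general addition routine; the names are mine, and the three-characters-per-word figure assumes the 10-bit char discussed above. A compiler could specialize additions of compile-time constants into branch sequences like the p += 5 example, falling back to the divide only for a variable amount:
#include <stdint.h>

/* A "fat" char pointer: one word selects the 30-bit word, the other
   the 10-bit slot within it. Kept normalized so offset is always 0..2. */
typedef struct {
    uint32_t base;    /* word address               */
    uint32_t offset;  /* character slot within word */
} fat_char_ptr;

/* General (non-constant) addition still needs the divide and remainder. */
fat_char_ptr fat_ptr_add(fat_char_ptr p, uint32_t n)
{
    uint32_t total = p.offset + n;
    p.base  += total / 3;
    p.offset = total % 3;
    return p;
}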