We are all fans of portable C/C++ programs.
We know that sizeof(char) or sizeof(unsigned char) is always 1 "byte". But that 1 "byte" doesn't mean a byte with 8 bits. It just means a "machine byte", and the number of bits in it can differ from machine to machine. See this question.
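As a small sketch of the distinction between "1 byte" and "8 bits": the width of a machine byte is exposed as CHAR_BIT in <climits>, and the standard only guarantees that it is at least 8. The helper names bits_in and check_byte_model below are hypothetical, made up for illustration.

```cpp
#include <cassert>   // assert
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// sizeof counts machine bytes, so these hold on every conforming implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(unsigned char) == 1, "sizeof(unsigned char) is 1 by definition");

// The standard only promises a lower bound on the width of a byte.
static_assert(CHAR_BIT >= 8, "a conforming byte has at least 8 bits");

// Hypothetical helper (not a standard function): bits occupied by an object of type T.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}

// Hypothetical runtime spot check of the same facts.
inline void check_byte_model() {
    assert(bits_in<char>() == CHAR_BIT);
    assert(bits_in<unsigned char>() >= 8);
}
```

Code that genuinely depends on 8-bit bytes could add static_assert(CHAR_BIT == 8, "...") so that a port to an unusual machine fails at compile time rather than silently misbehaving.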
Suppose you write out the ASCII letter 'A' into a file foo.txt. On any normal machine these days, which has an 8-bit machine byte, these bits would get written out:
01000001
But if you were to run the same code on a machine with a 9-bit machine byte, I suppose these bits would get written out:
001000001
More to the point, the latter machine could write out these 9 bits as one machine byte:
100000000
But if we were to read this data on the former machine, we wouldn't be able to do it properly, since there isn't enough room. Somehow, we would have to first read one machine byte (8 bits), and then somehow transform the final 1 bit into 8 bits (a machine byte).
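To picture the mismatch, here is a rough simulation on an ordinary 8-bit host. It assumes each 9-bit machine byte is held in a std::uint16_t and that the bits are laid end to end, most significant bit first, with zero padding at the end. Both the layout and the name pack_9bit_units are assumptions made for illustration, not something any real 9-bit system is known to do.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical: pack a sequence of 9-bit units into 8-bit octets, MSB first.
// The final octet is padded with zero bits on the right.
std::vector<std::uint8_t> pack_9bit_units(const std::vector<std::uint16_t>& units) {
    std::vector<std::uint8_t> out;
    std::uint32_t acc = 0;  // bit accumulator, newest bits at the low end
    int bits = 0;           // number of valid bits currently held in acc
    for (std::uint16_t u : units) {
        assert(u < 512);    // each unit must fit in 9 bits
        acc = (acc << 9) | u;
        bits += 9;
        while (bits >= 8) {
            // Emit the oldest 8 bits still waiting in the accumulator.
            out.push_back(static_cast<std::uint8_t>((acc >> (bits - 8)) & 0xFFu));
            bits -= 8;
        }
        acc &= (1u << bits) - 1u;  // keep only the bits not yet emitted
    }
    if (bits > 0) {
        // Left-align the leftover bits in one last, zero-padded octet.
        out.push_back(static_cast<std::uint8_t>((acc << (8 - bits)) & 0xFFu));
    }
    return out;
}
```

Under these assumptions a single 9-bit byte always spills into two octets: the first carries the top 8 bits, and the second carries the remaining bit followed by seven padding bits.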
How can programmers properly reconcile these things?
The reason I ask is that I have a program that writes and reads files, and I want to make sure that it doesn't break 5, 10, 50 years from now.
How can programmers properly reconcile these things?
By doing nothing. You've presented a filesystem problem.
Imagine that dreadful day when the first of many 9-bit machines is booted up, ready to recompile your code and process that ASCII letter A that you wrote to a file last year.
To ensure that a C/C++ compiler can reasonably exist for this machine, this new computer's OS follows the same standards that C and C++ assume, where files have a size measured in bytes.
...There's already a little problem with your 8-bit source code. There's only about a 1-in-9 chance each source file is a size that can even exist on this system.
Or maybe not. As is often the case for me, Johannes Schaub - litb has pre-emptively cited the standard regarding valid formats for C++ source code.
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
"In an implementation-defined manner." That's good news...as long as some method exists to convert your source code to any 1:1 format that can be represented on this machine, you can compile it and run your program.
So here's where your real problem lies. If the creators of this computer were kind enough to provide a utility to bit-extend 8-bit ASCII files so they can actually be stored on this new machine, there's already no problem with the ASCII letter A you wrote long ago. And if there is no such utility, then your program already needs maintenance and there's nothing you could have done to prevent it.
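For a feel of what such a bit-extending utility might do, here is a minimal sketch simulated on an ordinary host. It assumes "bit-extend" means each 8-bit octet is stored in its own 9-bit byte with the extra high bit set to zero. Byte9, bit_extend and narrow are hypothetical names, and a real utility on a real 9-bit system could work quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one 9-bit machine byte, simulated with a wider type.
using Byte9 = std::uint16_t;

// Zero-extend every octet into its own 9-bit byte (the top bit stays 0).
std::vector<Byte9> bit_extend(const std::vector<std::uint8_t>& octets) {
    std::vector<Byte9> out;
    out.reserve(octets.size());
    for (std::uint8_t o : octets) {
        out.push_back(static_cast<Byte9>(o));
    }
    return out;
}

// Inverse direction: only valid when no byte uses its ninth bit.
std::vector<std::uint8_t> narrow(const std::vector<Byte9>& bytes) {
    std::vector<std::uint8_t> out;
    out.reserve(bytes.size());
    for (Byte9 b : bytes) {
        assert(b <= 0xFFu);  // data that came from 8-bit files always fits
        out.push_back(static_cast<std::uint8_t>(b));
    }
    return out;
}
```

With a 1:1 mapping like this, the value 0x41 (ASCII 'A') is the same number on both sides, so code that reads the file one char at a time would see the same letter.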
Edit: The shorter answer (addressing comments that have since been deleted)
The question asks how to deal with a specific 9-bit computer...
Damian Conway has an often-repeated quote comparing C++ to C:
"C++ tries to guard against Murphy, not Machiavelli."
He was describing other software engineers, not hardware engineers, but the intention is still sound because the reasoning is the same.
Both C and C++ are standardized in a way that requires you to presume that other engineers want to play nice. Your Machiavellian computer is not a threat to your program because it's a threat to C/C++ entirely.
Returning to your question:
How can programmers properly reconcile these things?
You really have two options.