Making a program portable between machines that have a different number of bits in a "machine byte"

Tags:

c++

c

We are all fans of portable C/C++ programs.

We know that sizeof(char) or sizeof(unsigned char) is always 1 "byte". But that 1 "byte" doesn't mean a byte with 8 bits. It just means a "machine byte", and the number of bits in it can differ from machine to machine. See this question.


Suppose you write out the ASCII letter 'A' into a file foo.txt. On any normal machine these days, which has an 8-bit machine byte, these bits would get written out:

01000001

But if you were to run the same code on a machine with a 9-bit machine byte, I suppose these bits would get written out:

001000001

More to the point, the latter machine could write out, as a single machine byte, a 9-bit value that has no 8-bit equivalent:

100000000

But if we were to read this data on the former machine, we wouldn't be able to do it properly: the value simply doesn't fit in an 8-bit byte. We would have to read one machine byte (8 bits) and then somehow widen the final remaining bit into a full machine byte.


How can programmers properly reconcile these things?

The reason I ask is that I have a program that writes and reads files, and I want to make sure that it doesn't break 5, 10, 50 years from now.

asked Jan 18 '13 by Dennis Ritchie


1 Answer

How can programmers properly reconcile these things?

By doing nothing. You've presented a filesystem problem.

Imagine that dreadful day when the first of many 9-bit machines is booted up, ready to recompile your code and process that ASCII letter A that you wrote to a file last year.

To ensure that a C/C++ compiler can reasonably exist for this machine, this new computer's OS follows the same standards that C and C++ assume, where files have a size measured in bytes.

...There's already a little problem with your 8-bit source code. There's only about a 1-in-9 chance each source file is a size that can even exist on this system.

Or maybe not. As is often the case for me, Johannes Schaub - litb has pre-emptively cited the standard regarding valid formats for C++ source code.

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

"In an implementation-defined manner." That's good news...as long as some method exists to convert your source code to any 1:1 format that can be represented on this machine, you can compile it and run your program.

So here's where your real problem lies. If the creators of this computer were kind enough to provide a utility to bit-extend 8-bit ASCII files so they may be actually stored on this new machine, there's already no problem with the ASCII letter A you wrote long ago. And if there is no such utility, then your program already needs maintenance and there's nothing you could have done to prevent it.

Edit: The shorter answer (addressing comments that have since been deleted)

The question asks how to deal with a specific 9-bit computer...

  • With hardware that has no backwards-compatible 8-bit instructions
  • With an operating system that doesn't use "8-bit files"
  • With a C/C++ compiler that breaks how C/C++ programs have historically written text files

Damian Conway has an often-repeated quote comparing C++ to C:

"C++ tries to guard against Murphy, not Machiavelli."

He was describing other software engineers, not hardware engineers, but the reasoning carries over.

Both C and C++ are standardized in a way that requires you to presume that other engineers want to play nice. Your Machiavellian computer is not a threat to your program because it's a threat to C/C++ entirely.

Returning to your question:

How can programmers properly reconcile these things?

You really have two options.

  • Accept that the computer you describe would not be appropriate in the world of C/C++
  • Accept that C/C++ would not be appropriate for a program that might run on the computer you describe
answered Sep 30 '22 by Drew Dormann