
What are important points when designing a (binary) file format? [closed]

When designing a file format for recording binary data, what attributes would you think the format should have? So far, I've come up with the following important points:

  • have some "magic bytes" at the beginning, to be able to recognize the files (in my specific case, this should also help to distinguish the files from "legacy" files)
  • have a file version number at the beginning, so that the file format can be changed later without breaking compatibility
  • specify the endianness and size of all data items; or: include some space to describe endianness/size of data (I would tend towards the former)
  • possibly reserve some space for further per-file attributes that might be necessary in the future?

What else would be useful to make the format more future-proof and minimize headache in the future?

oliver asked Nov 27 '08 12:11


2 Answers

Take a look at the PNG spec. This format has some very good rationale behind it.

Also, decide what matters most for your format: compactness, compatibility, or the ability to embed other formats (for example, different compression algorithms) inside it. Another interesting example is Google's Protocol Buffers, where the size of the transferred data is the top priority.

As for endianness, I'd suggest picking one byte order and sticking with it rather than allowing both. Otherwise, the libraries that read and write the format only get more complex and slower.
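To illustrate the "pick one byte order" advice, here is a minimal sketch in Python. The magic bytes, the header layout, and the names `MAGIC`, `HEADER`, `pack_header` and `unpack_header` are made up for this example and do not come from any real format. The `<` prefix in the `struct` format string fixes little-endian order and standard sizes, so the bytes on disk are the same on every platform.

```python
import struct

# Hypothetical 8-byte signature for an imaginary "DEMO" format.
MAGIC = b"\x89DEMO\r\n\x1a"

# "<" = little-endian, standard sizes, no padding.
# Fields: 8-byte magic, uint16 major version, uint16 minor version.
HEADER = struct.Struct("<8sHH")


def pack_header(major: int, minor: int) -> bytes:
    """Serialize the fixed-size header in the one byte order the format allows."""
    return HEADER.pack(MAGIC, major, minor)


def unpack_header(data: bytes) -> tuple[int, int]:
    """Check the magic bytes and return (major, minor)."""
    magic, major, minor = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a DEMO file (bad magic bytes)")
    return major, minor
```

Because the byte order is part of the format string, the reader never has to branch on an endianness flag.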

Stepan Stolyarov answered Sep 23 '22 13:09


I agree that these are good ideas:

  1. Magic numbers at the beginning. Pretty much required on *nix, where tools such as file(1) identify file types by their leading bytes.

  2. File version number for backwards compatibility.

  3. Endianness specification.

But your fourth one is overkill:

  • possibly reserve some space for further per-file attributes that might be necessary in the future?

Point 2 already lets you add fields, as long as you change the version number (and as long as you don't need forward compatibility).
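As a rough sketch of how a version number can replace reserved space, the reader below accepts two versions of a hypothetical header, where version 2 appends a field that version 1 lacks. The field names (`sample_count`, `sample_rate`) and the layout are assumptions invented for the example.

```python
import struct

# Hypothetical layouts, little-endian throughout.
V1_FIELDS = struct.Struct("<HI")  # uint16 version, uint32 sample_count
V2_EXTRA = struct.Struct("<d")    # float64 sample_rate, added in version 2


def read_header(buf: bytes) -> dict:
    """Parse a header, dispatching on the version stored in the file."""
    version, sample_count = V1_FIELDS.unpack_from(buf, 0)
    header = {"version": version, "sample_count": sample_count, "sample_rate": None}
    if version == 1:
        return header
    if version == 2:
        # The new field simply follows the version 1 fields.
        (header["sample_rate"],) = V2_EXTRA.unpack_from(buf, V1_FIELDS.size)
        return header
    # A newer file than this reader knows about: fail loudly, don't guess.
    raise ValueError(f"unsupported file version {version}")
```

This gives backward compatibility only: a version 1 reader rejects version 2 files, which is the trade-off mentioned in the parenthesis above.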

Also, the idea of imposing a block structure on your file, expressed in many other answers, seems less like a universal requirement for binary files than a solution to a problem with certain kinds of payloads.

In addition to 1-3 above, I'd add these:

  • A simple checksum or some other way of detecting that the contents are intact. Otherwise you can't trust the magic bytes or version numbers. Be careful to specify which bytes are included in the checksum. Typically you would include all bytes in the file that don't already have error detection.

  • The version of your software that wrote the file, down to the most granular number you have (e.g. the build number). You're going to get a bug report with an attached file from someone who can't open it, and they will have no clue when they wrote the file, because the error didn't occur then. But the bug is in the version that wrote it, not in the one trying to read it.

  • Make it clear in the spec that this is a binary format, i.e. all values 0-255 are allowed for all bytes (except the magic numbers).
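The checksum and writer-version points might look something like the sketch below. It assumes CRC-32 (via Python's `zlib.crc32`) as the "simple checksum" and a layout invented for this example: a one-byte length, the writer version string, the payload, then a trailing little-endian CRC over everything before it. The function names are hypothetical.

```python
import struct
import zlib


def build_file(writer_version: str, payload: bytes) -> bytes:
    """Prefix the payload with the writer's version, append a CRC-32 of all of it."""
    tag = writer_version.encode("utf-8")
    if len(tag) > 255:
        raise ValueError("writer version string too long for a 1-byte length")
    body = struct.pack("<B", len(tag)) + tag + payload
    # The checksum covers every byte of the body; the spec should say so explicitly.
    return body + struct.pack("<I", zlib.crc32(body))


def parse_file(blob: bytes) -> tuple[str, bytes]:
    """Verify the checksum first, then return (writer_version, payload)."""
    if len(blob) < 5:
        raise ValueError("file too short")
    body, (stored,) = blob[:-4], struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != stored:
        raise ValueError("checksum mismatch: file is corrupt or truncated")
    tag_len = body[0]
    return body[1:1 + tag_len].decode("utf-8"), body[1 + tag_len:]
```

Note that the payload may contain any byte values from 0 to 255; nothing here treats the data as text except the version string.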

And here are some optional ones:

  • If you do need forward compatibility, you need some way of expressing which "chunks" are "optional" (like PNG does), so that a previous version of your software can skip over them gracefully.

  • If you expect these files to be found "in the wild", you might consider embedding some clue to find the spec. Imagine how helpful it would be to find the string http://www.w3.org/TR/PNG/ in a PNG file.
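For the forward-compatibility point, one possible shape is a sequence of length-prefixed chunks with an "optional" flag. This is only a sketch: the chunk header layout, the `FLAG_OPTIONAL` bit and the `DATA` chunk type are assumptions made up here (PNG itself encodes this property differently, in the chunk type name).

```python
import struct

# Hypothetical chunk header: 4-byte type, uint8 flags, uint32 payload length.
CHUNK_HEAD = struct.Struct("<4sBI")
FLAG_OPTIONAL = 0x01
KNOWN_TYPES = {b"DATA"}  # chunk types this (older) reader understands


def make_chunk(ctype: bytes, payload: bytes, optional: bool = False) -> bytes:
    flags = FLAG_OPTIONAL if optional else 0
    return CHUNK_HEAD.pack(ctype, flags, len(payload)) + payload


def read_chunks(buf: bytes) -> list[tuple[bytes, bytes]]:
    """Return known chunks, skip unknown optional ones, reject unknown required ones."""
    offset, chunks = 0, []
    while offset < len(buf):
        ctype, flags, length = CHUNK_HEAD.unpack_from(buf, offset)
        offset += CHUNK_HEAD.size
        payload = buf[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated chunk")
        offset += length
        if ctype in KNOWN_TYPES:
            chunks.append((ctype, payload))
        elif not flags & FLAG_OPTIONAL:
            raise ValueError(f"unknown required chunk {ctype!r}")
        # Unknown but optional: the length field lets us step over it.
    return chunks
```

The length prefix is what makes skipping possible: an old reader does not need to understand a chunk's contents to find where the next one starts.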

Bart answered Sep 22 '22 13:09