What to put in a binary data file's header

Tags: c++, c, binaryfiles

I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from text files on each system that we run, so I'm not concerned about portability. The files currently are many instances of a POD struct, written with fwrite.

I need to change the struct, so I want to add a header that has a file version number in it, which will be incremented anytime the struct changes. Since I'm doing this, I want to add some other information as well. I'm thinking of the size of the struct, byte order, and maybe the svn version number of the code that created the binary file. Is there anything else that would be useful to add?
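A minimal sketch of a header carrying the fields listed above, assuming fixed-width integers are acceptable on every system involved. The names `FileHeader`, `make_header`, `check_header` and the marker value `0x01020304` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the names and the marker value are illustrative. */
#define BYTE_ORDER_MARK 0x01020304u

struct FileHeader {
    uint32_t byteOrderMark;  /* reads back as 0x04030201 on the other endianness */
    uint32_t formatVersion;  /* incremented whenever the record struct changes */
    uint32_t recordSize;     /* sizeof the POD struct that follows the header */
    uint32_t svnRevision;    /* revision of the code that wrote the file */
};

/* Build a header for the current format. */
struct FileHeader make_header(uint32_t version, uint32_t recordSize,
                              uint32_t svnRevision) {
    struct FileHeader h;
    assert(recordSize > 0);
    memset(&h, 0, sizeof h);
    h.byteOrderMark = BYTE_ORDER_MARK;
    h.formatVersion = version;
    h.recordSize    = recordSize;
    h.svnRevision   = svnRevision;
    return h;
}

/* Returns 0 if this build can read the file, otherwise a code naming
   the first mismatch: 1 = byte order, 2 = version, 3 = record size. */
int check_header(const struct FileHeader *h, uint32_t expectedVersion,
                 uint32_t expectedRecordSize) {
    if (h->byteOrderMark != BYTE_ORDER_MARK) return 1;
    if (h->formatVersion != expectedVersion) return 2;
    if (h->recordSize != expectedRecordSize) return 3;
    return 0;
}
```

The reader would `fread` one `FileHeader`, call `check_header` with the version and `sizeof` of the struct it was compiled with, and refuse the file on any nonzero result.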

KeithB asked Jan 06 '09 13:01


2 Answers

In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.

I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:

  • Truncated/unterminated files are easily detected.
  • Metadata footers can often be appended to existing files without impacting their reading code.

The simplest metadata footer I use looks something like this:

#include <stdint.h>

struct MetadataFooter {
  char creatorVersion[40];
  char creatorApplication[40];
  // ... or whatever
};

struct FileFooter {
  int64_t metadataFooterSize;  // = sizeof(MetadataFooter)
  char magicString[10];        // a unique identifier for the format: maybe "MYFILEFMT"
};

After the raw data, the metadata footer and THEN the file footer are written.

When reading the file, seek to the end - sizeof(FileFooter). Read the footer, and verify the magicString. Then, seek back according to metadataFooterSize and read the metadata. Depending on the footer size contained in the file, you can use default values for missing fields.
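The write and read sequence described above can be sketched roughly as follows. This assumes the same machine writes and reads the file (the structs are dumped raw, padding included), and the function names `write_footers` and `read_footers` are hypothetical. Treat it as an illustration of the idea, not a hardened implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAGIC "MYFILEFMT"   /* 9 chars + NUL fills magicString[10] */

struct MetadataFooter {
    char creatorVersion[40];
    char creatorApplication[40];
};

struct FileFooter {
    int64_t metadataFooterSize;
    char magicString[10];
};

/* Append the metadata footer, then the file footer, to a stream
   positioned just after the raw data. Returns 0 on success. */
int write_footers(FILE *f, const struct MetadataFooter *meta) {
    struct FileFooter ff;
    assert(sizeof MAGIC == sizeof ff.magicString);
    memset(&ff, 0, sizeof ff);               /* zero the padding bytes too */
    ff.metadataFooterSize = (int64_t)sizeof *meta;
    memcpy(ff.magicString, MAGIC, sizeof ff.magicString);
    if (fwrite(meta, sizeof *meta, 1, f) != 1) return -1;
    if (fwrite(&ff, sizeof ff, 1, f) != 1) return -1;
    return 0;
}

/* Read the metadata from the end of the file. Fields that an older,
   shorter footer lacks are left zeroed (the "default values").
   Returns 0 on success, -1 on I/O error or a missing magic string. */
int read_footers(FILE *f, struct MetadataFooter *meta) {
    struct FileFooter ff;
    if (fseek(f, -(long)sizeof ff, SEEK_END) != 0) return -1;
    if (fread(&ff, sizeof ff, 1, f) != 1) return -1;
    if (memcmp(ff.magicString, MAGIC, sizeof ff.magicString) != 0) return -1;
    if (ff.metadataFooterSize < 0) return -1;

    /* Seek back over the file footer and the stored metadata footer. */
    long back = (long)sizeof ff + (long)ff.metadataFooterSize;
    if (fseek(f, -back, SEEK_END) != 0) return -1;

    size_t stored = (size_t)ff.metadataFooterSize;
    size_t n = stored < sizeof *meta ? stored : sizeof *meta;
    memset(meta, 0, sizeof *meta);
    if (n > 0 && fread(meta, n, 1, f) != 1) return -1;
    return 0;
}
```

Because the reader copies at most `metadataFooterSize` bytes, a file written before a field was added to `MetadataFooter` still loads, with the newer fields zeroed. A file that was cut off mid-write fails the magic-string check or the seek.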

As KeithB points out, you could even use this technique to store the metadata as an XML string, which gives you fully extensible metadata together with the compactness and speed of binary data.

Roddy answered Sep 27 '22 02:09


In my experience with telecom equipment configuration and firmware upgrades, you only really need a few predefined bytes at the beginning of the file (this is important), starting with the version. This is the fixed part of the header. The rest of the header is optional: by indicating the proper version, you can always show how to process it. The important thing is that the 'variable' part of the header is better placed at the end of the file, especially if you plan to operate on the header without modifying the file content itself. This also simplifies 'append' operations, which have to recalculate the variable part of the header.

Nice-to-have features for the fixed-size header (at the beginning):

  • Common 'length' field (including header).
  • Something like CRC32 (including header).
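A rough sketch of such a fixed-size header, assuming the CRC covers the header (with its own CRC field zeroed) followed by the payload. The struct layout and the names `FixedHeader`, `seal_header` and `verify_header` are hypothetical. The checksum is the common reflected CRC-32 (polynomial `0xEDB88320`), computed bit by bit for brevity instead of with a lookup table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct FixedHeader {
    uint32_t version;      /* first field: tells the reader how to parse the rest */
    uint32_t crc32;        /* CRC over header (this field zeroed) plus payload */
    uint64_t totalLength;  /* header + payload, in bytes */
};

/* Reflected CRC-32; pass 0 to start, or a previous result to continue. */
uint32_t crc32_update(uint32_t crc, const void *buf, size_t len) {
    const unsigned char *p = buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Fill in length and CRC for a payload of len bytes. */
void seal_header(struct FixedHeader *h, uint32_t version,
                 const void *payload, size_t len) {
    assert(h != NULL);
    h->version = version;
    h->totalLength = (uint64_t)sizeof *h + len;
    h->crc32 = 0;
    uint32_t crc = crc32_update(0, h, sizeof *h);
    h->crc32 = crc32_update(crc, payload, len);
}

/* Returns 1 if length and CRC match the payload, 0 otherwise. */
int verify_header(const struct FixedHeader *h, const void *payload, size_t len) {
    struct FixedHeader tmp = *h;
    if (tmp.totalLength != (uint64_t)sizeof tmp + len) return 0;
    tmp.crc32 = 0;
    uint32_t crc = crc32_update(0, &tmp, sizeof tmp);
    return crc32_update(crc, payload, len) == h->crc32;
}
```

For files in the 10s to 100s of GB, a CRC over the whole payload means one extra pass over the data, so checksumming only the header or per-block checksums may be a more practical trade-off.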

OK, for the variable part, XML or some other fairly extensible format in the header is a good idea, but is it really needed? I have had a lot of experience with ASN.1 encoding... in most cases using it was overkill.

You may get a better feel for this by looking at something like the TPKT format, which is described in RFC 2126 (section 4.3).
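As I understand it, the TPKT header is four octets: a version octet (value 3), a reserved octet, and a 16-bit big-endian packet length that counts the header itself. A minimal decoding sketch under that assumption follows. The names `TpktHeader` and `tpkt_parse` are hypothetical, and the RFC remains the authority on the exact rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct TpktHeader {
    uint8_t  version;   /* 3 for TPKT */
    uint8_t  reserved;
    uint16_t length;    /* whole packet in octets, header included */
};

/* Decode the 4 header octets at buf. Returns 0 on success, -1 if the
   version is not 3 or the length is smaller than the header itself. */
int tpkt_parse(const unsigned char *buf, struct TpktHeader *out) {
    assert(buf != NULL && out != NULL);
    out->version  = buf[0];
    out->reserved = buf[1];
    out->length   = (uint16_t)(((unsigned)buf[2] << 8) | buf[3]);  /* big-endian */
    if (out->version != 3 || out->length < 4) return -1;
    return 0;
}
```

The same pattern carries over to a data file: a version in the very first byte, then a length, so that a reader can always decide how to interpret (or skip) whatever follows.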

Roman Nikitchenko answered Sep 23 '22 02:09