I have a simulation that reads large binary data files that we create (10s to 100s of GB). We use binary for speed reasons. These files are system dependent, converted from text files on each system that we run, so I'm not concerned about portability. The files currently are many instances of a POD struct, written with fwrite.
I need to change the struct, so I want to add a header that has a file version number in it, which will be incremented anytime the struct changes. Since I'm doing this, I want to add some other information as well. I'm thinking of the size of the struct, byte order, and maybe the svn version number of the code that created the binary file. Is there anything else that would be useful to add?
In my experience, second-guessing the data you'll need is invariably wasted time. What's important is to structure your metadata in a way that is extensible. For XML files, that's straightforward, but binary files require a bit more thought.
I tend to store metadata in a structure at the END of the file, not the beginning. This has two advantages:
The simplest metadata footer I use looks something like this:
struct MetadataFooter{
char[40] creatorVersion;
char[40] creatorApplication;
.. or whatever
}
struct FileFooter
{
int64 metadataFooterSize; // = sizeof(MetadataFooter)
char[10] magicString; // a unique identifier for the format: maybe "MYFILEFMT"
};
After the raw data, the metadata footer and THEN the file footer are written.
When reading the file, seek to the end - sizeof(FileFooter). Read the footer, and verify the magicString. Then, seek back according to metadataFooterSize and read the metadata. Depending on the footer size contained in the file, you can use default values for missing fields.
As KeithB points out, you could even use this technique to store the metadata as an XML string, giving the advantages of both totally extensible metadata, with the compactness and speed of binary data.
As my experience with telecom equipment configuration and firmware upgrades shows you only really need several predefined bytes at the begin (this is important) which starts from version (fixed part of header). Rest of header is optional, by indicating proper version you can always show how to process it. Important thing here is you'd better place 'variable' part of header at the end of file. If you plan operations on header without modifying file content itself. Also this simplify 'append' operations which should recalculate variable header part.
Nice to have features for fixed size header (at the begin):
OK, for variable part XML or some pretty extensible format in header is good idea but is it really needed? I had lot of experience with ASN encoding... in most cases its usage was overshot.
Well, maybe you will have additional understanding when you look at things like TPKT format which is described in RFC 2126 (chapter 4.3).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With