
Create file header (metadata of file) in C

Tags: c, file, file-io

A file header contains all the data about the file (its metadata). I want to create a blank file with metadata, then add other file content to this blank file and modify the metadata as needed. Is there any library in C for creating a file header? How do I read and write a file header in C?

metadata = {
    file_name;
    file_size;
    file_type;
    file_name_size;
    total_files;
}
Nimit asked Mar 09 '12 05:03


1 Answer

There are probably a number of libraries that handle specific file formats, such as the variations on tar, but not one that will be adapted to your particular header format.

You will need to decide, first, whether your metadata is of a fixed or variable size.

If it is a fixed size, it is relatively easy to skip over that many bytes at the start, write the rest of the file, and then rewind and fill in the metadata. If the only variable size parts are known at the start, you can handle it much the same way - write the first version then go back when you're done and write the final version.
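
The fixed-size case can be sketched roughly as follows. This is only an illustration: the 16-byte header layout, the "HDR1" tag and the function name write_with_header are assumptions made up for the example, not part of any existing library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HEADER_SIZE 16 /* hypothetical fixed header size, in bytes */

/* Write a placeholder header, then the payload, then seek back and
 * overwrite the placeholder with the real metadata (here just the
 * payload length as zero-padded text). Returns 0 on success, -1 on error. */
int write_with_header(const char *path, const void *payload, size_t len)
{
    char header[HEADER_SIZE + 1];
    int n;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* 1. Skip over the header by writing HEADER_SIZE placeholder bytes. */
    memset(header, ' ', HEADER_SIZE);
    if (fwrite(header, 1, HEADER_SIZE, fp) != HEADER_SIZE)
        goto fail;

    /* 2. Write the rest of the file. */
    if (len > 0 && fwrite(payload, 1, len, fp) != len)
        goto fail;

    /* 3. Rewind and fill in the metadata now that it is known. */
    n = snprintf(header, sizeof(header), "HDR1 %010lu\n", (unsigned long)len);
    if (n != HEADER_SIZE)
        goto fail; /* value did not fit the fixed-size field */
    if (fseek(fp, 0L, SEEK_SET) != 0)
        goto fail;
    if (fwrite(header, 1, HEADER_SIZE, fp) != HEADER_SIZE)
        goto fail;
    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

Because the header never changes size, the payload does not have to move when the metadata is rewritten.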

If you won't know the size of the variable material until the end, you are in some difficulty. You will probably end up writing the bulk of the file to a temporary file; then, when you're done and know all the variable-size metadata, you write the metadata header to a new (final) file and copy the temporary file after the metadata.
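
A minimal sketch of the temporary-file approach, assuming for illustration that the only metadata is a count of newline-terminated records that is not known until the input has been read. The names write_counted and copy_stream are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Copy everything remaining in `in` to `out`; returns 0 on success. */
static int copy_stream(FILE *in, FILE *out)
{
    char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    return ferror(in) ? -1 : 0;
}

/* Spool the body to a temporary file while counting records, then write
 * the header (the count, as text) followed by the spooled body. */
int write_counted(FILE *in, FILE *out)
{
    FILE *tmp;
    unsigned long count = 0;
    int c, rc = -1;

    assert(in != NULL && out != NULL);
    tmp = tmpfile();
    if (tmp == NULL)
        return -1;

    while ((c = getc(in)) != EOF) {
        if (putc(c, tmp) == EOF)
            goto done;
        if (c == '\n')
            count++;
    }
    if (ferror(in))
        goto done;

    rewind(tmp);                          /* back to the start of the body */
    if (fprintf(out, "%lu\n", count) < 0) /* header first ...              */
        goto done;
    rc = copy_stream(tmp, out);           /* ... then the body             */

done:
    fclose(tmp); /* tmpfile() streams are removed when closed */
    return rc;
}
```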

Note that you should place the size (length) of the file name before the actual file name in the data on disk. Then you can read how big the name is and allocate the right space and read the correct amount of data. Placing the length of the file name after the file name itself really doesn't help very much.
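
For example, a length-prefixed name might be written and read like this. The 1-byte length field (so names of at most 255 bytes) and the function names are assumptions chosen to keep the sketch short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write the name as a 1-byte length followed by that many bytes. */
int write_name(FILE *fp, const char *name)
{
    size_t len;

    assert(fp != NULL && name != NULL);
    len = strlen(name);
    if (len > 255)
        return -1; /* does not fit the 1-byte length field */
    if (putc((int)len, fp) == EOF)
        return -1;
    return fwrite(name, 1, len, fp) == len ? 0 : -1;
}

/* Read the length first, allocate exactly enough space, then read the
 * name. Returns a malloc'd NUL-terminated string, or NULL on error/EOF. */
char *read_name(FILE *fp)
{
    int len;
    char *name;

    assert(fp != NULL);
    len = getc(fp);
    if (len == EOF)
        return NULL;
    name = malloc((size_t)len + 1);
    if (name == NULL)
        return NULL;
    if (fread(name, 1, (size_t)len, fp) != (size_t)len) {
        free(name);
        return NULL;
    }
    name[len] = '\0';
    return name;
}
```

The caller is responsible for calling free() on the string returned by read_name().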

You also need to think about whether your header will be binary data or text. The file name component will be text, but the numbers could be 2-byte or 4-byte binary values, or ASCII (plain text) variable-length equivalents. It is usually easier to debug text representations, but it is more likely that you'll want variable-length data if you do use text. However, you can always use a fixed size with blank padding too. One other advantage of text over binary is that text is portable across machine architectures, whereas binary brings up questions of big-endian vs little-endian machines, and so forth.

You should also consider using a 'magic number' to allow you to identify that the file contains the right sort of data. The 'number' might be an ASCII string, like the !<arch>\n used in some versions of ar headers, for example. Or the %PDF-1.3\n used at the start of a PDF file. Having said that, tar largely gets away without a magic number in the first bytes, but that is an unusual design these days. The file program knows a lot about magic numbers. Its data is sometimes found in a file - such as the files under /usr/share/file for Mac OS X.
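
Checking a magic number when opening a file could look something like this. The function name has_magic and the 16-byte limit on the magic string are assumptions for the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the stream starts with `magic`, 0 if it does not,
 * and -1 if the stream cannot be repositioned to its start. */
int has_magic(FILE *fp, const char *magic)
{
    char buf[16];
    size_t len;

    assert(fp != NULL && magic != NULL);
    len = strlen(magic);
    assert(len <= sizeof(buf)); /* sketch only handles short magic strings */

    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, len, fp) != len)
        return 0; /* file is shorter than the magic string */
    return memcmp(buf, magic, len) == 0;
}
```

A reader would call has_magic(fp, "!<arch>\n") (or whatever string the format uses) before trusting any of the other header fields.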


Can you please explain with an example?

One file format I deal with is for messages identified by a 32-bit (signed) number, with variable lengths for the messages and therefore variable offsets. The file is written in a platform-neutral but binary format. The numbers are written big-endian, with the MSB first. The message numbers are currently constrained to the range ±99,999 (so there is room for just under 200,000 messages in the system as a whole).

The header of the file contains:

  • 2-byte (unsigned) magic number
  • 2-byte (unsigned) count of the number of messages contained in the file, N

It is followed by N entries, each of which describes a message:

  • 4-byte (signed) message number
  • 2-byte (unsigned) message length
  • 4-byte (unsigned) offset to the start of the message

The N entries are in sorted order of message number, but there is no requirement that the message numbers be contiguous. Missing numbers are simply missing.

After the N entries, the actual message texts follow, each consisting of the appropriate number of bytes identified by the corresponding entry plus an ASCII NUL '\0' byte.
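
As a rough illustration of that layout, the header and one entry could be encoded into byte buffers like this. The struct, the function names and the magic value 0xC0DE are made up for the example; the real format's magic number is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAGIC_NUMBER 0xC0DEu /* hypothetical value */
#define HEADER_BYTES 4       /* 2-byte magic + 2-byte count */
#define ENTRY_BYTES  10      /* 4-byte number + 2-byte length + 4-byte offset */

struct msg_entry {
    int32_t  number; /* message number */
    uint16_t length; /* message length, excluding the trailing NUL */
    uint32_t offset; /* offset to the start of the message */
};

/* Store values big-endian (MSB first), regardless of host byte order. */
static void put_be16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xFF);
}

static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)((v >> 16) & 0xFF);
    p[2] = (unsigned char)((v >> 8) & 0xFF);
    p[3] = (unsigned char)(v & 0xFF);
}

void encode_header(unsigned char out[HEADER_BYTES], uint16_t count)
{
    assert(out != NULL);
    put_be16(out, (uint16_t)MAGIC_NUMBER);
    put_be16(out + 2, count);
}

void encode_entry(unsigned char out[ENTRY_BYTES], const struct msg_entry *e)
{
    assert(out != NULL && e != NULL);
    put_be32(out, (uint32_t)e->number); /* two's complement bit pattern */
    put_be16(out + 4, e->length);
    put_be32(out + 6, e->offset);
}
```

Encoding field by field, rather than writing the struct with fwrite(), avoids depending on the compiler's padding and the machine's byte order.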

As the file is generated, the text of each message is written out to an intermediate file in the order processed, recording the offset of the message in the file. It doesn't matter whether the messages are read or written in order; all that matters is that the offset from the end of the header is recorded in a header record. Once all the messages have been read in, the in-memory copy of the file entries can be sorted into numeric order, and the final file can be written. First there is the magic number and the number of messages; then N entries describing the messages; followed by the text of the messages copied from the intermediate file.
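
The final step might be sketched as below. It assumes the offsets are relative to the start of the message text area (that is, positions in the intermediate file), so the intermediate file can be copied verbatim; the function and type names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct msg_entry {
    int32_t  number;
    uint16_t length;
    uint32_t offset; /* assumed: position in the intermediate file */
};

static int by_number(const void *a, const void *b)
{
    int32_t x = ((const struct msg_entry *)a)->number;
    int32_t y = ((const struct msg_entry *)b)->number;
    return (x > y) - (x < y);
}

/* Write the low `nbytes` bytes of v to fp, most significant byte first. */
static int put_be(FILE *fp, uint32_t v, int nbytes)
{
    while (nbytes-- > 0)
        if (putc((int)((v >> (8 * nbytes)) & 0xFF), fp) == EOF)
            return -1;
    return 0;
}

/* Sort the in-memory entries, then write: magic, count, the N entries,
 * and finally the message texts copied from the intermediate file. */
int write_final(FILE *out, FILE *texts, struct msg_entry *tab,
                uint16_t n, uint16_t magic)
{
    uint16_t i;
    int c;

    assert(out != NULL && texts != NULL);
    qsort(tab, n, sizeof(tab[0]), by_number);

    if (put_be(out, magic, 2) != 0 || put_be(out, n, 2) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (put_be(out, (uint32_t)tab[i].number, 4) != 0 ||
            put_be(out, tab[i].length, 2) != 0 ||
            put_be(out, tab[i].offset, 4) != 0)
            return -1;

    rewind(texts);
    while ((c = getc(texts)) != EOF)
        if (putc(c, out) == EOF)
            return -1;
    return ferror(texts) ? -1 : 0;
}
```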

Reading message number M is simple enough. You do a binary search through the N entries to find the entry for M. If it isn't there, so be it - that's an error. If it is there, you know where to find the message in the file and how long it is.
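
With the N entries already decoded into an in-memory array, the lookup can lean on the standard bsearch() function. The struct and the name find_message are assumptions for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct msg_entry {
    int32_t  number;
    uint16_t length;
    uint32_t offset;
};

/* bsearch() comparator: the key is a message number, the element an entry. */
static int cmp_number(const void *key, const void *elem)
{
    int32_t k = *(const int32_t *)key;
    int32_t n = ((const struct msg_entry *)elem)->number;
    return (k > n) - (k < n);
}

/* Returns the entry for `number`, or NULL if the message is missing.
 * The table must be sorted by message number. */
const struct msg_entry *find_message(const struct msg_entry *tab, size_t n,
                                     int32_t number)
{
    assert(tab != NULL || n == 0);
    return bsearch(&number, tab, n, sizeof(tab[0]), cmp_number);
}
```

Given the entry, the reader seeks to the recorded offset in the message text area and reads length bytes.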

The fact that the data is in a fixed but binary format doesn't really complicate things. You use the same functions on both big-endian and little-endian machines to read the number into native format. In theory, you could optimize for a big-endian machine, but only if the machine doesn't have problems with insufficiently aligned data. It is simpler to forget that the optimization might be possible and simply use the same code everywhere.


If the format described above were converted to a text format, then it would probably have 8 bytes (say) reserved for the magic number (which might well be a 7-letter string followed by a newline), and 6 bytes reserved for the number of messages (5 digits plus a newline). Each of the message entries could reserve 6 bytes for the message number (±99,999), plus a space, plus 4 bytes for the length (maximum, 8 KiB), plus a space, plus 8 bytes for the offset (7 digits plus a newline).

MAGICNO
12345
-99999 8000 0000000
-90210   38 0008000
...

Again, the advantage of the text format is readability: you can look at the file and see the meaning of the data quite easily.
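
Fixed-width text entries like the ones shown can be produced and parsed with the standard formatted I/O functions. This is only a sketch; the function names and the range checks are assumptions based on the field widths described above.

```c
#include <assert.h>
#include <stdio.h>

#define TEXT_ENTRY_BYTES 20 /* 6 + 1 + 4 + 1 + 7 characters plus a newline */

/* Format one entry as a fixed-width line. Returns 0 on success, or -1 if
 * a value does not fit its field. */
int format_entry(char out[TEXT_ENTRY_BYTES + 1], long number,
                 unsigned length, unsigned long offset)
{
    int n;

    assert(out != NULL);
    if (number < -99999L || number > 99999L ||
        length > 9999u || offset > 9999999UL)
        return -1;
    n = snprintf(out, TEXT_ENTRY_BYTES + 1, "%6ld %4u %07lu\n",
                 number, length, offset);
    return n == TEXT_ENTRY_BYTES ? 0 : -1;
}

/* Parse one entry line back into its three fields. Returns 0 on success. */
int parse_entry(const char *line, long *number, unsigned *length,
                unsigned long *offset)
{
    assert(line != NULL);
    return sscanf(line, "%ld %u %lu", number, length, offset) == 3 ? 0 : -1;
}
```

Because every line has the same length, entry i still starts at a computable position in the file, so a binary search remains possible without reading the whole table.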

You can have endless variations on this theme.

Jonathan Leffler answered Sep 30 '22 12:09