C#/.NET - Custom Binary File Formats - Where to Start?

I need to be able to store some data in a custom binary file format. I've never designed my own file format before. It needs to be a friendly format for traveling between the C#, Java and Ruby/Perl/Python worlds.

To start with, the file will consist of records: a GUID field and a JSON/YAML/XML packet field. I'm not sure what to use as delimiters. A comma, tab or newline kind of thing seems too fragile. What does Excel do? Or the pre-XML OpenOffice formats? Should you use ASCII chars 0 or 1? Not sure where to begin. Any articles or books on the topic?

This file format may expand later to include a "header section".

Note: To start with I'll be working in .NET, but I'd like the format to be easily portable.

UPDATE:
The processing of the "packets" can be slow, but navigation within the file format cannot. So I think XML is off the table.

BuddyJoe asked Apr 27 '09 19:04

2 Answers

How about using "protocol buffers"? Designed as an efficient, portable, version-tolerant, general-purpose binary format, it gives you C++, Java and Python in the Google library, and C#, Perl, Ruby and others in the community ports.

Note that protocol buffers has no dedicated Guid data type, but you can shim it as a message containing (essentially) a byte[].
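As a rough illustration of treating a GUID as 16 raw bytes, here is a minimal Python sketch using only the standard `uuid` module. The GUID value is made up for the example. One thing to watch when crossing platforms: .NET's `Guid.ToByteArray()` writes the first three fields little-endian, which Python exposes as `bytes_le`, so the format's documentation should say which of the two layouts it uses.

```python
import uuid

# Hypothetical example value; any GUID works.
guid = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")

# RFC 4122 layout: all fields big-endian, same order as the text form.
rfc_bytes = guid.bytes

# Layout produced by .NET's Guid.ToByteArray(): the first three fields
# are little-endian, the last eight bytes are unchanged.
dotnet_bytes = guid.bytes_le

# Either layout round-trips, as long as reader and writer agree on it.
from_rfc = uuid.UUID(bytes=rfc_bytes)
from_dotnet = uuid.UUID(bytes_le=dotnet_bytes)

print(rfc_bytes.hex())
print(dotnet_bytes.hex())
```

Whichever layout is picked, both sides only ever see a 16-byte field, which is what the byte[] shim amounts to.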

Normally for .NET work, I'd recommend protobuf-net (but as the author, I'm somewhat biased) - however, if you intend to use other languages later you might do better (long term) using Jon's dotnet-protobufs; that'll give you a familiar API across the platforms (whereas protobuf-net uses .NET idioms).

Marc Gravell answered Sep 30 '22 07:09



I'll try to add some general hints for creating a portable binary file format.

Note that inventing a binary file format means documenting how the bits in it are laid out and what they mean. It's not coding, but documentation.

Now the hints:

  1. Decide what to do with endianness. A good and simple way to go is to decide it once and for all. Little-endian is the preferable choice if the format is mostly used on common PCs (that is, x86), because it saves conversions and so helps performance.

  2. Create a header. Yes, it is a good idea to always have a header. The first bytes of the file should tell you what format you are dealing with.

    • Start with a magic value so that your format can be recognized (an ASCII string will do the trick).
    • Add a version. A file format version costs little to add, and it lets you handle backward compatibility later.
  3. Finally, add the data. The format of the data will be specific and will always be based on your exact needs. Basically, the data will be stored as a binary image of some data structure. That data structure is what you need to come up with.
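The three hints above can be sketched in Python roughly as follows. Everything specific here is an assumption made for illustration: the magic string `BJRF`, the 2-byte version field, and the record layout of a 16-byte GUID followed by a 4-byte payload length. Prefixing each payload with its length also sidesteps the delimiter problem from the question, since the reader never has to scan for a separator.

```python
import io
import struct
import uuid

MAGIC = b"BJRF"   # hypothetical magic string
VERSION = 1       # hypothetical format version

# "<" means little-endian with no padding, on every platform.
HEADER = struct.Struct("<4sH")        # magic, version
RECORD_HEAD = struct.Struct("<16sI")  # GUID bytes, payload length


def write_file(stream, records):
    """Write the header, then each (uuid.UUID, bytes) record."""
    stream.write(HEADER.pack(MAGIC, VERSION))
    for guid, payload in records:
        stream.write(RECORD_HEAD.pack(guid.bytes, len(payload)))
        stream.write(payload)


def read_file(stream):
    """Check the header, then read records until end of stream."""
    head = stream.read(HEADER.size)
    if len(head) != HEADER.size:
        raise ValueError("truncated header")
    magic, version = HEADER.unpack(head)
    if magic != MAGIC:
        raise ValueError("unrecognized file format")
    if version != VERSION:
        raise ValueError("unsupported version %d" % version)

    records = []
    while True:
        head = stream.read(RECORD_HEAD.size)
        if not head:
            break  # clean end of file
        if len(head) != RECORD_HEAD.size:
            raise ValueError("truncated record header")
        guid_bytes, length = RECORD_HEAD.unpack(head)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated payload")
        records.append((uuid.UUID(bytes=guid_bytes), payload))
    return records


buf = io.BytesIO()
sample = [(uuid.uuid4(), b'{"name": "first"}'), (uuid.uuid4(), b"{}")]
write_file(buf, sample)
buf.seek(0)
loaded = read_file(buf)
```

The same layout can be read from C# or Java with a binary reader, provided the reader also treats the integers as little-endian.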

If you need random access to your data by some sort of index, B-trees are the way to go, while if you just need to write a lot of numbers and then read them all back, an "array" will do the trick.
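A full B-tree is too long to show here, so the sketch below swaps in a simpler stand-in: a sorted offset table written after the payloads and searched with binary search. It gives the same kind of fast navigation for a file that is written once and then read. The entry and footer layouts are assumptions for the example, and the file header is left out for brevity.

```python
import bisect
import io
import struct
import uuid

ENTRY = struct.Struct("<16sQI")  # GUID bytes, payload offset, payload length
FOOTER = struct.Struct("<QI")    # index offset, entry count


def write_indexed(stream, records):
    """Write payloads, then a GUID-sorted index, then a fixed-size footer."""
    entries = []
    for guid, payload in records:
        entries.append((guid.bytes, stream.tell(), len(payload)))
        stream.write(payload)
    entries.sort()  # sorted by GUID bytes, so binary search works
    index_offset = stream.tell()
    for entry in entries:
        stream.write(ENTRY.pack(*entry))
    stream.write(FOOTER.pack(index_offset, len(entries)))


def load_index(stream):
    """Read the footer from the end of the file, then the index it points to."""
    stream.seek(-FOOTER.size, io.SEEK_END)
    index_offset, count = FOOTER.unpack(stream.read(FOOTER.size))
    stream.seek(index_offset)
    raw = stream.read(ENTRY.size * count)
    entries = [ENTRY.unpack_from(raw, i * ENTRY.size) for i in range(count)]
    keys = [e[0] for e in entries]
    spans = [(e[1], e[2]) for e in entries]
    return keys, spans


def lookup(stream, keys, spans, guid):
    """Return the payload for guid, or None, without scanning the file."""
    pos = bisect.bisect_left(keys, guid.bytes)
    if pos == len(keys) or keys[pos] != guid.bytes:
        return None
    offset, length = spans[pos]
    stream.seek(offset)
    return stream.read(length)


ids = [uuid.uuid4() for _ in range(3)]
buf = io.BytesIO()
write_indexed(buf, [(ids[0], b"alpha"), (ids[1], b"beta"), (ids[2], b"gamma")])
keys, spans = load_index(buf)
found = lookup(buf, keys, spans, ids[1])
```

Only the index is loaded up front, so the slow JSON/YAML packets are read and parsed one at a time, on demand.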

Additionally, you might use a TLV (Type-Length-Value) concept for forward compatibility.
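A minimal TLV sketch in Python, assuming a 2-byte type and a 4-byte length, both little-endian; the type numbers are invented for the example. Forward compatibility comes from the length field: a reader that does not recognize a type can still skip exactly that many bytes and carry on.

```python
import struct

TLV_HEAD = struct.Struct("<HI")  # type (2 bytes), length (4 bytes)

# Hypothetical type numbers for this example.
TYPE_GUID = 1
TYPE_JSON = 2


def encode_tlv(fields):
    """Encode (type_id, value_bytes) pairs as consecutive TLV entries."""
    out = bytearray()
    for type_id, value in fields:
        out += TLV_HEAD.pack(type_id, len(value))
        out += value
    return bytes(out)


def decode_tlv(data, known_types):
    """Return the fields whose type is known; skip everything else."""
    fields = []
    pos = 0
    while pos < len(data):
        if len(data) - pos < TLV_HEAD.size:
            raise ValueError("truncated TLV header")
        type_id, length = TLV_HEAD.unpack_from(data, pos)
        pos += TLV_HEAD.size
        if len(data) - pos < length:
            raise ValueError("truncated TLV value")
        if type_id in known_types:
            fields.append((type_id, data[pos:pos + length]))
        pos += length  # unknown types are skipped using their length
    return fields


# A newer writer adds type 99, which this older reader does not know.
blob = encode_tlv([
    (TYPE_GUID, bytes(16)),
    (99, b"added in a later version"),
    (TYPE_JSON, b'{"a": 1}'),
])
decoded = decode_tlv(blob, known_types={TYPE_GUID, TYPE_JSON})
```

Here `decoded` holds only the GUID and JSON fields; the unknown entry is passed over without breaking the read.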

Jan Smrčina answered Sep 30 '22 09:09