So, I'm getting this data. From the network socket, or out of a file. I'm cobbling together code that will interpret the data. Read some bytes, check some flags, and some bytes indicate how much data follows. Read in that much data, rinse, repeat.
This task reminds me much to parsing source code. I'm comfy with lex/yacc and antlr, but they're not up to this task. You can't specify bits and raw bytes as tokens (well, maybe you could, but I wouldn't know how), and you can't coax them into "read two bytes, make them into an unsigned 16bit integer, call it n, and then read n bytes.".
Then again, when the spec of the protocol/data format is defined in a systematic manner (not all of them are), there should be a systematic way to read in data that is formatted according to the protocol. Right?
There's gotta be a tool that does that.
Kaitai Struct initiative have emerged recently to solve exactly that task: to generate binary parsers from a spec. You can provide a scheme for serialization of arbitrary data structure in a YAML/JSON-based format like that:
meta:
id: my_struct
endian: le
seq:
- id: some_int
type: u4
- id: some_string
type: str
encoding: UTF-8
size: some_int + 4
- id: another_int
type: u4
compile it using ksc
(they provide a reference compiler implementation), and, voila, you've got a parser in any supported programming language, for example, in C++:
my_struct_t::my_struct_t(kaitai::kstream *p_io, kaitai::kstruct *p_parent, my_struct_t *p_root) : kaitai::kstruct(p_io) {
m__parent = p_parent;
m__root = this;
m_some_int = m__io->read_u4le();
m_some_string = m__io->read_str_byte_limit((some_int() + 4), "UTF-8");
m_another_int = m__io->read_u4le();
}
or in Java:
private void _parse() throws IOException {
this.someInt = this._io.readU4le();
this.someString = this._io.readStrByteLimit((someInt() + 4), "UTF-8");
this.anotherInt = this._io.readU4le();
}
After adding that to your project, it provides a very intuitive API like that (an example in Java, but they support more languages):
// given file.dat contains 01 00 00 00|41 42 43 44|07 01 00 00
MyStruct s = MyStruct.fromFile("path/to/file.dat");
s.someString() // => "ABCD"
s.anotherInt() // => 263 = 0x107
It supports different endianness, conditional structures, substructures, etc, and lots more. Pretty complex data structures, such as PNG image file format or PE executable can be parsed.
You may try to employ Boost.Spirit (v2) which has recently got binary parsing tools, endianness-aware native and mixed parsers
// This is not a complete and useful example, but just illustration that parsing
// of raw binary to real data components is possible
typedef boost::uint8_t byte_t;
byte_t raw[16] = { 0 };
char const* hex = "01010000005839B4C876BEF33F83C0CA";
my_custom_hex_to_bytes(hex, raw, 16);
// parse raw binary stream bytes to 4 separate words
boost::uint32_t word(0);
byte_t* beg = raw;
boost::spirit::qi::parse(beg, beg + 16, boost::spirit::qi::dword, word))
UPDATE: I found similar question, where Joel de Guzman confirms in his answer availability of binary parsers: Can Boost Spirit be used to parse byte stream data?
The Construct parser, written in Python, has done some interesting work in this field.
The project has had a number of authors, and periods of stagnation, but as of 2017 it seems to be more active again.
Read up on ASN.1. If you can describe the binary data in its terms, you can then use various available kits. Not for the faint of heart.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With