
Name me a Binary Parser. A parser for binary data [closed]

So, I'm getting this data. From the network socket, or out of a file. I'm cobbling together code that will interpret the data. Read some bytes, check some flags, and some bytes indicate how much data follows. Read in that much data, rinse, repeat.

This task reminds me a lot of parsing source code. I'm comfy with lex/yacc and antlr, but they're not up to this task. You can't specify bits and raw bytes as tokens (well, maybe you could, but I wouldn't know how), and you can't coax them into "read two bytes, make them into an unsigned 16-bit integer, call it n, and then read n bytes".
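To make that concrete, the hand-cobbled version looks roughly like the Java sketch below. The framing is an assumption made up for illustration (a big-endian unsigned 16-bit length n, followed by n payload bytes), and so are the names LengthPrefixedReader, readRecord and readAll.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing: unsigned 16-bit big-endian length n, then n payload bytes.
class LengthPrefixedReader {
    // Reads one record; returns null when the stream ends before a new record starts.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int hi = in.read();              // -1 means a clean end of stream
        if (hi < 0) {
            return null;
        }
        int lo = in.readUnsignedByte();  // throws EOFException if the length is truncated
        int n = (hi << 8) | lo;          // "make them into an unsigned 16-bit integer, call it n"
        byte[] payload = new byte[n];
        in.readFully(payload);           // "and then read n bytes"; EOFException if too short
        return payload;
    }

    // Rinse, repeat: keep reading records until the stream is exhausted.
    static List<byte[]> readAll(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        List<byte[]> records = new ArrayList<>();
        for (byte[] r = readRecord(in); r != null; r = readRecord(in)) {
            records.add(r);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = {0x00, 0x03, 'a', 'b', 'c', 0x00, 0x01, 'z'};
        for (byte[] r : readAll(new ByteArrayInputStream(wire))) {
            System.out.println(new String(r, StandardCharsets.US_ASCII));
        }
    }
}
```

Every new field or flag means more code of this kind, which is exactly the part a generator should be writing from the spec.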

Then again, when the spec of the protocol/data format is defined in a systematic manner (not all of them are), there should be a systematic way to read in data that is formatted according to the protocol. Right?

There's gotta be a tool that does that.

doppelfish Avatar asked Feb 06 '10 21:02



4 Answers

The Kaitai Struct project emerged recently to solve exactly this task: generating binary parsers from a spec. You describe the serialization scheme of an arbitrary data structure in a YAML/JSON-based format like this:

meta:
  id: my_struct
  endian: le
seq:
  - id: some_int
    type: u4
  - id: some_string
    type: str
    encoding: UTF-8
    size: some_int + 4
  - id: another_int
    type: u4

Compile it with ksc (the reference compiler the project provides) and, voila, you have a parser in any of the supported programming languages, for example in C++:

my_struct_t::my_struct_t(kaitai::kstream *p_io, kaitai::kstruct *p_parent, my_struct_t *p_root) : kaitai::kstruct(p_io) {
    m__parent = p_parent;
    m__root = this;
    m_some_int = m__io->read_u4le();
    m_some_string = m__io->read_str_byte_limit((some_int() + 4), "UTF-8");
    m_another_int = m__io->read_u4le();
}

or in Java:

private void _parse() throws IOException {
    this.someInt = this._io.readU4le();
    this.someString = this._io.readStrByteLimit((someInt() + 4), "UTF-8");
    this.anotherInt = this._io.readU4le();
}

Once the generated code is added to your project, it offers an intuitive API like this (the example is in Java, but other languages are supported):

// given file.dat contains 00 00 00 00|41 42 43 44|07 01 00 00
// (some_int is 0, so some_string spans 0 + 4 = 4 bytes)

MyStruct s = MyStruct.fromFile("path/to/file.dat");
s.someString() // => "ABCD"
s.anotherInt() // => 263 = 0x107

It supports both byte orders, conditional structures, substructures, and much more. Fairly complex formats, such as the PNG image format or PE executables, can be parsed.
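For comparison, here is a hand-written Java sketch of the same my_struct layout using only java.nio. It is not Kaitai Struct output, and the class name MyStructByHand is made up for illustration. In the sample bytes the first field is 0, so the string field spans 0 + 4 = 4 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hand-written stand-in for the generated parser (hypothetical, for illustration only).
class MyStructByHand {
    final long someInt;
    final String someString;
    final long anotherInt;

    MyStructByHand(byte[] data) {
        // endian: le
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        // type: u4 (Java ints are signed, so widen to long and mask)
        someInt = buf.getInt() & 0xFFFFFFFFL;
        // size: some_int + 4 (no overflow check; a real parser would need one)
        byte[] raw = new byte[(int) (someInt + 4)];
        buf.get(raw);
        someString = new String(raw, StandardCharsets.UTF_8);
        anotherInt = buf.getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x00, 0x00, 0x00, 0x41, 0x42, 0x43, 0x44, 0x07, 0x01, 0x00, 0x00};
        MyStructByHand s = new MyStructByHand(data);
        System.out.println(s.someString + " " + s.anotherInt); // prints "ABCD 263"
    }
}
```

The spec-driven approach keeps the field order, sizes, and byte order in one declarative file instead of scattering them through code like this.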

dpm_min Avatar answered Nov 15 '22 16:11



You could try Boost.Spirit (v2), which recently gained binary parsing tools: endianness-aware native and mixed parsers.

// This is not a complete and useful example; it only illustrates that parsing
// raw binary into real data components is possible.
#include <boost/cstdint.hpp>
#include <boost/spirit/include/qi.hpp>

typedef boost::uint8_t byte_t;
byte_t raw[16] = { 0 };
char const* hex = "01010000005839B4C876BEF33F83C0CA";
my_custom_hex_to_bytes(hex, raw, 16); // user-supplied helper, not part of Boost

// parse the first four raw bytes into one 32-bit word (native byte order)
boost::uint32_t word(0);
byte_t* beg = raw;
boost::spirit::qi::parse(beg, beg + 16, boost::spirit::qi::dword, word);

UPDATE: I found a similar question in which Joel de Guzman confirms in his answer that binary parsers are available: Can Boost Spirit be used to parse byte stream data?

mloskot Avatar answered Nov 15 '22 16:11



Construct, a binary parsing library written in Python, has done some interesting work in this field.

The project has had a number of authors, and periods of stagnation, but as of 2017 it seems to be more active again.

Craig McQueen Avatar answered Nov 15 '22 17:11



Read up on ASN.1. If you can describe the binary data in its terms, you can then use various available kits. Not for the faint of heart.
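To give a feel for what those kits handle under the hood: the BER/DER encodings of ASN.1 put every element on the wire as tag, length, value. The Java sketch below reads one such element. It is a simplification made for illustration, not a real ASN.1 decoder: the class name TlvSketch is made up, and it assumes a single-byte tag and a definite length of at most three length bytes.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Minimal reader for one BER/DER tag-length-value element (hypothetical sketch).
class TlvSketch {
    final int tag;
    final byte[] value;

    TlvSketch(int tag, byte[] value) {
        this.tag = tag;
        this.value = value;
    }

    static TlvSketch read(ByteBuffer buf) {
        int tag = buf.get() & 0xFF;        // assumes the tag fits in a single byte
        int first = buf.get() & 0xFF;
        int length;
        if (first < 0x80) {
            length = first;                // short form: this byte is the length
        } else {
            int count = first & 0x7F;      // long form: number of length bytes that follow
            if (count == 0 || count > 3) {
                throw new IllegalArgumentException("unsupported length encoding");
            }
            length = 0;
            for (int i = 0; i < count; i++) {
                length = (length << 8) | (buf.get() & 0xFF); // big-endian length
            }
        }
        byte[] value = new byte[length];
        buf.get(value);                    // throws BufferUnderflowException if truncated
        return new TlvSketch(tag, value);
    }

    public static void main(String[] args) {
        // 02 01 05 is the DER encoding of INTEGER 5.
        TlvSketch t = read(ByteBuffer.wrap(new byte[] {0x02, 0x01, 0x05}));
        System.out.println(t.tag + " " + Arrays.toString(t.value)); // prints "2 [5]"
    }
}
```

Nested types, multi-byte tags, and the other encoding rules are where the real toolkits earn their keep.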

bmargulies Avatar answered Nov 15 '22 15:11
