Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing binary files with Python

As a side project I would like to try to parse binary files (Mach-O files specifically). I know tools exist for this already (otool) so consider this a learning exercise.

The problem I'm hitting is that I don't understand how to convert the binary elements found into a python representation. For example, the Mach-O file format starts with a header which is defined by a C Struct. The first item is a uint_32 'magic number' field. When i do

magic = f.read(4)

I get

b'\xcf\xfa\xed\xfe'

This is starting to make sense to me. It's literally a byte array of 4 bytes. However I want to treat this like a 4-byte int that represents the original magic number. Another example is the numberOfSections field. I just want the number represented by 4-byte field, not an array of literal bytes.

Perhaps I'm thinking about this all wrong. Has anybody worked on anything similar? Do I need to write functions to look these 4-byte byte arrays and shift and combine their values to produce the number I want? Is endienness going to screw me here? Any pointers would be most helpful.

like image 921
D.C. Avatar asked Aug 21 '11 20:08

D.C.


People also ask

How do you decode a binary file in Python?

You can open the file using open() method by passing b parameter to open it in binary mode and read the file bytes. open('filename', "rb") opens the binary file in read mode.

How do I read an entire binary file in Python?

To read from a binary file, we need to open it with the mode rb instead of the default mode of rt : >>> with open("exercises. zip", mode="rb") as zip_file: ... contents = zip_file. read() ...

Can Python handle binary files?

Python File I/O - Read and Write Files. In Python, the IO module provides methods of three types of IO operations; raw binary files, buffered binary files, and text files. The canonical way to create a file object is by using the open() function.


2 Answers

There's Kaitai Struct project that solves exactly that problem. First, you describe a certain file format using a .ksy spec, then you compile it into a Python library (or, actually, a library in any other major programming language), import it, and, voila, parsing boils down to:

from mach_o import MachO
my_file = MachO.from_file("/path/to/your/file")
my_file.magic # => 0xfeedface
my_file.num_of_sections # => some other integer
my_file.sections # => list of objects that represent sections

They have a growing repository of file format specs. It doesn't have Mach-O file format spec (yet?), but there are complex formats like Java .class or Microsoft's PE executable described there, so I guess it shouldn't be a major problem to write spec for Mach-O format as well.

It is actually better than Construct or Hachoir, because it's compiled (as opposed to interpreted), thus it's faster, and it includes tons of other helpful tools like visualizer or format diagram maker. For example, this is a generated explanation diagram for PE executable format:

PE executable format

like image 199
chosen 0x4f Avatar answered Oct 06 '22 02:10

chosen 0x4f


I would suggest the Construct module. It offers a very high level interface.

like image 41
krase Avatar answered Oct 06 '22 01:10

krase