I am working on a very simple decompiler for the MIPS architecture, and as I progress I have to define lots of rules for code analysis, for example "if this opcode is lui and the next opcode is addiu, then return var = value" or "if this opcode is bne and it refers to an address before the current one, create a loop definition in the parsing tree". The problem: there are tons of such rules and I can't find a good way to define them. I've tried writing separate functions for every rule, defining nice OOP base logic classes and extending them to create rules, and even tried regular expressions on the disassembled code (to my surprise this works better than expected), but no matter what I tried, my code soon became too big and too hard to read, however well I tried to document and structure it.
This brings me to the conclusion that I am trying to solve this task with the wrong tools (not to mention being too stupid for such a complex task :) ), but I have no real idea what I should try. Currently I have two untested ideas: one is using some kind of DSL (I have absolutely no experience with this, so I may be totally wrong), and the other is writing some kind of binary regexp-like tool for opcode matching.
I hope someone can point me in the right direction, thanks.
A decompiler is a programming tool that converts an executable program or low-level/machine language into a format understandable to software programmers. It performs the operations of a compiler, which translates source code into an executable format, but in reverse, and is therefore a form of reverse engineering.
Not all programs can be decompiled. In particular, it is not easy to separate data from code, because both are represented similarly in most current computer systems.
Decompilation is difficult because decompilers must recover source-code abstractions that are missing from the binary/bytecode target.
I would guess that some of your rules are too low-level, and that's why they're becoming unmanageable.
Recognising lui followed by addiu as a 32-bit constant load certainly seems very reasonable; but trying to derive control flow from branch instructions at the individual opcode level seems rather more suspect - I think you want to be working with basic blocks there.
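To make the distinction concrete, here is a minimal Python sketch of both levels: a peephole rule that folds lui/addiu into a constant, and a separate pass that splits the code into basic blocks and flags backward branches as loop candidates. The `Insn` tuple, the `BRANCHES` set and all function names are made up for illustration, and the sketch assumes branch delay slots have already been dealt with by an earlier pass.

```python
from typing import NamedTuple

class Insn(NamedTuple):
    addr: int
    op: str
    args: tuple  # registers as str, immediates and branch targets as int

# Illustrative subset only; a real table would list every branch/jump mnemonic.
BRANCHES = {"beq", "bne", "b", "j"}

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def fold_constant(first, second):
    """Return (reg, value) if the pair is `lui r, hi; addiu r, r, lo`, else None."""
    if first.op != "lui" or second.op != "addiu":
        return None
    reg, hi = first.args
    dst, src, lo = second.args
    if not (reg == dst == src):
        return None
    # addiu sign-extends its immediate, so lo may effectively subtract from hi << 16.
    return reg, ((hi << 16) + sign_extend16(lo)) & 0xFFFFFFFF

def split_blocks(insns):
    """Split a linear instruction list into basic blocks (lists of Insn)."""
    leaders = {insns[0].addr}
    for i, insn in enumerate(insns):
        if insn.op in BRANCHES:
            leaders.add(insn.args[-1])          # the branch target starts a block
            if i + 1 < len(insns):
                leaders.add(insns[i + 1].addr)  # so does the fall-through
    blocks = []
    for insn in insns:
        if insn.addr in leaders:
            blocks.append([])
        blocks[-1].append(insn)
    return blocks

def back_edges(blocks):
    """Return (block_start, target) for blocks ending in a backward branch."""
    edges = []
    for block in blocks:
        last = block[-1]
        if last.op in BRANCHES and last.args[-1] <= last.addr:
            edges.append((block[0].addr, last.args[-1]))
    return edges

program = [
    Insn(0x00, "lui",   ("a0", 0x1234)),
    Insn(0x04, "addiu", ("a0", "a0", 0x5678)),
    Insn(0x08, "addiu", ("t0", "t0", -1)),
    Insn(0x0C, "bne",   ("t0", "zero", 0x08)),
    Insn(0x10, "jr",    ("ra",)),
]

print(fold_constant(program[0], program[1]))
print(back_edges(split_blocks(program)))
```

The point of the split is that the loop rule no longer looks at opcodes at all: it only sees blocks and edges, so it stays the same however many instruction-level rules are added.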
Cifuentes' Reverse Compilation Techniques is a reference which keeps cropping up in discussions of decompilation that I've seen; from a fairly brief skim, it seems like it would be well worth spending some time reading in detail for your project.
Some of the x86-specific stuff won't be relevant - in particular, the step which translates x86 to a low-level intermediate representation is probably not necessary for MIPS (MIPS is essentially just one basic operation per opcode already) - but otherwise much of the content looks like it should be very useful.
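Because each MIPS instruction is roughly one operation, the per-opcode part of the rules can plausibly live in a data table rather than in code. The following is only a sketch of that idea: the `TEMPLATES` table, the `lift` function and the `mem32[...]` notation are invented for this example, and the operand order for lw/sw (register, offset, base) is an assumption about how the disassembler hands over its operands.

```python
# Hypothetical table: mnemonic -> expression template over the operands.
TEMPLATES = {
    "addu":  "{0} = {1} + {2}",
    "subu":  "{0} = {1} - {2}",
    "addiu": "{0} = {1} + {2}",
    "and":   "{0} = {1} & {2}",
    "or":    "{0} = {1} | {2}",
    "sll":   "{0} = {1} << {2}",
    "lui":   "{0} = {1} << 16",
    "lw":    "{0} = mem32[{2} + {1}]",   # operands: rt, offset, base
    "sw":    "mem32[{2} + {1}] = {0}",   # operands: rt, offset, base
}

def lift(op, *operands):
    """Render one instruction as a statement, or mark it as not yet handled."""
    template = TEMPLATES.get(op)
    if template is None:
        return f"unknown({op})"
    return template.format(*operands)

print(lift("addu", "v0", "a0", "a1"))
print(lift("lw", "t0", 8, "sp"))
```

Adding an instruction then means adding one table row, which keeps the multi-instruction and control-flow rules as the only ones that need real code.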