I am playing around with smali and baksmali on a small Hello World Android application I have written. My source code is:
package com.hello;
import android.app.Activity;
import android.os.Bundle;
public class Main extends Activity {
/** Called when the activity is first created. */
@Override
public void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.main);
}
}
which was then disassembled to:
.class public Lcom/hello/Main;
.super Landroid/app/Activity;
.source "Main.java"
# direct methods
.method public constructor <init>()V
.locals 0
.prologue
.line 6
invoke-direct {p0}, Landroid/app/Activity;-><init>()V
return-void
.end method
# virtual methods
.method public onCreate(Landroid/os/Bundle;)V
.locals 1
.parameter "savedInstanceState"
.prologue
.line 10
invoke-super {p0, p1}, Landroid/app/Activity;->onCreate(Landroid/os/Bundle;)V
.line 11
const/high16 v0, 0x7f03
invoke-virtual {p0, v0}, Lcom/hello/Main;->setContentView(I)V
.line 12
return-void
.end method
I understand that this is some kind of intermediate representation, but I am not sure what it is. As I understand it, there must be a specification that explains how to read this representation, but I have been unable to figure out how to search for it. So, given an apk file, can someone explain in layman's terms how the Dalvik opcode specification is used to arrive at this representation? My current understanding is this:
Any information (perhaps with some simple examples) on the above two steps would help me in a great way in getting the concepts right.
Update 1 (posted after the reply from Chris):
So essentially, I would do the following to arrive at the Dalvik bytecode:
Then the disassembler reads the classes.dex file and determines all the classes present in the apk. Can you provide me some information on how this is done? Does it parse the file byte by byte, look each value up in the Dalvik specification, and resolve it appropriately? Or is something else happening? For instance, when I used hexdump on classes.dex, it gave me something like this:
64 65 78 0a 30 33 ...
Are these now used for Opcode lookups?
Actually, in short, I am interested in knowing how all this "magic" is done. So for instance, if I were to learn to write this tool, what is the high-level roadmap I should follow?
What you're looking at is Dalvik bytecode, shown in the textual form that baksmali produces. The manifest is a separate issue which I'll get to in a minute. Effectively, when you build your Android application, javac first compiles your Java source to Java bytecode (just as it would for a standard JVM application), and the dx tool then converts that into Dalvik bytecode. Each Dalvik opcode is a single byte, so there are at most 256 of them.
For example, invoke-super
is an opcode that instructs the dvm (dalvik virtual machine) to invoke a method on the super class. Similarly, invoke-interface
instructs the dvm to invoke an interface method.
So you can see that
super.onCreate(savedInstanceState);
translates to
invoke-super {p0, p1}, Landroid/app/Activity;->onCreate(Landroid/os/Bundle;)V
In this case, invoke-super
takes two parameters, the {p0, p1}
register group (p0 holds this and p1 holds savedInstanceState) and the Landroid/app/Activity;->onCreate(Landroid/os/Bundle;)V
parameter, which is the method reference it uses to look up and resolve the method if necessary.
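To make the shape of that method reference concrete, here is a minimal sketch that splits one into its parts. The helper name parse_method_ref and the regular expressions are my own for illustration; they are not part of smali or baksmali, and they only cover the plain "class->name(params)return" form shown in the listing.

```python
import re

# Hypothetical helper, not a smali/baksmali API.
# Matches: <class descriptor> "->" <method name> "(" <param types> ")" <return type>
METHOD_REF = re.compile(r"^(L[^;]+;)->([^(]+)\((.*)\)(.+)$")

# One type descriptor: optional array brackets, then either a primitive
# letter (Z, B, S, C, I, J, F, D) or an object type of the form L...;
TYPE_DESC = re.compile(r"\[*(?:[ZBSCIJFD]|L[^;]+;)")


def parse_method_ref(ref):
    """Split a textual method reference into class, name, params and return type."""
    m = METHOD_REF.match(ref)
    if m is None:
        raise ValueError("not a method reference: %r" % ref)
    owner, name, params, ret = m.groups()
    return {
        "class": owner,                       # e.g. Landroid/app/Activity;
        "name": name,                         # e.g. onCreate
        "params": TYPE_DESC.findall(params),  # e.g. ['Landroid/os/Bundle;']
        "returns": ret,                       # V means void
    }
```

Calling parse_method_ref on the reference from the invoke-super line gives the class Landroid/app/Activity;, the name onCreate, a single parameter Landroid/os/Bundle; and the return type V (void).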
Then there's the invoke-direct
call in the constructor area.
invoke-direct {p0}, Landroid/app/Activity;-><init>()V
Every class has at least one <init>
method, which is how a constructor appears in bytecode; it initializes the object's data members. When you construct an object, its constructor must also call the constructor of the superclass. This explains why the constructor for your class calls the Activity
constructor.
With regard to the manifest, what happens is that the build tooling that generates the apk file converts the manifest to a more compact format (binary XML) to save space. The manifest doesn't have anything to do with the code you posted; rather, it tells the system how to handle the application as a whole with regard to Activities
, Services
, etc. What you've posted is what actually gets executed.
That's a high-level answer to your question. If you need more, let me know and I'll do my best.
Edit: You're basically right. The disassembler reads the binary data as a byte stream from the dex file. It knows what the format should be and is able to pull out information like constants, classes, etc. With regard to the opcodes, that's exactly what it does. It knows the byte value for each opcode (that is, how it's represented in the dex file) and is able to convert that into a human-readable string. If you were going to implement this, aside from understanding the general basics of compilers, I would start with a deep understanding of the structure of a dex file. From there, you would need to construct a table that maps opcode values to their human-readable names. With that information, plus some additional information such as the string constants, you could construct a text-file representation of the compiled class. Does that make sense?
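As a rough illustration of that opcode table idea, here is a minimal sketch that decodes the instructions used in your onCreate method. The function name disassemble and the table layout are my own, only five opcodes are listed (a real tool needs the full table from the spec), and the method indices in the sample bytes (0001 and 0005) are made up. It prints raw register numbers: with .locals 1, baksmali's p0 is v1 and p1 is v2.

```python
import struct

# Opcode byte -> (mnemonic, instruction format name from the Dalvik spec).
# Deliberately partial: just enough for the Hello World listing.
OPCODES = {
    0x0E: ("return-void", "10x"),
    0x15: ("const/high16", "21h"),
    0x6E: ("invoke-virtual", "35c"),
    0x6F: ("invoke-super", "35c"),
    0x70: ("invoke-direct", "35c"),
}


def disassemble(code):
    """Decode bytes holding little-endian 16-bit code units into text lines."""
    units = struct.unpack("<%dH" % (len(code) // 2), code)
    out, pc = [], 0
    while pc < len(units):
        op = units[pc] & 0xFF   # low byte of the first unit is the opcode
        hi = units[pc] >> 8     # high byte carries format-specific operands
        if op not in OPCODES:
            raise ValueError("opcode 0x%02x is not in this partial table" % op)
        name, fmt = OPCODES[op]
        if fmt == "10x":        # no operands, one code unit
            out.append(name)
            pc += 1
        elif fmt == "21h":      # vAA, literal BBBB (the real value is BBBB << 16)
            out.append("%s v%d, 0x%x" % (name, hi, units[pc + 1]))
            pc += 2
        elif fmt == "35c":      # A|G|op BBBB F|E|D|C, three code units
            count = hi >> 4     # A: number of argument registers
            packed = units[pc + 2]
            # Argument order is C, D, E, F, then G.
            regs = [(packed >> (4 * i)) & 0xF for i in range(4)] + [hi & 0xF]
            args = ", ".join("v%d" % r for r in regs[:count])
            out.append("%s {%s}, method@%04x" % (name, args, units[pc + 1]))
            pc += 3
    return out
```

Feeding it bytes.fromhex("6f20010021001500037f6e20050001000e00") yields invoke-super {v1, v2}, method@0001, then const/high16 v0, 0x7f03, then invoke-virtual {v1, v0}, method@0005, then return-void. Turning method@0001 back into Landroid/app/Activity;->onCreate(Landroid/os/Bundle;)V is where the rest of the dex file comes in: the index points into the method table, which in turn points into the string and type tables.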
The opcode specification only describes the instructions. The dex file format is more than that: it contains all the metadata needed for the Dalvik VM (and the disassembler) to interpret the file, such as strings, classes, types, methods and so on. See also the official opcode spec; it's more complete and verbose than the one you linked.
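To tie this back to the hexdump in the question: the bytes 64 65 78 0a 30 33 are not opcodes. They are the ASCII characters "dex", a newline, and "03", i.e. the start of the magic number that opens the file header. A minimal sketch of reading that header follows; the name parse_dex_header is mine, and the field list is my reading of the header layout in the dex format document, so check it against the spec before relying on it.

```python
import struct

# Fixed 112-byte (0x70) header: 8-byte magic, Adler-32 checksum,
# 20-byte SHA-1 signature, then twenty little-endian 32-bit fields.
HEADER = struct.Struct("<8sI20s20I")
FIELDS = (
    "file_size", "header_size", "endian_tag", "link_size", "link_off",
    "map_off", "string_ids_size", "string_ids_off", "type_ids_size",
    "type_ids_off", "proto_ids_size", "proto_ids_off", "field_ids_size",
    "field_ids_off", "method_ids_size", "method_ids_off", "class_defs_size",
    "class_defs_off", "data_size", "data_off",
)


def parse_dex_header(data):
    """Return the header fields of a dex file as a dict."""
    if len(data) < HEADER.size:
        raise ValueError("too short to hold a dex header")
    magic, checksum, signature, *rest = HEADER.unpack_from(data)
    # Magic is "dex\n", three ASCII version digits, then a NUL byte.
    if magic[:4] != b"dex\n" or magic[7:8] != b"\x00":
        raise ValueError("bad magic: %r" % magic)
    header = dict(zip(FIELDS, rest))
    header["version"] = magic[4:7].decode("ascii")
    header["checksum"] = checksum
    return header
```

You would call it on the first 112 bytes read from classes.dex in binary mode. The size and offset pairs it returns (string_ids, type_ids, proto_ids, field_ids, method_ids, class_defs) tell the disassembler where each metadata table lives, and the class definitions then lead to the code items that hold the actual opcodes.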
<plug>
BTW, the next version of IDA Pro will support disassembly of .dex files</plug>