
LLVM IR: Identifying Variables with Metadata Nodes

Tags:

c++

llvm

llvm-ir

I am currently working on a tool that identifies load and store accesses to global and field variables in arbitrary programs. The accessed variables should also be identified by their source-level names/identifiers. To accomplish this, I compile the source code of the program under analysis into LLVM IR with debug information. So far so good: the generated metadata nodes contain the desired source-level identifiers. However, I am unable to connect some LLVM IR identifiers to the information in the metadata.

For example, consider a static member of a class:

class TestClass {
  public:
    static int Number;
};

The corresponding LLVM IR looks like this:

@_ZN12TestClass6NumberE = external global i32, align 4

...
!15 = !DIDerivedType(tag: DW_TAG_member, name: "Number", scope: !"_ZTS12TestClass", file: !12, line: 5, baseType: !16, flags: DIFlagPublic | DIFlagStaticMember)

In this controlled example I know that "@_ZN12TestClass6NumberE" is the identifier for "Number". In general, however, I fail to see how to find out which IR identifiers correspond to which metadata nodes.

Can somebody help me out?

NicoKop asked Jan 03 '16 15:01


1 Answer

Since no one seems to have a good solution to my problem, I will describe my own, admittedly inconvenient, approach. LLVM's generated metadata nodes contain information about the types and variables defined in the code. However, there is no information about which generated IR variables correspond to which source code variables. LLVM merely links the metadata of IR instructions to the corresponding source locations (lines and columns). This makes sense, since the main purpose of LLVM's metadata is debugging, not analysis.

Still, the contained information is not useless. My solution to this problem is to use the clang AST to analyze the source code. There we learn which variable is accessed at which source location. So, to obtain the source variable identities during LLVM IR instrumentation, we first map source locations to source variable identities during the clang AST analysis. As a second step, we perform the IR instrumentation using this previously gathered information. When we encounter a store or load instruction in the IR, we look up its corresponding source location in the instruction's metadata node. Since we have mapped source locations to source variable identities, we can now easily retrieve the source variable identity for the IR instruction.
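The two-step mapping described above can be sketched in plain C++ roughly as follows. This is only an illustration under assumptions: `SourceLoc`, `VarTable`, `recordAccess`, and `lookupVariable` are hypothetical names, not part of clang or LLVM. In a real tool the first step would call something like `recordAccess` from a clang AST traversal, and the second step would take the line and column from the instruction's debug location (for example via `Instruction::getDebugLoc()`) before calling `lookupVariable`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical key: a source position as seen by both the AST and the debug info.
struct SourceLoc {
    std::string file;
    unsigned line;
    unsigned column;

    // Lexicographic ordering so SourceLoc can be used as a std::map key.
    bool operator<(const SourceLoc &other) const {
        return std::tie(file, line, column) <
               std::tie(other.file, other.line, other.column);
    }
};

// Maps a source location to the qualified name of the variable accessed there.
using VarTable = std::map<SourceLoc, std::string>;

// Step 1 (AST side): remember which variable is accessed at a location.
void recordAccess(VarTable &table, const SourceLoc &loc,
                  const std::string &qualifiedName) {
    assert(loc.line > 0 && "line 0 means 'unknown location' in debug info");
    table[loc] = qualifiedName;
}

// Step 2 (IR side): resolve the location of a load/store to a variable name.
// Returns nullptr if the AST pass recorded nothing for this location.
const std::string *lookupVariable(const VarTable &table, const SourceLoc &loc) {
    auto it = table.find(loc);
    return it == table.end() ? nullptr : &it->second;
}
```

Note that this sketch assumes the AST and the debug information report the same line and column for an access, and that at most one variable is accessed per location. If several accesses can share a location, a `std::multimap` or a vector of names per key would be needed instead.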

So why do I not just use the clang AST to identify stores and loads on variables? Because distinguishing reads from writes in the AST is not a simple task. The AST can easily tell you that a variable is accessed, but whether it is read or written depends on the operation. I would therefore have to consider every single operation/operator to determine whether the variable is written, read, or both. LLVM IR is much simpler and more low-level in this regard, and as such less error-prone. Furthermore, the actual instrumentation (that is, code insertion) is much more difficult in the AST than it is with LLVM. For these two reasons, I believe that a combination of clang AST analysis and LLVM IR instrumentation is the best solution for my problem.
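To illustrate why the operator matters, here is a small made-up C++ example (the names `counter`, `bump`, and `demo` are hypothetical). Every statement mentions `counter`, yet the kind of access differs from one operator to the next, whereas in unoptimized IR each access appears as an explicit `load` or `store`.

```cpp
int counter = 0;  // global variable under observation

// Takes the variable by reference; the callee decides how it is accessed.
void bump(int &x) { x += 1; }

int demo() {
    counter = 5;            // write only
    int copy = counter;     // read only
    counter += 2;           // read and write
    ++counter;              // read and write
    bump(counter);          // access kind is not visible at the call site
    return copy + counter;  // read only
}
```

An AST-based analysis would need a separate rule for each of these forms, including compound assignment, increment, and reference passing.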

NicoKop answered Nov 18 '22 17:11