
How to parse LLVM IR line by line

I need to parse LLVM IR code line by line at runtime from my C++ code, so that I know which operation is applied to which operands on each line.

For example, if the IR code is:

%0 = load i32* %a, align 4

I would like to know, at runtime in my C++ code, that the value at %a is being loaded into %0. I have considered writing a simple text-parsing C++ program for this (parse the IR and search for IR keywords), but would like to know if there are any existing libraries (possibly from LLVM itself) that would let me avoid doing so.

Saksham Jain asked May 12 '15 15:05


1 Answer

Assumption

In theory, we could use llvm::LLLexer directly to write our own line-by-line parser for LLVM IR.

The following answer assumes you are only interested in the operations inside each function of the LLVM IR file, since operations can only appear inside a function. Other parts of the IR, such as structure definitions and function declarations, carry only type information and contain no operations.
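
For a sense of what the plain text approach from the question involves, here is a minimal sketch that splits one instruction line into its result, opcode and operand text. It is not LLVM's lexer: ParsedLine and splitIrLine are hypothetical names made up for this illustration, and the code assumes only the simple "[%res =] opcode operands" shape.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper type for illustration; not part of LLVM.
struct ParsedLine {
  std::string result;   // e.g. "%0"; empty if the instruction has no result
  std::string opcode;   // e.g. "load"
  std::string operands; // remaining text, e.g. "i32* %a, align 4"
};

// Splits one textual IR instruction into result, opcode and operand text.
// Assumption: the line has the simple form "[%res =] opcode operands...".
ParsedLine splitIrLine(const std::string &line) {
  ParsedLine out;
  std::string rest = line;
  const std::string::size_type eq = line.find(" = ");
  if (eq != std::string::npos && line[0] == '%') {
    out.result = line.substr(0, eq); // text before " = "
    rest = line.substr(eq + 3);      // text after " = "
  }
  std::istringstream in(rest);
  in >> out.opcode;                          // first word is the opcode
  std::getline(in >> std::ws, out.operands); // everything else, untrimmed
  return out;
}
```

Calling splitIrLine on the line from the question gives the result "%0", the opcode "load" and the operand text "i32* %a, align 4". Anything beyond that (types, constants, metadata, multi-line constructs) would need real parsing, which is the reason to prefer LLVM's own reader.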

Implementation

Based on the above assumption, parsing LLVM IR line by line for operation information can be translated into visiting each operation in each function of the LLVM IR file.

LLVM already has an implementation that parses an LLVM IR file and exposes information about its operations directly. Since the functions in the module are kept in the order in which they appear in the LLVM IR file, the operation sequence output by the following implementation is the same as the operation sequence in the given file.

Therefore we could take advantage of the parseBitcodeFile interface provided by LLVM. Note that parseBitcodeFile reads the binary bitcode form (.bc), not the textual form. For textual .ll files, parseIRFile (declared in llvm/IRReader/IRReader.h) accepts either form; for text it uses an llvm::LLLexer to split the file into tokens and feeds the tokens to the parser. Either way the result is a module (here an ErrorOr<llvm::Module *>) whose function list is in the same order as in the LLVM IR file.

Then we could visit each llvm::BasicBlock of each llvm::Function in the llvm::Module, iterate over each llvm::Instruction, and get information about each operand, which is an llvm::Value. The implementation code follows.

#include <iostream>
#include <string>
#include <llvm/Support/MemoryBuffer.h>
#include <llvm/Support/ErrorOr.h>
#include <llvm/IR/Module.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/Bitcode/ReaderWriter.h>
#include <llvm/Support/raw_ostream.h>

using namespace llvm;

int main(int argc, char *argv[]) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " bitcode_filename" << std::endl;
    return 1;
  }
  StringRef filename = argv[1];
  LLVMContext context;

  ErrorOr<std::unique_ptr<MemoryBuffer>> fileOrErr =
    MemoryBuffer::getFileOrSTDIN(filename);
  if (std::error_code ec = fileOrErr.getError()) {
    std::cerr << " Error opening input file: " + ec.message() << std::endl;
    return 2;
  }
  ErrorOr<llvm::Module *> moduleOrErr =
      parseBitcodeFile(fileOrErr.get()->getMemBufferRef(), context);
  // Check the parse result here, not the file buffer again.
  if (std::error_code ec = moduleOrErr.getError()) {
    std::cerr << "Error reading Module: " + ec.message() << std::endl;
    return 3;
  }

  Module *m = moduleOrErr.get();
  std::cout << "Successfully read Module:" << std::endl;
  std::cout << " Name: " << m->getName().str() << std::endl;
  std::cout << " Target triple: " << m->getTargetTriple() << std::endl;

  for (auto iter1 = m->getFunctionList().begin();
       iter1 != m->getFunctionList().end(); iter1++) {
    Function &f = *iter1;
    std::cout << " Function: " << f.getName().str() << std::endl;
    for (auto iter2 = f.getBasicBlockList().begin();
         iter2 != f.getBasicBlockList().end(); iter2++) {
      BasicBlock &bb = *iter2;
      std::cout << "  BasicBlock: " << bb.getName().str() << std::endl;
      for (auto iter3 = bb.begin(); iter3 != bb.end(); iter3++) {
        Instruction &inst = *iter3;
        std::cout << "   Instruction " << &inst << " : " << inst.getOpcodeName();

        // Print each operand: its name if it has one, otherwise its address.
        // Alternative: opnd->printAsOperand(os, true, m) on a
        // raw_string_ostream.
        unsigned int opnt_cnt = inst.getNumOperands();
        for (unsigned int i = 0; i < opnt_cnt; ++i) {
          Value *opnd = inst.getOperand(i);
          if (opnd->hasName()) {
            std::cout << " " << opnd->getName().str() << ",";
          } else {
            std::cout << " ptr" << opnd << ",";
          }
        }
        std::cout << std::endl;
      }
    }
  }
  return 0;
}

Use the following command to build the executable:

clang++ ReadBitCode.cpp -o reader `llvm-config --cxxflags --libs --ldflags --system-libs`

Take the following C code as an example:

struct a {
  int f_a;
  int f_b;
  char f_c:5;
  char f_d:4;
};

int my_func( int arg1, struct a obj_a) {
  int x = arg1;
  return x+1 + obj_a.f_c;
}

int main() {
  int a = 11;
  int b = 22;
  int c = 33;
  int d = 44;
  struct a obj_a;
  obj_a.f_a = 1;
  obj_a.f_b = 2;
  obj_a.f_c = 3;
  obj_a.f_c = 4;
  if ( a > 10 ) {
    b = c;
  } else {
    b = my_func(d, obj_a);
  }
  return b;
}

Compile it to bitcode and run the reader on the result:

clang -emit-llvm -o foo.bc -c foo.c
./reader foo.bc

The output should be something like the following:

 Name: foo.bc
 Target triple: x86_64-unknown-linux-gnu
 Function: my_func
  BasicBlock: entry
   Instruction 0x18deb68 : alloca ptr0x18db940,
   Instruction 0x18debe8 : alloca ptr0x18db940,
   Instruction 0x18dec68 : alloca ptr0x18db940,
   Instruction 0x18dece8 : alloca ptr0x18db940,
   Instruction 0x18de968 : getelementptr coerce, ptr0x18de880, ptr0x18de880,
   Instruction 0x18de9f0 : store obj_a.coerce0, ptr0x18de968,
   Instruction 0x18df0a8 : getelementptr coerce, ptr0x18de880, ptr0x18db940,
   Instruction 0x18df130 : store obj_a.coerce1, ptr0x18df0a8,
   Instruction 0x18df1a8 : bitcast obj_a,
   Instruction 0x18df218 : bitcast coerce,
   Instruction 0x18df300 : call ptr0x18df1a8, ptr0x18df218, ptr0x18de8d0, ptr0x18de1a0, ptr0x18de1f0, llvm.memcpy.p0i8.p0i8.i64,
   Instruction 0x18df3a0 : store arg1, arg1.addr,
   Instruction 0x18df418 : load arg1.addr,
   Instruction 0x18df4a0 : store ptr0x18df418, x,
   Instruction 0x18df518 : load x,
   Instruction 0x18df5a0 : add ptr0x18df518, ptr0x18db940,
   Instruction 0x18df648 : getelementptr obj_a, ptr0x18de880, ptr0x18deab0,
   Instruction 0x18df6b8 : load f_c,
   Instruction 0x18df740 : shl bf.load, ptr0x18deb00,
   Instruction 0x18df7d0 : ashr bf.shl, ptr0x18deb00,
   Instruction 0x18df848 : sext bf.ashr,
   Instruction 0x18df8d0 : add add, conv,
   Instruction 0x18df948 : ret add1,
 Function: llvm.memcpy.p0i8.p0i8.i64
 Function: main
  BasicBlock: entry
   Instruction 0x18e0078 : alloca ptr0x18db940,
   Instruction 0x18e00f8 : alloca ptr0x18db940,
   Instruction 0x18e0178 : alloca ptr0x18db940,
   Instruction 0x18e01f8 : alloca ptr0x18db940,
   Instruction 0x18e0278 : alloca ptr0x18db940,
   Instruction 0x18e02f8 : alloca ptr0x18db940,
   Instruction 0x18e0378 : alloca ptr0x18db940,
   Instruction 0x18e0410 : store ptr0x18de880, retval,
   Instruction 0x18e04a0 : store ptr0x18dfe30, a,
   Instruction 0x18e0530 : store ptr0x18dfe80, b,
   Instruction 0x18e05c0 : store ptr0x18dfed0, c,
   Instruction 0x18e0650 : store ptr0x18dff20, d,
   Instruction 0x18e06f8 : getelementptr obj_a, ptr0x18de880, ptr0x18de880,
   Instruction 0x18e0780 : store ptr0x18db940, f_a,
   Instruction 0x18e0828 : getelementptr obj_a, ptr0x18de880, ptr0x18db940,
   Instruction 0x18e08b0 : store ptr0x18deab0, f_b,
   Instruction 0x18e0958 : getelementptr obj_a, ptr0x18de880, ptr0x18deab0,
   Instruction 0x18e09c8 : load f_c,
   Instruction 0x18e0a50 : and bf.load, ptr0x18dff70,
   Instruction 0x18e0ae0 : or bf.clear, ptr0x18deb00,
   Instruction 0x18e0b70 : store bf.set, f_c,
   Instruction 0x18e0c18 : getelementptr obj_a, ptr0x18de880, ptr0x18deab0,
   Instruction 0x18e0c88 : load f_c1,
   Instruction 0x18e0d10 : and bf.load2, ptr0x18dff70,
   Instruction 0x18e0da0 : or bf.clear3, ptr0x18dffc0,
   Instruction 0x18ded80 : store bf.set4, f_c1,
   Instruction 0x18dedf8 : load a,
   Instruction 0x18dee80 : icmp ptr0x18dedf8, ptr0x18e0010,
   Instruction 0x18def28 : br cmp, if.else, if.then,
  BasicBlock: if.then
   Instruction 0x18def98 : load c,
   Instruction 0x18e1440 : store ptr0x18def98, b,
   Instruction 0x18df008 : br if.end,
  BasicBlock: if.else
   Instruction 0x18e14b8 : load d,
   Instruction 0x18e1528 : bitcast obj_a.coerce,
   Instruction 0x18e1598 : bitcast obj_a,
   Instruction 0x18e1680 : call ptr0x18e1528, ptr0x18e1598, ptr0x18de8d0, ptr0x18de880, ptr0x18de1f0, llvm.memcpy.p0i8.p0i8.i64,
   Instruction 0x18e1738 : getelementptr obj_a.coerce, ptr0x18de880, ptr0x18de880,
   Instruction 0x18e17a8 : load ptr0x18e1738,
   Instruction 0x18e1848 : getelementptr obj_a.coerce, ptr0x18de880, ptr0x18db940,
   Instruction 0x18e18b8 : load ptr0x18e1848,
   Instruction 0x18e1970 : call ptr0x18e14b8, ptr0x18e17a8, ptr0x18e18b8, my_func,
   Instruction 0x18e1a10 : store call, b,
   Instruction 0x18e1a88 : br if.end,
  BasicBlock: if.end
   Instruction 0x18e1af8 : load b,
   Instruction 0x18e1b68 : ret ptr0x18e1af8,

Explanation

To make sense of the above output, note the following points.

LLVM uses the instruction address as the result value id

Internally, LLVM uses the address of each instruction object to represent that instruction's result value. When the result is used by another instruction, that instruction's operand points directly at the defining instruction.

In the human-readable IR generated by clang, names such as %add and %conv are value names attached to the instructions, while numbers such as %0 are assigned to unnamed values by the LLVM IR writer for readability only. That is why unnamed operands appear in the output above only as addresses.
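
The idea can be illustrated with a toy model. This is a sketch only: ToyInst and printWithSlots are hypothetical names and not LLVM classes, and unlike the real IR writer it numbers every unnamed instruction, including ones that produce no value. Operands are plain pointers to the defining instruction, and the %0, %1, ... labels exist only in a table built at print time.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for llvm::Instruction (hypothetical, for illustration only).
struct ToyInst {
  std::string opcode;
  std::string name;                 // empty when the value is unnamed
  std::vector<const ToyInst *> ops; // operands point at defining instructions
};

// Labels unnamed instructions %0, %1, ... in order of appearance and
// renders every instruction as one line of text.
std::vector<std::string>
printWithSlots(const std::vector<const ToyInst *> &body) {
  std::map<const ToyInst *, std::string> label; // address -> printed name
  unsigned next = 0;
  for (const ToyInst *inst : body)
    label[inst] =
        "%" + (inst->name.empty() ? std::to_string(next++) : inst->name);

  std::vector<std::string> lines;
  for (const ToyInst *inst : body) {
    std::string line = label[inst] + " = " + inst->opcode;
    for (std::size_t i = 0; i < inst->ops.size(); ++i)
      line += (i == 0 ? " " : ", ") + label[inst->ops[i]];
    lines.push_back(line);
  }
  return lines;
}
```

The numbers are never stored in the instructions themselves; the same body printed after inserting another unnamed instruction at the front would get different numbers, while the pointers between instructions would stay the same.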

LLVM Instruction class does not have LLVM IR file line number info

LLVM IR only carries line number information for the original C source code, and only when debug information is present. That means we cannot find out the line number of each operation within the LLVM IR file itself.

Therefore, although we can visit the operations one by one in file order, we cannot know which line of the IR file each operation is located on.

Reference

The above source code is adapted from "How to write a custom intermodular pass in LLVM?" and modified for this question.

Kun Ling answered Oct 14 '22 18:10