2016年3月16日 星期三

Interpreter in Android ART

Previously, we discussed about the new JIT execution engine in ART debuted in MarshMellow codebase. At the end of that article, I considered that the design of the interpreter is more important than the JIT compiler since interpreter is often the performance bottleneck. Thus, ART's interpreter is the very thing we're going to talk about in this article.

Over the decades, people who are working on interpreter has one conclusion: Interpreting would never be faster than compiled code execution. ART's interpreter  is not an exception either. But you don't need to be too sad: It's just a tradeoff among performance and other factors like cross-platform. There is no silver bullet at this time. One of the notorious key factors degrading the performance of interpreter is CPU's pipeline stalling. (Feel free to skip the following two sections if you're familiar with pipeline stalling)

CPU's pipeline needs to be as full as it can to reach peak performance and maximum instruction parallelism. But branch operations(e.g. if, switch) would interrupt this model, because before evaluating the condition expressions(e.g. expression between parentheses of if(...)), you can't fill any instructions into pipeline since CPU doesn't know which branch to go. Thus, CPU would try to "guess" which branch to execute and load instructions of that branch ahead before the evaluation of condition expressions. The reason is that if CPU "guess wrong", the cost of reloading correct instructions is same as the cost to wait for completeness of condition expressions(i.e. "stalling" of the pipeline). This techniques is called branch prediction.

However, there are a few things that would make branch prediction even harder: indirect addressing and large amount of condition branches to name a few. Indirect addressing is used in various of applications. For example, indirect function call, which the address of invoked function is stored in a pointer variable. Indirect addressing had been able to be improved in modern CPUs.  What we're interested is the second issue, large condition branches. Most of the branch prediction techniques use history-based profiling. In short, they  record the amount of appearance of each branch and perform prediction based on those statistics. Thus, too many condition branches, for example, large switch block, may not perform well in branch prediction since the branch history storage is limited, large amount of condition branches may frequently flush the storage.

According to the above explanation, interpreter which uses a large switch to decode op codes is not very efficient. However, the switch implementation is simpler to build and may have chance to do some higher level task in comparison with the assembly approach introduce later. Thus, many famous projects still use this approach. For example, Firefox's javascript engine, Spider Monkey, used this sort of techniques in their interpreter few years ago. ART's interpreter in our case also use switch implementation as a fallback which we would mention in the following sections.

So let's start our journey into ART's interpreter from Execute() in runtime/interpreter/interpreter.cc. First, it would check that whether there exists a JIT compiled version of this method if we're at the beginning of a it.
  if (LIKELY(shadow_frame.GetDexPC() == 0)) {  // Entering the method, but not via deoptimization.
    if (kIsDebugBuild) {
      self->AssertNoPendingException();
    }
    instrumentation::Instrumentation* instrumentation = Runtime::Current()->GetInstrumentation();
    ArtMethod *method = shadow_frame.GetMethod();

    if (UNLIKELY(instrumentation->HasMethodEntryListeners())) {
      instrumentation->MethodEnterEvent(self, shadow_frame.GetThisObject(code_item->ins_size_),
                                        method, 0);
    }

    jit::Jit* jit = Runtime::Current()->GetJit();
    if (jit != nullptr && jit->CanInvokeCompiledCode(method)) {
      JValue result;

      // Pop the shadow frame before calling into compiled code.
      self->PopShadowFrame();
      ArtInterpreterToCompiledCodeBridge(self, code_item, &shadow_frame, &result);
      // Push the shadow frame back as the caller will expect it.
      self->PushShadowFrame(&shadow_frame);

      return result;
    }
  }

Then, it would select the proper interpreter implementation.
    if (kInterpreterImplKind == kMterpImplKind) {
      if (transaction_active) {
        // No Mterp variant - just use the switch interpreter.
        return ExecuteSwitchImpl<false, true>(self, code_item, shadow_frame, result_register,
                                              false);
      } else if (UNLIKELY(!Runtime::Current()->IsStarted())) {
        return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                               false);
      } else {
        while (true) {
          // Mterp does not support all instrumentation/debugging.
          if (MterpShouldSwitchInterpreters()) {
            return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                                   false);
          }
          bool returned = ExecuteMterpImpl(self, code_item, &shadow_frame, &result_register);
          if (returned) {
            return result_register;
          } else {
            // Mterp didn't like that instruction.  Single-step it with the reference interpreter.
            result_register = ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame,
                                                               result_register, true);
            if (shadow_frame.GetDexPC() == DexFile::kDexNoIndex) {
              // Single-stepped a return or an exception not handled locally.  Return to caller.
              return result_register;
            }
          }
        }
      }
    } else if (kInterpreterImplKind == kSwitchImplKind) {

The target interpreter we're going to research is called Mterp, which probably stands for "Macro inTerpreter". It use hand written assembly code to decode and dispatch op code handling efficiently. It seems cool but actually this is the exact way how Dalvik interprets dex byte code, it just take some time to port Dalvik's interpreter to ART.

In addition to kInterpreterImplKind variable, there are other chances that would make interpreter fallback to old implementation like switch or goto. The first is MterpShouldSwitchInterpreters() in line 12 , but it is actually more like a debug sugar. In runtime/interpreter/mterp/mterp.cc:
extern "C" bool MterpShouldSwitchInterpreters()
    SHARED_REQUIRES(Locks::mutator_lock_) {
  const instrumentation::Instrumentation* const instrumentation =
      Runtime::Current()->GetInstrumentation();
  return instrumentation->NonJitProfilingActive() || Dbg::IsDebuggerActive();
}

The main entry point of Mterp is ExecuteMterpImpl(). We'll postpone analyzing that function and see the next few lines of code first. The if statement take return value of ExecuteMterpImpl() as condition expression ensures that if Mterp can't handle an instruction, there is a fallback solution since as the comment says, Mterp can't handle all of the instrumentation or debugging. But the question is: why there is a infinite while loop around these code? It turn out that ExecuteMterpImpl() would execute the whole method where ExecuteSwitchImpl() would only execute one instruction a time("single-stepped").

So, where is ExecuteMterpImpl() ? How does it use assembly code to boost the interpreting?

When it comes to assembly code, it means that this part of code is architecture dependent, which lives in separated folders under runtime/interpreter/mterp. Let's take arm for example. Under arm folder, there are lots of files which file name starts with "op_" contain only one line, indicating one instruction. The fact is that these scattered files need to be combined into a single assembly code: mterp_arm.S and generate into out folder by a script, rebuild.sh. There are also some files which names starts with "config-" allow one to configure this code generating process. Finally, in mterp_arm.S, we found our lovely ExecuteMterpImpl
.text
    .align  2
    .global ExecuteMterpImpl
    .type   ExecuteMterpImpl, %function

/*
 * On entry:
 *  r0  Thread* self/
 *  r1  code_item
 *  r2  ShadowFrame
 *  r3  JValue* result_register
 *
 */

ExecuteMterpImpl:
    .fnstart
    .save {r4-r10,fp,lr}
    stmfd   sp!, {r4-r10,fp,lr}         @ save 9 regs
    .pad    #4
    sub     sp, sp, #4                  @ align 64

    /* Remember the return register */
    str     r3, [r2, #SHADOWFRAME_RESULT_REGISTER_OFFSET]

    /* Remember the code_item */
    str     r1, [r2, #SHADOWFRAME_CODE_ITEM_OFFSET]

    /* set up "named" registers */
    mov     rSELF, r0
    ldr     r0, [r2, #SHADOWFRAME_NUMBER_OF_VREGS_OFFSET]
    add     rFP, r2, #SHADOWFRAME_VREGS_OFFSET     @ point to vregs.
    VREG_INDEX_TO_ADDR rREFS, r0 @ point to reference array in shadow frame
    ldr     r0, [r2, #SHADOWFRAME_DEX_PC_OFFSET]   @ Get starting dex_pc.
    add     rPC, r1, #CODEITEM_INSNS_OFFSET     @ Point to base of insns[]
    add     rPC, rPC, r0, lsl #1 @ Create direct pointer to 1st dex opcode
    EXPORT_PC

    /* Starting ibase */
    ldr     rIBASE, [rSELF, #THREAD_CURRENT_IBASE_OFFSET]

    /* start executing the instruction at rPC */
    FETCH_INST                          @ load rINST from rPC
    GET_INST_OPCODE ip                  @ extract opcode from rINST
    GOTO_OPCODE ip                      @ jump to next instruction
    /* NOTE: no fallthrough */

The above is just part of the code, and let's take a look at one of the operations, op_mov.
/* ------------------------------ */
    .balign 128
.L_op_move: /* 0x01 */
/* File: arm/op_move.S */
    /* for move, move-object, long-to-int */
    /* op vA, vB */
    mov     r1, rINST, lsr #12          @ r1<- B from 15:12
    ubfx    r0, rINST, #8, #4           @ r0<- A from 11:8
    FETCH_ADVANCE_INST 1                @ advance rPC, load rINST
    GET_VREG r2, r1                     @ r2<- fp[B]
    GET_INST_OPCODE ip                  @ ip<- opcode from rINST
    .if 0
    SET_VREG_OBJECT r2, r0              @ fp[A]<- r2
    .else
    SET_VREG r2, r0                     @ fp[A]<- r2
    .endif
    GOTO_OPCODE ip                      @ execute next instruction

/* ------------------------------ */

Now we have a more clear picture of the execution flow: ExecuteMterpImpl is the entry of one interpret unit, method in this case. It would use GOTO_OPCODE macro to query and execute the first op code. At the end of each op execution handler, it would fetch next op code, query and starts another op execution. The query procedure is documented in README.txt under mterp folder, in short, every op handler's entry point is located at the base of handler table + (opcode * 128), that is, shift left 7 bits. This approach create a linear, stream execution and of course, contains nearly no branch conditions. By the way, although it's not common to see, there are actually lots of places in AOSP use hand-written assembly to boost the execution performance, for example, the memory allocator for ART.

ART's interpreter try really hard to avoid branch execution and pipeline stalling. Although this techniques had already appeared in Dalvik, I think this sort of interpreter design is probably the fastest at this time.


沒有留言:

張貼留言