2016年3月21日 星期一

Ignition - The Interpreter in V8 Javascript Engine

Mentioned in my previous article, V8 started to use a new interpreter, Ignition, to replace baseline compiler on some low-memory devices. Let's check it out!

Interpreters are notorious for its performance bottleneck, and they're also proved to be slower than most of the JIT runtimes. I've explained some of the bottlenecks of interpreters and background knowledge in one of my blog post(section 2 and 3). So I think folks in Chromium team must spend lots of efforts on Ignition before bringing it rivaling performance with the old JIT solution.

There has already a document written by Chromium team about Ignition. To summarize its content, Ignition uses a streaming execution flow which executes one byte code handler after another. That is, a byte code handler function would recursively invoke the dispatcher at the end of its function body. Of course, tail call optimization is required to avoid smashing the stack. The following figure illustrates overview of the execution flow.
So let's start our code journey. Everything starts from (src/interpreter/interpreter.cc) Interpreter::Initialize(). We encounter a strange macro immediately: GENERATE_CODE(Name, ...).
#define GENERATE_CODE(Name, ...)                                        \
  {                                                                     \
    InterpreterAssembler assembler(isolate_, &zone, Bytecode::k##Name); \
    Do##Name(&assembler);                                               \
    Handle<Code> code = assembler.GenerateCode();                       \
    dispatch_table_[Bytecodes::ToByte(Bytecode::k##Name)] = *code;      \
    TraceCodegen(code);                                                 \
    LOG_CODE_EVENT(isolate_,                                            \
                   CodeCreateEvent(Logger::BYTECODE_HANDLER_TAG,        \
                                   AbstractCode::cast(*code), #Name));  \
  }
  BYTECODE_LIST(GENERATE_CODE)
#undef GENERATE_CODE

It is undefined right away, so we can inferred that it is only used by BYTECODE_LIST. As its name, BYTECODE_LIST declares all of the byte codes. And GENERATE_CODE here just help it defining their handlers and some dispatch information. The Do##Name(&assembler) statement seems to has something to do with byte code handlers, let's grep one of them and see what's going on.
// Mov <src> <dst>
//
// Stores the value of register <src> to register <dst>.
void Interpreter::DoMov(InterpreterAssembler* assembler) {
  Node* src_index = __ BytecodeOperandReg(0);
  Node* src_value = __ LoadRegister(src_index);
  Node* dst_index = __ BytecodeOperandReg(1);
  __ StoreRegister(src_value, dst_index);
  __ Dispatch();
}

This is the byte code handler for mov. It looks just like a normal byte code handler, right? Then have a closer look: Do##Name(&assembler) is a function call instead of a callback object we usually seen in many interpreters. In another word, Do##Name(&assembler) acts more like a routine responds for configuring byte code actions. After scrolling to the top of interpreter.cc, you can find that the inconspicuous "__" above turns out to be pretty important.
#define __ assembler->

InterpreterAssembler is respond for encapsulating a small group of assembly lines as a byte code handler, and also, abstracting the underlying platform to provide a uniform assembly code emitting interface here. What's more, we can witness more power on this kind of design in comparison with normal byte-code-handler-function fashions.

Next, we'll focus on the InterpreterAssembler::Dispatch() function. It will be delegated to several routines.
void InterpreterAssembler::Dispatch() {
  DispatchTo(Advance(Bytecodes::Size(bytecode_)));
}

void InterpreterAssembler::DispatchTo(Node* new_bytecode_offset) {
  Node* target_bytecode = Load(
      MachineType::Uint8(), BytecodeArrayTaggedPointer(), new_bytecode_offset);
  if (kPointerSize == 8) {
    target_bytecode = ChangeUint32ToUint64(target_bytecode);
  }

  // TODO(rmcilroy): Create a code target dispatch table to avoid conversion
  // from code object on every dispatch.
  Node* target_code_object =
      Load(MachineType::Pointer(), DispatchTableRawPointer(),
           WordShl(target_bytecode, IntPtrConstant(kPointerSizeLog2)));

  DispatchToBytecodeHandler(target_code_object, new_bytecode_offset);
}

void InterpreterAssembler::DispatchToBytecodeHandler(Node* handler,
                                                     Node* bytecode_offset) {
  if (FLAG_trace_ignition) {
    TraceBytecode(Runtime::kInterpreterTraceBytecodeExit);
  }

  InterpreterDispatchDescriptor descriptor(isolate());
  Node* args[] = {GetAccumulator(),          RegisterFileRawPointer(),
                  bytecode_offset,           BytecodeArrayTaggedPointer(),
                  DispatchTableRawPointer(), GetContext()};
  TailCall(descriptor, handler, args, 0);
}

The Advance() call in Dispatch() is clearly to be stepping to next byte code, and Load() call in DispatchTo() is responds for loading byte code object. Everything looks normal, except the TailCall() at the bottom of DispatchToBytecodeHandler().

As mentioned in the above sections, Ignition invokes byte code handlers
one after another in conjunction with dispatchers. Another key factor is that tail call optimization is required for this concatenation approach or stack overflow may occur. TailCall() in the above snippet reside in CodeStubAssembler class, which is also the parent class of InterpreterAssembler, it will then delegate the flow to RawMachineAssembler::TailCallN().
Node* RawMachineAssembler::TailCallN(CallDescriptor* desc, Node* function,
                                     Node** args) {
  int param_count =
      static_cast<int>(desc->GetMachineSignature()->parameter_count());
  int input_count = param_count + 1;
  Node** buffer = zone()->NewArray<Node*>(input_count);
  int index = 0;
  buffer[index++] = function;
  for (int i = 0; i < param_count; i++) {
    buffer[index++] = args[i];
  }
  Node* tail_call = MakeNode(common()->TailCall(desc), input_count, buffer);
  NodeProperties::MergeControlToEnd(graph(), common(), tail_call);
  schedule()->AddTailCall(CurrentBlock(), tail_call);
  current_block_ = nullptr;
  return tail_call;
}
There are many V8's abstraction types like Node and CallDescriptor in the above code, let's just not dig into those objects first. What we can infer is that a Node object seems to represent a piece of execution code and TailCallN() creates a special "tail call node" that is appended at the bottom of current execution block.

This approach sounds reasonable, and it also appeal the pros we mentioned before on the adoption of InterpreterAssembler rather than byte code handler functions. If we use byte code handler functions as handlers and also want to have the same cascading execution fashion, tail call optimization would all be depending on compiler. I'm not judging that modern compilers can't do this job well, but we want more explicit controls on the behavior of our interpreter. In contrast, the InterpreterAssembler abstracts handlers into code pieces which can be freely arranged in any place in the execution flow. So getting the same effect of tail call optimization is just a piece of code(cake)

Last but not the least, we want to know more about the Node class. It turn our that Node is just an interface representing a node in a graph. There are lots of implementation in the entire V8 codebase. Like AstGraph, BytecodeGraph and RawMachineAssembler in our case. 

There are two common traits in both Ignition and Android ART's interpreter mentioned in my previous article

  1. They call byte code handlers and dispatchers(byte code decoders) one after another without returning to the parent function or using loop statement to iterate though byte codes. 
  2. They directly emit assembly code as the main execution core for both performance and design concern(like reaching tail call optimization). 

In my opinion, both of the traits above represent the latest modern interpreter design. We can use them in many projects as long as they have similar interpreting execution flow. 

2016年3月16日 星期三

Interpreter in Android ART

Previously, we discussed about the new JIT execution engine in ART debuted in MarshMellow codebase. At the end of that article, I considered that the design of the interpreter is more important than the JIT compiler since interpreter is often the performance bottleneck. Thus, ART's interpreter is the very thing we're going to talk about in this article.

Over the decades, people who are working on interpreter has one conclusion: Interpreting would never be faster than compiled code execution. ART's interpreter  is not an exception either. But you don't need to be too sad: It's just a tradeoff among performance and other factors like cross-platform. There is no silver bullet at this time. One of the notorious key factors degrading the performance of interpreter is CPU's pipeline stalling. (Feel free to skip the following two sections if you're familiar with pipeline stalling)

CPU's pipeline needs to be as full as it can to reach peak performance and maximum instruction parallelism. But branch operations(e.g. if, switch) would interrupt this model, because before evaluating the condition expressions(e.g. expression between parentheses of if(...)), you can't fill any instructions into pipeline since CPU doesn't know which branch to go. Thus, CPU would try to "guess" which branch to execute and load instructions of that branch ahead before the evaluation of condition expressions. The reason is that if CPU "guess wrong", the cost of reloading correct instructions is same as the cost to wait for completeness of condition expressions(i.e. "stalling" of the pipeline). This techniques is called branch prediction.

However, there are a few things that would make branch prediction even harder: indirect addressing and large amount of condition branches to name a few. Indirect addressing is used in various of applications. For example, indirect function call, which the address of invoked function is stored in a pointer variable. Indirect addressing had been able to be improved in modern CPUs.  What we're interested is the second issue, large condition branches. Most of the branch prediction techniques use history-based profiling. In short, they  record the amount of appearance of each branch and perform prediction based on those statistics. Thus, too many condition branches, for example, large switch block, may not perform well in branch prediction since the branch history storage is limited, large amount of condition branches may frequently flush the storage.

According to the above explanation, interpreter which uses a large switch to decode op codes is not very efficient. However, the switch implementation is simpler to build and may have chance to do some higher level task in comparison with the assembly approach introduce later. Thus, many famous projects still use this approach. For example, Firefox's javascript engine, Spider Monkey, used this sort of techniques in their interpreter few years ago. ART's interpreter in our case also use switch implementation as a fallback which we would mention in the following sections.

So let's start our journey into ART's interpreter from Execute() in runtime/interpreter/interpreter.cc. First, it would check that whether there exists a JIT compiled version of this method if we're at the beginning of a it.
  if (LIKELY(shadow_frame.GetDexPC() == 0)) {  // Entering the method, but not via deoptimization.
    if (kIsDebugBuild) {
      self->AssertNoPendingException();
    }
    instrumentation::Instrumentation* instrumentation = Runtime::Current()->GetInstrumentation();
    ArtMethod *method = shadow_frame.GetMethod();

    if (UNLIKELY(instrumentation->HasMethodEntryListeners())) {
      instrumentation->MethodEnterEvent(self, shadow_frame.GetThisObject(code_item->ins_size_),
                                        method, 0);
    }

    jit::Jit* jit = Runtime::Current()->GetJit();
    if (jit != nullptr && jit->CanInvokeCompiledCode(method)) {
      JValue result;

      // Pop the shadow frame before calling into compiled code.
      self->PopShadowFrame();
      ArtInterpreterToCompiledCodeBridge(self, code_item, &shadow_frame, &result);
      // Push the shadow frame back as the caller will expect it.
      self->PushShadowFrame(&shadow_frame);

      return result;
    }
  }

Then, it would select the proper interpreter implementation.
    if (kInterpreterImplKind == kMterpImplKind) {
      if (transaction_active) {
        // No Mterp variant - just use the switch interpreter.
        return ExecuteSwitchImpl<false, true>(self, code_item, shadow_frame, result_register,
                                              false);
      } else if (UNLIKELY(!Runtime::Current()->IsStarted())) {
        return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                               false);
      } else {
        while (true) {
          // Mterp does not support all instrumentation/debugging.
          if (MterpShouldSwitchInterpreters()) {
            return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                                   false);
          }
          bool returned = ExecuteMterpImpl(self, code_item, &shadow_frame, &result_register);
          if (returned) {
            return result_register;
          } else {
            // Mterp didn't like that instruction.  Single-step it with the reference interpreter.
            result_register = ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame,
                                                               result_register, true);
            if (shadow_frame.GetDexPC() == DexFile::kDexNoIndex) {
              // Single-stepped a return or an exception not handled locally.  Return to caller.
              return result_register;
            }
          }
        }
      }
    } else if (kInterpreterImplKind == kSwitchImplKind) {

The target interpreter we're going to research is called Mterp, which probably stands for "Macro inTerpreter". It use hand written assembly code to decode and dispatch op code handling efficiently. It seems cool but actually this is the exact way how Dalvik interprets dex byte code, it just take some time to port Dalvik's interpreter to ART.

In addition to kInterpreterImplKind variable, there are other chances that would make interpreter fallback to old implementation like switch or goto. The first is MterpShouldSwitchInterpreters() in line 12 , but it is actually more like a debug sugar. In runtime/interpreter/mterp/mterp.cc:
extern "C" bool MterpShouldSwitchInterpreters()
    SHARED_REQUIRES(Locks::mutator_lock_) {
  const instrumentation::Instrumentation* const instrumentation =
      Runtime::Current()->GetInstrumentation();
  return instrumentation->NonJitProfilingActive() || Dbg::IsDebuggerActive();
}

The main entry point of Mterp is ExecuteMterpImpl(). We'll postpone analyzing that function and see the next few lines of code first. The if statement take return value of ExecuteMterpImpl() as condition expression ensures that if Mterp can't handle an instruction, there is a fallback solution since as the comment says, Mterp can't handle all of the instrumentation or debugging. But the question is: why there is a infinite while loop around these code? It turn out that ExecuteMterpImpl() would execute the whole method where ExecuteSwitchImpl() would only execute one instruction a time("single-stepped").

So, where is ExecuteMterpImpl() ? How does it use assembly code to boost the interpreting?

When it comes to assembly code, it means that this part of code is architecture dependent, which lives in separated folders under runtime/interpreter/mterp. Let's take arm for example. Under arm folder, there are lots of files which file name starts with "op_" contain only one line, indicating one instruction. The fact is that these scattered files need to be combined into a single assembly code: mterp_arm.S and generate into out folder by a script, rebuild.sh. There are also some files which names starts with "config-" allow one to configure this code generating process. Finally, in mterp_arm.S, we found our lovely ExecuteMterpImpl
.text
    .align  2
    .global ExecuteMterpImpl
    .type   ExecuteMterpImpl, %function

/*
 * On entry:
 *  r0  Thread* self/
 *  r1  code_item
 *  r2  ShadowFrame
 *  r3  JValue* result_register
 *
 */

ExecuteMterpImpl:
    .fnstart
    .save {r4-r10,fp,lr}
    stmfd   sp!, {r4-r10,fp,lr}         @ save 9 regs
    .pad    #4
    sub     sp, sp, #4                  @ align 64

    /* Remember the return register */
    str     r3, [r2, #SHADOWFRAME_RESULT_REGISTER_OFFSET]

    /* Remember the code_item */
    str     r1, [r2, #SHADOWFRAME_CODE_ITEM_OFFSET]

    /* set up "named" registers */
    mov     rSELF, r0
    ldr     r0, [r2, #SHADOWFRAME_NUMBER_OF_VREGS_OFFSET]
    add     rFP, r2, #SHADOWFRAME_VREGS_OFFSET     @ point to vregs.
    VREG_INDEX_TO_ADDR rREFS, r0 @ point to reference array in shadow frame
    ldr     r0, [r2, #SHADOWFRAME_DEX_PC_OFFSET]   @ Get starting dex_pc.
    add     rPC, r1, #CODEITEM_INSNS_OFFSET     @ Point to base of insns[]
    add     rPC, rPC, r0, lsl #1 @ Create direct pointer to 1st dex opcode
    EXPORT_PC

    /* Starting ibase */
    ldr     rIBASE, [rSELF, #THREAD_CURRENT_IBASE_OFFSET]

    /* start executing the instruction at rPC */
    FETCH_INST                          @ load rINST from rPC
    GET_INST_OPCODE ip                  @ extract opcode from rINST
    GOTO_OPCODE ip                      @ jump to next instruction
    /* NOTE: no fallthrough */

The above is just part of the code, and let's take a look at one of the operations, op_mov.
/* ------------------------------ */
    .balign 128
.L_op_move: /* 0x01 */
/* File: arm/op_move.S */
    /* for move, move-object, long-to-int */
    /* op vA, vB */
    mov     r1, rINST, lsr #12          @ r1<- B from 15:12
    ubfx    r0, rINST, #8, #4           @ r0<- A from 11:8
    FETCH_ADVANCE_INST 1                @ advance rPC, load rINST
    GET_VREG r2, r1                     @ r2<- fp[B]
    GET_INST_OPCODE ip                  @ ip<- opcode from rINST
    .if 0
    SET_VREG_OBJECT r2, r0              @ fp[A]<- r2
    .else
    SET_VREG r2, r0                     @ fp[A]<- r2
    .endif
    GOTO_OPCODE ip                      @ execute next instruction

/* ------------------------------ */

Now we have a more clear picture of the execution flow: ExecuteMterpImpl is the entry of one interpret unit, method in this case. It would use GOTO_OPCODE macro to query and execute the first op code. At the end of each op execution handler, it would fetch next op code, query and starts another op execution. The query procedure is documented in README.txt under mterp folder, in short, every op handler's entry point is located at the base of handler table + (opcode * 128), that is, shift left 7 bits. This approach create a linear, stream execution and of course, contains nearly no branch conditions. By the way, although it's not common to see, there are actually lots of places in AOSP use hand-written assembly to boost the execution performance, for example, the memory allocator for ART.

ART's interpreter try really hard to avoid branch execution and pipeline stalling. Although this techniques had already appeared in Dalvik, I think this sort of interpreter design is probably the fastest at this time.


2016年3月9日 星期三

JIT In Android ART

Interestingly, although Android claim that it's latest Runtime(ART) adopts Ahead-Of-Time(AOT) compiling, a jit folder was silently shipped into art/compiler folder within AOSP around the early era of Marshmallow.

Fact is that: the installation procedure takes a long time on some of the devices running on ART. E.g. Facebook sometimes takes 2 minutes to install! Perhaps that's the reason why Android want to move back to JIT.

Projects usually use interpreter along with JIT engine. That is, interpreting the code or byte code first and collecting the profile information including how often a method is executed, aka. how "hot" the method is, and type information if you're working with dynamic type language. After several turns, if a method is "hot" enough, the execution engine would use the JIT compiler to compile the code into native code and delegate the execution to the native compiled method in every invoking of that method afterward. E.g. Dalvik VM's JIT engine.

Nevertheless, there are also some projects don't use interpreter, but instead using an extremely fast compiler to compile each method executes next ahead before using another optimizing compiler to do more optimized compiling on those "hot" methods. E.g. Google V8 javascript engine.

The latter approach is usually faster, but to my surprise, the new ART JIT adopts the first, the interpreter combo.

The great journey of ART's JIT starts from art/runtime/jit/jit_instrumentation.cc. Instrumentation in ART acts like a listener listens for various of interpreting or compilation events. E.g. methods invoking, branches and OSR(On Stack Replacement). JitInstrumentationCache::CreateThreadPool() adds the JitInstrumentationListener instance to the runtime instrumentation set.

JitInstrumentationListener listens to three events: method entered(JitInstrumentationListener::MethodEntered()), branches(JitInstrumentationListener::Branch()) and virtual or interface method invoking(JitInstrumentationListener::InvokeVirtualOrInterface()). The compilation triggers, instrumentation_cache_->AddSamples(...), reside within method entered and branches callbacks

JitInstrumentationCache::AddSamples() shows that ART JIT uses a slightly modified counter approach to profile execution flows. Usually, JIT compiler simply set a counter threshold and trigger compilation task after exceeding that value. But there seems to be THREE counter thresholds in this case: warm_method_threshold_, hot_method_threshold_ and osr_method_threshold_. Constructing a JIT system with more levels. The values are passed from the JVM arguments(JVM is an interface, not an unique instance, ART is one of the implementations) but I can't find those arguments at this time. But from the code arrangement we can inferred that warm_method_threshold < hot_method_threshold < osr_method_threshold. I'm also wondering how osr_method_threshold woks.

If one of the thresholds is reached, it would arrange a JitCompileTask. The following flow is pretty interesting:  Jit::CompileMethod() would be invoked, but Jit::Compile() is actually a stub of jit_compile_method(). What's special about jit_compile_method()? It's a C symbol loaded from dynamic library libart-compiler.so.  libart-compiler.so has nothing special, it's source files live side by side with source files mentioned above, I think modularization is the main reason why they adopt this kind of ad-hoc approach.

After going into jit_compile_method()OptimizingCompiler::TryCompile() would be called. Few months ago, there are two compilation levels in TryCompile(): CompileBaseline and CompileOptimized. But now, those levels is replaced by a neater, single level  approach:
  // Try compiling a method and return the code generator used for
  // compiling it.
  // This method:
  // 1) Builds the graph. Returns null if it failed to build it.
  // 2) Transforms the graph to SSA. Returns null if it failed.
  // 3) Runs optimizations on the graph, including register allocator.
  // 4) Generates code with the `code_allocator` provided.
  CodeGenerator* TryCompile(ArenaAllocator* arena,
                            CodeVectorAllocator* code_allocator,
                            const DexFile::CodeItem* code_item,
                            uint32_t access_flags,
                            InvokeType invoke_type,
                            uint16_t class_def_idx,
                            uint32_t method_idx,
                            jobject class_loader,
                            const DexFile& dex_file,
                            Handle<mirror::DexCache> dex_cache,
                            bool osr) const;

The comments had explained almost everything. The graph is an instance of HGraph class, which is easy to perform various of compiler optimizations. ART JIT use a method-based JIT compiler in contrast with the old DalvikVM JIT, which use trace-based compiler and switch to method-based compiler only under device charging.

In summary, JIT in ART doesn't seem to use any special techniques, so in my opinion, the key of performance falls on the interpreter, I would take some time researching on that part.