Previously, we discussed the new JIT execution engine that debuted in ART with the Marshmallow codebase. At the end of that article, I argued that the design of the interpreter matters even more than the JIT compiler, since the interpreter is often the performance bottleneck. So ART's interpreter is exactly what we're going to talk about in this article.
Over the decades, people working on interpreters have reached one conclusion: interpretation will never be faster than executing compiled code. ART's interpreter is no exception. But don't be too sad: it's just a tradeoff between performance and other factors such as portability. There is no silver bullet at this time. One of the notorious factors degrading interpreter performance is CPU pipeline stalling. (Feel free to skip the following two sections if you're already familiar with pipeline stalling.)
A CPU's pipeline needs to stay as full as possible to reach peak performance and maximum instruction-level parallelism. Branch operations (e.g. if, switch) interrupt this model: until the condition expression (e.g. the expression between the parentheses of if(...)) has been evaluated, the CPU doesn't know which branch it will take, so it can't fill the pipeline with the right instructions. Thus, the CPU tries to "guess" which branch will execute and loads the instructions of that branch ahead of the evaluation of the condition. The rationale is that if the CPU guesses wrong, the cost of flushing and reloading the correct instructions is roughly the same as waiting for the condition to be evaluated (i.e. "stalling" the pipeline), while a correct guess costs nothing. This technique is called branch prediction.
However, a few things make branch prediction even harder: indirect addressing and a large number of conditional branches, to name two. Indirect addressing appears in all kinds of applications; one example is the indirect function call, where the address of the invoked function is stored in a pointer variable (see the short example after this paragraph). Prediction of indirect branches has improved considerably in modern CPUs. What interests us here is the second issue: a large number of conditional branches. Most branch prediction techniques use history-based profiling: in short, they record how often each branch is taken and predict based on those statistics. Too many conditional branches, for example a large switch block, therefore may not predict well, because the branch history storage is limited and a large number of branches will frequently evict its entries.
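As a quick aside, here is a trivial C++ illustration (standalone toy code, unrelated to ART) of such an indirect call: the same call site can jump to different targets, so the CPU has to predict a whole target address rather than a simple taken/not-taken outcome.

#include <cstdio>

int Square(int x) { return x * x; }
int Twice(int x)  { return x + x; }

int main() {
  int (*fn)(int) = Square;     // target address stored in a pointer variable
  std::printf("%d\n", fn(3));  // indirect call, prints 9
  fn = Twice;                  // same call site will now jump elsewhere
  std::printf("%d\n", fn(3));  // indirect call, prints 6
}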
Given the above, an interpreter that uses a large switch to decode opcodes is not very efficient. However, the switch implementation is simpler to build and leaves more room for higher-level work than the assembly approach introduced later, so many famous projects still use it. For example, Firefox's JavaScript engine, SpiderMonkey, used this kind of technique in its interpreter a few years ago. ART's interpreter also uses a switch implementation, as a fallback, which we'll see in the following sections; a minimal sketch of the idea follows.
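To make this concrete, below is a minimal sketch of a switch-based interpreter in C++. The three-opcode bytecode is made up purely for illustration (it is not dex): every instruction funnels through the same switch, one data-dependent dispatch branch that is hard to predict when opcodes arrive in an irregular order.

#include <cstdint>
#include <cstdio>

enum Op : uint8_t { OP_HALT, OP_ADD, OP_SUB };

int64_t Run(const uint8_t* pc, int64_t acc) {
  for (;;) {
    switch (*pc) {  // the single, highly unpredictable dispatch branch
      case OP_ADD: acc += pc[1]; pc += 2; break;
      case OP_SUB: acc -= pc[1]; pc += 2; break;
      case OP_HALT: return acc;
    }
  }
}

int main() {
  const uint8_t program[] = {OP_ADD, 40, OP_SUB, 3, OP_ADD, 5, OP_HALT};
  std::printf("%lld\n", (long long)Run(program, 0));  // prints 42
}

A compiler may lower the switch into a jump table, but the dispatch still happens at one single point, so the predictor has very little context to work with.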
So let's start our journey into ART's interpreter from Execute() in runtime/interpreter/interpreter.cc. First, if we're at the beginning of the method, it checks whether a JIT-compiled version of the method exists.
if (LIKELY(shadow_frame.GetDexPC() == 0)) {  // Entering the method, but not via deoptimization.
  if (kIsDebugBuild) {
    self->AssertNoPendingException();
  }
  instrumentation::Instrumentation* instrumentation = Runtime::Current()->GetInstrumentation();
  ArtMethod *method = shadow_frame.GetMethod();
  if (UNLIKELY(instrumentation->HasMethodEntryListeners())) {
    instrumentation->MethodEnterEvent(self, shadow_frame.GetThisObject(code_item->ins_size_),
                                      method, 0);
  }
  jit::Jit* jit = Runtime::Current()->GetJit();
  if (jit != nullptr && jit->CanInvokeCompiledCode(method)) {
    JValue result;
    // Pop the shadow frame before calling into compiled code.
    self->PopShadowFrame();
    ArtInterpreterToCompiledCodeBridge(self, code_item, &shadow_frame, &result);
    // Push the shadow frame back as the caller will expect it.
    self->PushShadowFrame(&shadow_frame);
    return result;
  }
}
Then it selects the proper interpreter implementation.
if (kInterpreterImplKind == kMterpImplKind) {
  if (transaction_active) {
    // No Mterp variant - just use the switch interpreter.
    return ExecuteSwitchImpl<false, true>(self, code_item, shadow_frame, result_register,
                                          false);
  } else if (UNLIKELY(!Runtime::Current()->IsStarted())) {
    return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                           false);
  } else {
    while (true) {
      // Mterp does not support all instrumentation/debugging.
      if (MterpShouldSwitchInterpreters()) {
        return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                               false);
      }
      bool returned = ExecuteMterpImpl(self, code_item, &shadow_frame, &result_register);
      if (returned) {
        return result_register;
      } else {
        // Mterp didn't like that instruction. Single-step it with the reference interpreter.
        result_register = ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame,
                                                          result_register, true);
        if (shadow_frame.GetDexPC() == DexFile::kDexNoIndex) {
          // Single-stepped a return or an exception not handled locally. Return to caller.
          return result_register;
        }
      }
    }
  }
} else if (kInterpreterImplKind == kSwitchImplKind) {
The target interpreter we're going to study is called Mterp, which probably stands for "Macro inTerpreter". It uses hand-written assembly code to decode opcodes and dispatch their handlers efficiently. That may sound novel, but it is actually exactly how Dalvik interpreted dex bytecode; it just took some time to port Dalvik's interpreter over to ART.
In addition to the kInterpreterImplKind variable, there are other conditions that make the interpreter fall back to the older switch (or goto) implementation. The first is the MterpShouldSwitchInterpreters() check in the snippet above, which is really more of a debugging aid. In runtime/interpreter/mterp/mterp.cc:
extern "C" bool MterpShouldSwitchInterpreters()
    SHARED_REQUIRES(Locks::mutator_lock_) {
  const instrumentation::Instrumentation* const instrumentation =
      Runtime::Current()->GetInstrumentation();
  return instrumentation->NonJitProfilingActive() || Dbg::IsDebuggerActive();
}
The main entry point of Mterp is ExecuteMterpImpl(). We'll postpone analyzing that function and look at the next few lines of code first. The if statement that takes the return value of ExecuteMterpImpl() as its condition ensures that if Mterp can't handle an instruction, there is a fallback: as the comment says, Mterp does not support all instrumentation and debugging. But why is there an infinite while loop around this code? It turns out that ExecuteMterpImpl() executes the whole method, whereas ExecuteSwitchImpl(), when its last argument is true, executes only one instruction at a time ("single-stepping"). So after the switch interpreter single-steps past the problematic instruction, the loop gives Mterp another chance to run the rest of the method.
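Distilled into a runnable toy (hypothetical opcodes and helper names, not ART code), the pattern looks like this: the fast interpreter runs until it either finishes the method or bails on an instruction it doesn't support, the reference interpreter single-steps that one instruction, and control then returns to the fast path.

#include <cstdint>
#include <cstdio>

enum Op : uint8_t { OP_HALT, OP_ADD, OP_TRACE };  // OP_TRACE: fast path bails

struct Frame { const uint8_t* pc; int64_t acc; };

// Reference interpreter: supports everything, one instruction at a time.
// Returns false when the method is done (OP_HALT).
static bool ReferenceStep(Frame& f) {
  switch (*f.pc) {
    case OP_ADD:   f.acc += f.pc[1]; f.pc += 2; return true;
    case OP_TRACE: std::printf("acc=%lld\n", (long long)f.acc); f.pc += 1; return true;
    default:       return false;  // OP_HALT
  }
}

// Fast interpreter: runs whole stretches of the method, but bails out
// (returns false) on any instruction it doesn't support, like Mterp.
static bool FastRun(Frame& f) {
  for (;;) {
    switch (*f.pc) {
      case OP_ADD:  f.acc += f.pc[1]; f.pc += 2; break;
      case OP_HALT: return true;   // ran through to the end of the method
      default:      return false;  // bail to the reference interpreter
    }
  }
}

int main() {
  const uint8_t program[] = {OP_ADD, 40, OP_TRACE, OP_ADD, 2, OP_HALT};
  Frame f{program, 0};
  for (;;) {
    if (FastRun(f)) break;         // fast path finished the method
    if (!ReferenceStep(f)) break;  // single-step the unsupported instruction
  }
  std::printf("result=%lld\n", (long long)f.acc);  // result=42
}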
So, where is ExecuteMterpImpl()? And how does it use assembly code to speed up interpretation?
Assembly code is by nature architecture dependent, so this part of the code lives in separate per-architecture folders under runtime/interpreter/mterp. Let's take arm as an example. Under the arm folder there are many files whose names start with "op_", each holding the handler for a single instruction (many contain only one line). These scattered files are stitched together into a single assembly file, mterp_arm.S, and generated into the out folder by a script, rebuild.sh. There are also files whose names start with "config-" that let you configure this code generation process. Finally, in mterp_arm.S, we find our beloved ExecuteMterpImpl.
    .text
    .align  2
    .global ExecuteMterpImpl
    .type   ExecuteMterpImpl, %function

/*
 * On entry:
 *  r0  Thread* self/
 *  r1  code_item
 *  r2  ShadowFrame
 *  r3  JValue* result_register
 *
 */

ExecuteMterpImpl:
    .fnstart
    .save {r4-r10,fp,lr}
    stmfd   sp!, {r4-r10,fp,lr}                  @ save 9 regs
    .pad    #4
    sub     sp, sp, #4                           @ align 64

    /* Remember the return register */
    str     r3, [r2, #SHADOWFRAME_RESULT_REGISTER_OFFSET]

    /* Remember the code_item */
    str     r1, [r2, #SHADOWFRAME_CODE_ITEM_OFFSET]

    /* set up "named" registers */
    mov     rSELF, r0
    ldr     r0, [r2, #SHADOWFRAME_NUMBER_OF_VREGS_OFFSET]
    add     rFP, r2, #SHADOWFRAME_VREGS_OFFSET   @ point to vregs.
    VREG_INDEX_TO_ADDR rREFS, r0                 @ point to reference array in shadow frame
    ldr     r0, [r2, #SHADOWFRAME_DEX_PC_OFFSET] @ Get starting dex_pc.
    add     rPC, r1, #CODEITEM_INSNS_OFFSET      @ Point to base of insns[]
    add     rPC, rPC, r0, lsl #1                 @ Create direct pointer to 1st dex opcode
    EXPORT_PC

    /* Starting ibase */
    ldr     rIBASE, [rSELF, #THREAD_CURRENT_IBASE_OFFSET]

    /* start executing the instruction at rPC */
    FETCH_INST                                   @ load rINST from rPC
    GET_INST_OPCODE ip                           @ extract opcode from rINST
    GOTO_OPCODE ip                               @ jump to next instruction
    /* NOTE: no fallthrough */
The above is just part of the code. Let's take a look at one of the operations, op_move.
/* ------------------------------ */
    .balign 128
.L_op_move: /* 0x01 */
/* File: arm/op_move.S */
    /* for move, move-object, long-to-int */
    /* op vA, vB */
    mov     r1, rINST, lsr #12          @ r1<- B from 15:12
    ubfx    r0, rINST, #8, #4           @ r0<- A from 11:8
    FETCH_ADVANCE_INST 1                @ advance rPC, load rINST
    GET_VREG r2, r1                     @ r2<- fp[B]
    GET_INST_OPCODE ip                  @ ip<- opcode from rINST
    .if 0
    SET_VREG_OBJECT r2, r0              @ fp[A]<- r2
    .else
    SET_VREG r2, r0                     @ fp[A]<- r2
    .endif
    GOTO_OPCODE ip                      @ execute next instruction
/* ------------------------------ */
Now we have a clearer picture of the execution flow: ExecuteMterpImpl is the entry point of one interpretation unit, a method in this case. It uses the GOTO_OPCODE macro to look up and jump to the first opcode handler. At the end of each handler, it fetches the next opcode, looks it up, and starts the next handler. The lookup procedure is documented in README.txt under the mterp folder; in short, every opcode handler's entry point is located at the base of the handler table plus (opcode * 128), i.e. the opcode shifted left by 7 bits. That's why each handler above is aligned with .balign 128: op_move, opcode 0x01, lives exactly at base + 0x80. This creates a linear, streaming style of execution that contains almost no conditional branches. By the way, although it's not often noticed, there are actually many places in AOSP that use hand-written assembly to boost performance, for example the memory allocator for ART.
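If you'd like to play with this dispatch style yourself, the GNU "labels as values" extension (supported by GCC and Clang) lets you write a threaded interpreter in C++; the bytecode and names below are made up for illustration. Real Mterp goes one step further: since its handlers are laid out exactly 128 bytes apart, it computes base + (opcode << 7) directly and doesn't even need the table load.

#include <cstdint>
#include <cstdio>

int64_t Run(const uint8_t* pc, int64_t acc) {
  // One entry per opcode; dispatch is a computed goto through this table.
  // (&&label and goto* are GNU extensions, not standard C++.)
  void* const kHandlers[] = { &&op_halt, &&op_add, &&op_sub };

  goto *kHandlers[*pc];  // dispatch the first instruction

op_add:
  acc += pc[1];
  pc += 2;
  goto *kHandlers[*pc];  // fetch/decode/dispatch inlined at the handler's end
op_sub:
  acc -= pc[1];
  pc += 2;
  goto *kHandlers[*pc];
op_halt:
  return acc;
}

int main() {
  const uint8_t program[] = {1, 40, 2, 3, 1, 5, 0};  // add 40, sub 3, add 5, halt
  std::printf("%lld\n", (long long)Run(program, 0));  // prints 42
}

Because every handler ends with its own dispatch, the indirect branches are spread across many sites, giving the CPU's indirect branch predictor per-handler history instead of one overloaded dispatch point.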
ART's interpreter tries really hard to avoid branch mispredictions and pipeline stalls. Although this technique already appeared in Dalvik, I think this style of interpreter design is probably the fastest available at this time.