2016年6月14日 星期二

Facebook Android SDK in 10 Minutes - Login Part

I appreciate Facebook for providing a fancy Android SDK, prevent us from touching the messy OAuth2 authorization flow. However, the official documentation page is not fancy at all. Fragmented code snippets and unclear description statements just make beginners even more confused. So here is some quick start for beginners.
(From official document page)"If the access token is available already assign it" ??!

Scenario

Due to the internal design of SDK, integration of Facebook Login may depend on you app structure. In this article, I'm going to show one of the scenarios.
  1. User would see a splash Activity(an entry Activity) right after launching the app.
  2. The splash Activity would check the login information.
  3. If the user had login, jump to the main Activity, otherwise, jump to the login/register Activity. (Of course, the main Activity would use Facebook APIs).
This login flow is slightly different from that in official documentation page, which seems to leverage Fragment. Also, I simply use the LoginButton widget from SDK.
(Note: Facebook SDK version here is v4.12.1. I skip the setup steps which, unlike the development documents, is written pretty well in the official tutorial.)

Implementation

In the splash Activity(here we call "SplashActivity"), we have the following code snippet. 

public class SplashActivity extends AppCompatActivity {

    private Intent mLoginIntent;
    private Intent mMainIntent;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        mMainIntent = new Intent(this, MainActivity.class);
        mLoginIntent = new Intent(this, LoginActivity.class);

        FacebookSdk.sdkInitialize(getApplicationContext(), new FacebookSdk.InitializeCallback() {
             @Override
             public void onInitialized() {
                 checkFbLogin(AccessToken.getCurrentAccessToken());
             }
        });
        AppEventsLogger.activateApp(getApplication());

    }

    private void checkFbLogin(AccessToken currentToken){
        if(currentToken != null){
            startActivity(mMainIntent);
        }else{
            startActivity(mLoginIntent);
        }
        finish();
    }
}

FacebookSdk.sdkInitialize() must be called as early as you can. SplashActivity as previous mentioned, need to check the login information. Lots of people had complained that AccessToken.getCurrentAccessToken() didn't behave like the document said, which should return null if use hasn't login. The key is that the real "token loader" is FacebookSdk.sdkInitialize(), which is asynchronous, so you need to wait a little bit for the SDK initialization. Nevertheless, one should leverage the "on complete callback" of FacebookSdk.sdkInitialize() instead of pausing for a while by Thread.sleep() or Handler. Because the proper pausing interval may vary from device to device.

Here is the LoginActivity.

@Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        if(!FacebookSdk.isInitialized()){
            FacebookSdk.sdkInitialize(getApplicationContext());
            AppEventsLogger.activateApp(getApplication());
        }

        setContentView(R.layout.activity_login);

        // Login button provided by SDK
        LoginButton butFbLogin = (LoginButton)findViewById(R.id.but_fb_login);
        assert butFbLogin != null;

        mLoginCallbackManager = CallbackManager.Factory.create();

        butFbLogin.setReadPermissions("email");
        butFbLogin.registerCallback(mLoginCallbackManager, new FacebookCallback<LoginResult>() {
            @Override
            public void onSuccess(LoginResult loginResult) {
                if(loginResult.getAccessToken() != null){
                    Intent intent = new Intent(LoginActivity.this, MainActivity.class);
                    intent.addFlags(Intent.FLAG_ACTIVITY_CLEAR_TASK | Intent.FLAG_ACTIVITY_NEW_TASK);
                    startActivity(intent);
                }else{
                    Log.e(TAG, "Returned FB access token null");
                }
            }

            @Override
            public void onCancel() {
                //TODO
            }

            @Override
            public void onError(FacebookException error) {
                error.printStackTrace();
                Toast.makeText(LoginActivity.this, "Error Login to Facebook", Toast.LENGTH_LONG).show();
            }
        });
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data){
        super.onActivityResult(requestCode, resultCode, data);
        mLoginCallbackManager.onActivityResult(requestCode, resultCode, data);
    }

The login page is rather straight forward. Few things need to be careful: Facebook.sdkInitialized() must be called before setContentView(). It seems that parts of XML of widgets provided by SDK is modified or generated dynamically. Also, if a user revoke permission for your app from her account, AccessToken.getCurrentAccessToken() would still return non-null result. So you may need extra care during API calls.

2016年3月21日 星期一

Ignition - The Interpreter in V8 Javascript Engine

Mentioned in my previous article, V8 started to use a new interpreter, Ignition, to replace baseline compiler on some low-memory devices. Let's check it out!

Interpreters are notorious for its performance bottleneck, and they're also proved to be slower than most of the JIT runtimes. I've explained some of the bottlenecks of interpreters and background knowledge in one of my blog post(section 2 and 3). So I think folks in Chromium team must spend lots of efforts on Ignition before bringing it rivaling performance with the old JIT solution.

There has already a document written by Chromium team about Ignition. To summarize its content, Ignition uses a streaming execution flow which executes one byte code handler after another. That is, a byte code handler function would recursively invoke the dispatcher at the end of its function body. Of course, tail call optimization is required to avoid smashing the stack. The following figure illustrates overview of the execution flow.
So let's start our code journey. Everything starts from (src/interpreter/interpreter.cc) Interpreter::Initialize(). We encounter a strange macro immediately: GENERATE_CODE(Name, ...).
#define GENERATE_CODE(Name, ...)                                        \
  {                                                                     \
    InterpreterAssembler assembler(isolate_, &zone, Bytecode::k##Name); \
    Do##Name(&assembler);                                               \
    Handle<Code> code = assembler.GenerateCode();                       \
    dispatch_table_[Bytecodes::ToByte(Bytecode::k##Name)] = *code;      \
    TraceCodegen(code);                                                 \
    LOG_CODE_EVENT(isolate_,                                            \
                   CodeCreateEvent(Logger::BYTECODE_HANDLER_TAG,        \
                                   AbstractCode::cast(*code), #Name));  \
  }
  BYTECODE_LIST(GENERATE_CODE)
#undef GENERATE_CODE

It is undefined right away, so we can inferred that it is only used by BYTECODE_LIST. As its name, BYTECODE_LIST declares all of the byte codes. And GENERATE_CODE here just help it defining their handlers and some dispatch information. The Do##Name(&assembler) statement seems to has something to do with byte code handlers, let's grep one of them and see what's going on.
// Mov <src> <dst>
//
// Stores the value of register <src> to register <dst>.
void Interpreter::DoMov(InterpreterAssembler* assembler) {
  Node* src_index = __ BytecodeOperandReg(0);
  Node* src_value = __ LoadRegister(src_index);
  Node* dst_index = __ BytecodeOperandReg(1);
  __ StoreRegister(src_value, dst_index);
  __ Dispatch();
}

This is the byte code handler for mov. It looks just like a normal byte code handler, right? Then have a closer look: Do##Name(&assembler) is a function call instead of a callback object we usually seen in many interpreters. In another word, Do##Name(&assembler) acts more like a routine responds for configuring byte code actions. After scrolling to the top of interpreter.cc, you can find that the inconspicuous "__" above turns out to be pretty important.
#define __ assembler->

InterpreterAssembler is respond for encapsulating a small group of assembly lines as a byte code handler, and also, abstracting the underlying platform to provide a uniform assembly code emitting interface here. What's more, we can witness more power on this kind of design in comparison with normal byte-code-handler-function fashions.

Next, we'll focus on the InterpreterAssembler::Dispatch() function. It will be delegated to several routines.
void InterpreterAssembler::Dispatch() {
  DispatchTo(Advance(Bytecodes::Size(bytecode_)));
}

void InterpreterAssembler::DispatchTo(Node* new_bytecode_offset) {
  Node* target_bytecode = Load(
      MachineType::Uint8(), BytecodeArrayTaggedPointer(), new_bytecode_offset);
  if (kPointerSize == 8) {
    target_bytecode = ChangeUint32ToUint64(target_bytecode);
  }

  // TODO(rmcilroy): Create a code target dispatch table to avoid conversion
  // from code object on every dispatch.
  Node* target_code_object =
      Load(MachineType::Pointer(), DispatchTableRawPointer(),
           WordShl(target_bytecode, IntPtrConstant(kPointerSizeLog2)));

  DispatchToBytecodeHandler(target_code_object, new_bytecode_offset);
}

void InterpreterAssembler::DispatchToBytecodeHandler(Node* handler,
                                                     Node* bytecode_offset) {
  if (FLAG_trace_ignition) {
    TraceBytecode(Runtime::kInterpreterTraceBytecodeExit);
  }

  InterpreterDispatchDescriptor descriptor(isolate());
  Node* args[] = {GetAccumulator(),          RegisterFileRawPointer(),
                  bytecode_offset,           BytecodeArrayTaggedPointer(),
                  DispatchTableRawPointer(), GetContext()};
  TailCall(descriptor, handler, args, 0);
}

The Advance() call in Dispatch() is clearly to be stepping to next byte code, and Load() call in DispatchTo() is responds for loading byte code object. Everything looks normal, except the TailCall() at the bottom of DispatchToBytecodeHandler().

As mentioned in the above sections, Ignition invokes byte code handlers
one after another in conjunction with dispatchers. Another key factor is that tail call optimization is required for this concatenation approach or stack overflow may occur. TailCall() in the above snippet reside in CodeStubAssembler class, which is also the parent class of InterpreterAssembler, it will then delegate the flow to RawMachineAssembler::TailCallN().
Node* RawMachineAssembler::TailCallN(CallDescriptor* desc, Node* function,
                                     Node** args) {
  int param_count =
      static_cast<int>(desc->GetMachineSignature()->parameter_count());
  int input_count = param_count + 1;
  Node** buffer = zone()->NewArray<Node*>(input_count);
  int index = 0;
  buffer[index++] = function;
  for (int i = 0; i < param_count; i++) {
    buffer[index++] = args[i];
  }
  Node* tail_call = MakeNode(common()->TailCall(desc), input_count, buffer);
  NodeProperties::MergeControlToEnd(graph(), common(), tail_call);
  schedule()->AddTailCall(CurrentBlock(), tail_call);
  current_block_ = nullptr;
  return tail_call;
}
There are many V8's abstraction types like Node and CallDescriptor in the above code, let's just not dig into those objects first. What we can infer is that a Node object seems to represent a piece of execution code and TailCallN() creates a special "tail call node" that is appended at the bottom of current execution block.

This approach sounds reasonable, and it also appeal the pros we mentioned before on the adoption of InterpreterAssembler rather than byte code handler functions. If we use byte code handler functions as handlers and also want to have the same cascading execution fashion, tail call optimization would all be depending on compiler. I'm not judging that modern compilers can't do this job well, but we want more explicit controls on the behavior of our interpreter. In contrast, the InterpreterAssembler abstracts handlers into code pieces which can be freely arranged in any place in the execution flow. So getting the same effect of tail call optimization is just a piece of code(cake)

Last but not the least, we want to know more about the Node class. It turn our that Node is just an interface representing a node in a graph. There are lots of implementation in the entire V8 codebase. Like AstGraph, BytecodeGraph and RawMachineAssembler in our case. 

There are two common traits in both Ignition and Android ART's interpreter mentioned in my previous article

  1. They call byte code handlers and dispatchers(byte code decoders) one after another without returning to the parent function or using loop statement to iterate though byte codes. 
  2. They directly emit assembly code as the main execution core for both performance and design concern(like reaching tail call optimization). 

In my opinion, both of the traits above represent the latest modern interpreter design. We can use them in many projects as long as they have similar interpreting execution flow. 

2016年3月16日 星期三

Interpreter in Android ART

Previously, we discussed about the new JIT execution engine in ART debuted in MarshMellow codebase. At the end of that article, I considered that the design of the interpreter is more important than the JIT compiler since interpreter is often the performance bottleneck. Thus, ART's interpreter is the very thing we're going to talk about in this article.

Over the decades, people who are working on interpreter has one conclusion: Interpreting would never be faster than compiled code execution. ART's interpreter  is not an exception either. But you don't need to be too sad: It's just a tradeoff among performance and other factors like cross-platform. There is no silver bullet at this time. One of the notorious key factors degrading the performance of interpreter is CPU's pipeline stalling. (Feel free to skip the following two sections if you're familiar with pipeline stalling)

CPU's pipeline needs to be as full as it can to reach peak performance and maximum instruction parallelism. But branch operations(e.g. if, switch) would interrupt this model, because before evaluating the condition expressions(e.g. expression between parentheses of if(...)), you can't fill any instructions into pipeline since CPU doesn't know which branch to go. Thus, CPU would try to "guess" which branch to execute and load instructions of that branch ahead before the evaluation of condition expressions. The reason is that if CPU "guess wrong", the cost of reloading correct instructions is same as the cost to wait for completeness of condition expressions(i.e. "stalling" of the pipeline). This techniques is called branch prediction.

However, there are a few things that would make branch prediction even harder: indirect addressing and large amount of condition branches to name a few. Indirect addressing is used in various of applications. For example, indirect function call, which the address of invoked function is stored in a pointer variable. Indirect addressing had been able to be improved in modern CPUs.  What we're interested is the second issue, large condition branches. Most of the branch prediction techniques use history-based profiling. In short, they  record the amount of appearance of each branch and perform prediction based on those statistics. Thus, too many condition branches, for example, large switch block, may not perform well in branch prediction since the branch history storage is limited, large amount of condition branches may frequently flush the storage.

According to the above explanation, interpreter which uses a large switch to decode op codes is not very efficient. However, the switch implementation is simpler to build and may have chance to do some higher level task in comparison with the assembly approach introduce later. Thus, many famous projects still use this approach. For example, Firefox's javascript engine, Spider Monkey, used this sort of techniques in their interpreter few years ago. ART's interpreter in our case also use switch implementation as a fallback which we would mention in the following sections.

So let's start our journey into ART's interpreter from Execute() in runtime/interpreter/interpreter.cc. First, it would check that whether there exists a JIT compiled version of this method if we're at the beginning of a it.
  if (LIKELY(shadow_frame.GetDexPC() == 0)) {  // Entering the method, but not via deoptimization.
    if (kIsDebugBuild) {
      self->AssertNoPendingException();
    }
    instrumentation::Instrumentation* instrumentation = Runtime::Current()->GetInstrumentation();
    ArtMethod *method = shadow_frame.GetMethod();

    if (UNLIKELY(instrumentation->HasMethodEntryListeners())) {
      instrumentation->MethodEnterEvent(self, shadow_frame.GetThisObject(code_item->ins_size_),
                                        method, 0);
    }

    jit::Jit* jit = Runtime::Current()->GetJit();
    if (jit != nullptr && jit->CanInvokeCompiledCode(method)) {
      JValue result;

      // Pop the shadow frame before calling into compiled code.
      self->PopShadowFrame();
      ArtInterpreterToCompiledCodeBridge(self, code_item, &shadow_frame, &result);
      // Push the shadow frame back as the caller will expect it.
      self->PushShadowFrame(&shadow_frame);

      return result;
    }
  }

Then, it would select the proper interpreter implementation.
    if (kInterpreterImplKind == kMterpImplKind) {
      if (transaction_active) {
        // No Mterp variant - just use the switch interpreter.
        return ExecuteSwitchImpl<false, true>(self, code_item, shadow_frame, result_register,
                                              false);
      } else if (UNLIKELY(!Runtime::Current()->IsStarted())) {
        return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                               false);
      } else {
        while (true) {
          // Mterp does not support all instrumentation/debugging.
          if (MterpShouldSwitchInterpreters()) {
            return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                                   false);
          }
          bool returned = ExecuteMterpImpl(self, code_item, &shadow_frame, &result_register);
          if (returned) {
            return result_register;
          } else {
            // Mterp didn't like that instruction.  Single-step it with the reference interpreter.
            result_register = ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame,
                                                               result_register, true);
            if (shadow_frame.GetDexPC() == DexFile::kDexNoIndex) {
              // Single-stepped a return or an exception not handled locally.  Return to caller.
              return result_register;
            }
          }
        }
      }
    } else if (kInterpreterImplKind == kSwitchImplKind) {

The target interpreter we're going to research is called Mterp, which probably stands for "Macro inTerpreter". It use hand written assembly code to decode and dispatch op code handling efficiently. It seems cool but actually this is the exact way how Dalvik interprets dex byte code, it just take some time to port Dalvik's interpreter to ART.

In addition to kInterpreterImplKind variable, there are other chances that would make interpreter fallback to old implementation like switch or goto. The first is MterpShouldSwitchInterpreters() in line 12 , but it is actually more like a debug sugar. In runtime/interpreter/mterp/mterp.cc:
extern "C" bool MterpShouldSwitchInterpreters()
    SHARED_REQUIRES(Locks::mutator_lock_) {
  const instrumentation::Instrumentation* const instrumentation =
      Runtime::Current()->GetInstrumentation();
  return instrumentation->NonJitProfilingActive() || Dbg::IsDebuggerActive();
}

The main entry point of Mterp is ExecuteMterpImpl(). We'll postpone analyzing that function and see the next few lines of code first. The if statement take return value of ExecuteMterpImpl() as condition expression ensures that if Mterp can't handle an instruction, there is a fallback solution since as the comment says, Mterp can't handle all of the instrumentation or debugging. But the question is: why there is a infinite while loop around these code? It turn out that ExecuteMterpImpl() would execute the whole method where ExecuteSwitchImpl() would only execute one instruction a time("single-stepped").

So, where is ExecuteMterpImpl() ? How does it use assembly code to boost the interpreting?

When it comes to assembly code, it means that this part of code is architecture dependent, which lives in separated folders under runtime/interpreter/mterp. Let's take arm for example. Under arm folder, there are lots of files which file name starts with "op_" contain only one line, indicating one instruction. The fact is that these scattered files need to be combined into a single assembly code: mterp_arm.S and generate into out folder by a script, rebuild.sh. There are also some files which names starts with "config-" allow one to configure this code generating process. Finally, in mterp_arm.S, we found our lovely ExecuteMterpImpl
.text
    .align  2
    .global ExecuteMterpImpl
    .type   ExecuteMterpImpl, %function

/*
 * On entry:
 *  r0  Thread* self/
 *  r1  code_item
 *  r2  ShadowFrame
 *  r3  JValue* result_register
 *
 */

ExecuteMterpImpl:
    .fnstart
    .save {r4-r10,fp,lr}
    stmfd   sp!, {r4-r10,fp,lr}         @ save 9 regs
    .pad    #4
    sub     sp, sp, #4                  @ align 64

    /* Remember the return register */
    str     r3, [r2, #SHADOWFRAME_RESULT_REGISTER_OFFSET]

    /* Remember the code_item */
    str     r1, [r2, #SHADOWFRAME_CODE_ITEM_OFFSET]

    /* set up "named" registers */
    mov     rSELF, r0
    ldr     r0, [r2, #SHADOWFRAME_NUMBER_OF_VREGS_OFFSET]
    add     rFP, r2, #SHADOWFRAME_VREGS_OFFSET     @ point to vregs.
    VREG_INDEX_TO_ADDR rREFS, r0 @ point to reference array in shadow frame
    ldr     r0, [r2, #SHADOWFRAME_DEX_PC_OFFSET]   @ Get starting dex_pc.
    add     rPC, r1, #CODEITEM_INSNS_OFFSET     @ Point to base of insns[]
    add     rPC, rPC, r0, lsl #1 @ Create direct pointer to 1st dex opcode
    EXPORT_PC

    /* Starting ibase */
    ldr     rIBASE, [rSELF, #THREAD_CURRENT_IBASE_OFFSET]

    /* start executing the instruction at rPC */
    FETCH_INST                          @ load rINST from rPC
    GET_INST_OPCODE ip                  @ extract opcode from rINST
    GOTO_OPCODE ip                      @ jump to next instruction
    /* NOTE: no fallthrough */

The above is just part of the code, and let's take a look at one of the operations, op_mov.
/* ------------------------------ */
    .balign 128
.L_op_move: /* 0x01 */
/* File: arm/op_move.S */
    /* for move, move-object, long-to-int */
    /* op vA, vB */
    mov     r1, rINST, lsr #12          @ r1<- B from 15:12
    ubfx    r0, rINST, #8, #4           @ r0<- A from 11:8
    FETCH_ADVANCE_INST 1                @ advance rPC, load rINST
    GET_VREG r2, r1                     @ r2<- fp[B]
    GET_INST_OPCODE ip                  @ ip<- opcode from rINST
    .if 0
    SET_VREG_OBJECT r2, r0              @ fp[A]<- r2
    .else
    SET_VREG r2, r0                     @ fp[A]<- r2
    .endif
    GOTO_OPCODE ip                      @ execute next instruction

/* ------------------------------ */

Now we have a more clear picture of the execution flow: ExecuteMterpImpl is the entry of one interpret unit, method in this case. It would use GOTO_OPCODE macro to query and execute the first op code. At the end of each op execution handler, it would fetch next op code, query and starts another op execution. The query procedure is documented in README.txt under mterp folder, in short, every op handler's entry point is located at the base of handler table + (opcode * 128), that is, shift left 7 bits. This approach create a linear, stream execution and of course, contains nearly no branch conditions. By the way, although it's not common to see, there are actually lots of places in AOSP use hand-written assembly to boost the execution performance, for example, the memory allocator for ART.

ART's interpreter try really hard to avoid branch execution and pipeline stalling. Although this techniques had already appeared in Dalvik, I think this sort of interpreter design is probably the fastest at this time.


2016年3月9日 星期三

JIT In Android ART

Interestingly, although Android claim that it's latest Runtime(ART) adopts Ahead-Of-Time(AOT) compiling, a jit folder was silently shipped into art/compiler folder within AOSP around the early era of Marshmallow.

Fact is that: the installation procedure takes a long time on some of the devices running on ART. E.g. Facebook sometimes takes 2 minutes to install! Perhaps that's the reason why Android want to move back to JIT.

Projects usually use interpreter along with JIT engine. That is, interpreting the code or byte code first and collecting the profile information including how often a method is executed, aka. how "hot" the method is, and type information if you're working with dynamic type language. After several turns, if a method is "hot" enough, the execution engine would use the JIT compiler to compile the code into native code and delegate the execution to the native compiled method in every invoking of that method afterward. E.g. Dalvik VM's JIT engine.

Nevertheless, there are also some projects don't use interpreter, but instead using an extremely fast compiler to compile each method executes next ahead before using another optimizing compiler to do more optimized compiling on those "hot" methods. E.g. Google V8 javascript engine.

The latter approach is usually faster, but to my surprise, the new ART JIT adopts the first, the interpreter combo.

The great journey of ART's JIT starts from art/runtime/jit/jit_instrumentation.cc. Instrumentation in ART acts like a listener listens for various of interpreting or compilation events. E.g. methods invoking, branches and OSR(On Stack Replacement). JitInstrumentationCache::CreateThreadPool() adds the JitInstrumentationListener instance to the runtime instrumentation set.

JitInstrumentationListener listens to three events: method entered(JitInstrumentationListener::MethodEntered()), branches(JitInstrumentationListener::Branch()) and virtual or interface method invoking(JitInstrumentationListener::InvokeVirtualOrInterface()). The compilation triggers, instrumentation_cache_->AddSamples(...), reside within method entered and branches callbacks

JitInstrumentationCache::AddSamples() shows that ART JIT uses a slightly modified counter approach to profile execution flows. Usually, JIT compiler simply set a counter threshold and trigger compilation task after exceeding that value. But there seems to be THREE counter thresholds in this case: warm_method_threshold_, hot_method_threshold_ and osr_method_threshold_. Constructing a JIT system with more levels. The values are passed from the JVM arguments(JVM is an interface, not an unique instance, ART is one of the implementations) but I can't find those arguments at this time. But from the code arrangement we can inferred that warm_method_threshold < hot_method_threshold < osr_method_threshold. I'm also wondering how osr_method_threshold woks.

If one of the thresholds is reached, it would arrange a JitCompileTask. The following flow is pretty interesting:  Jit::CompileMethod() would be invoked, but Jit::Compile() is actually a stub of jit_compile_method(). What's special about jit_compile_method()? It's a C symbol loaded from dynamic library libart-compiler.so.  libart-compiler.so has nothing special, it's source files live side by side with source files mentioned above, I think modularization is the main reason why they adopt this kind of ad-hoc approach.

After going into jit_compile_method()OptimizingCompiler::TryCompile() would be called. Few months ago, there are two compilation levels in TryCompile(): CompileBaseline and CompileOptimized. But now, those levels is replaced by a neater, single level  approach:
  // Try compiling a method and return the code generator used for
  // compiling it.
  // This method:
  // 1) Builds the graph. Returns null if it failed to build it.
  // 2) Transforms the graph to SSA. Returns null if it failed.
  // 3) Runs optimizations on the graph, including register allocator.
  // 4) Generates code with the `code_allocator` provided.
  CodeGenerator* TryCompile(ArenaAllocator* arena,
                            CodeVectorAllocator* code_allocator,
                            const DexFile::CodeItem* code_item,
                            uint32_t access_flags,
                            InvokeType invoke_type,
                            uint16_t class_def_idx,
                            uint32_t method_idx,
                            jobject class_loader,
                            const DexFile& dex_file,
                            Handle<mirror::DexCache> dex_cache,
                            bool osr) const;

The comments had explained almost everything. The graph is an instance of HGraph class, which is easy to perform various of compiler optimizations. ART JIT use a method-based JIT compiler in contrast with the old DalvikVM JIT, which use trace-based compiler and switch to method-based compiler only under device charging.

In summary, JIT in ART doesn't seem to use any special techniques, so in my opinion, the key of performance falls on the interpreter, I would take some time researching on that part.

2016年1月11日 星期一

[Quick Note] V8 Javascript Engine's First Stage Compiler

V8 is properly, in my opinion, the fastest javascript runtime at the time(No offense SpiderMonkey, although benchmarks differ from one to another, V8 got the best grade in average). There are some well known properties that make it run so fast:

  • No interpreter, use baseline compiler instead. But this characteristic has just been subverted on the bleeding edge of the development tree. 
  • No intermediate representation(IR) in baseline compiler. But optimization(second stage) compiler do use its own IR.
  • Use method-based JIT strategy. This is fairly controversial, but as we can see later, use function as its compilation scope makes lots of works easier in comparison with trace-based strategy, which spends lots of efforts on determining trace boundary.
  • Hidden class. In short, if a variable modifies its object layout, adding a new field for example, v8 create a new layout(class) instead of changing the origin object layout(e.g. Using linked list to "chain" object fields, which appears in many implementation of early era javascript runtime). 
  • Code patch and inline cache. These two feature are less novel since they have already beed used in lots of dynamic type runtime for over twenty years.
I'm going to talk about the baseline compiler. The code journey starts from Script::Compile API which as the name suggest, compiles the script. V8 has a fabulous characteristic that it can be used standalone, that's the very key which bears Node.js. You can checkout V8's embedder guide for related resource.

After several asserting checks, the main compilation flow comes to Compiler::CompileScript(compiler.cc:1427). One of the important things done here is checking code cache. (compiler.cc:1465)

    // First check per-isolate compilation cache.
    maybe_result = compilation_cache->LookupScript(
        source, script_name, line_offset, column_offset, resource_options,
        context, language_mode);
    if (maybe_result.is_null() && FLAG_serialize_toplevel &&
        compile_options == ScriptCompiler::kConsumeCodeCache &&
        !isolate->debug()->is_loaded()) {
      // Then check cached code provided by embedder.
      HistogramTimerScope timer(isolate->counters()->compile_deserialize());
      Handle<SharedFunctionInfo> result;
      if (CodeSerializer::Deserialize(isolate, *cached_data, source)
              .ToHandle(&result)) {
        // Promote to per-isolate compilation cache.
        compilation_cache->PutScript(source, context, language_mode, result);
        return result;
      }
      // Deserializer failed. Fall through to compile.
    }

If there are previous compiled code(or snapshot in v8's term), it would deserialize them from disk. Otherwise, the real compilation procedure would be triggered, namely, CompileToplevel (compiler.cc:1242). In this function, parsing information and compilation information would be setup and passed to CompileBaselineCode (compiler.cc:817). Let's ignore the Compiler::Analyze first and jump to GenerateBaselineCode.
Here comes some interesting things: (compiler.cc:808)
static bool GenerateBaselineCode(CompilationInfo* info) {
  if (FLAG_ignition && UseIgnition(info)) {
    return interpreter::Interpreter::MakeBytecode(info);
  } else {
    return FullCodeGenerator::MakeCode(info);
  }
}

Previous says that one of the advantages V8 has is entering compilation process directly without interpreting ahead. So why there are codes related to interpreter ? It turns out that V8 is also going to have a interpreter! Here is a brief introduction about Ignition, the new interpreter engine. But let's just stare on the compilation part first.

So now we comes to FullCodeGenerator::MakeCode (full-codegen/full-codegen.cc:26). There are a few important things here where we'll come back to, for now, let's focus on FullCodeGenerator::Generate .
Now here's the exciting part. FullCodeGenerator::Generate is a architecture-specific function, which residents in source code files in separated folders like arm, x64 .etc. Let's pick few lines of code from x64 version:
      if (locals_count >= 128) {
        Label ok;
        __ movp(rcx, rsp);
        __ subp(rcx, Immediate(locals_count * kPointerSize));
        __ CompareRoot(rcx, Heap::kRealStackLimitRootIndex);
        __ j(above_equal, &ok, Label::kNear);
        __ CallRuntime(Runtime::kThrowStackOverflow);
        __ bind(&ok);
      }

Pretty familiar right? It seems that it directly generates the assembly code!
Normally this sort of approach would appears in education project like homework in college compiler courses. Mature compiler framework often has lots of procedures on generating native codes. Directly generate native assembly is thus one of the most important factors that speed up V8.
Another worth mentioned thing is the "__" at the prefix of each several lines above. It's a macro defined in full-codegen/full-codegen.cc:
#define __ ACCESS_MASM(masm())
Here MASM does not means Microsoft's MASM where the first 'M' character of the latter one stands for Microsoft. Our MASM stands for macro assembler, which is responded for assembly code generating.

Back to FullCodeGenerator::Generate. VisitStatements seems to be the code emitter for js function body. Now we stop the code tracing here and step back to see the declaration of FullCodeGenerator.(full-codegen/full-codegen.h) 

class FullCodeGenerator: public AstVisitor
It turns out that the code generator itself is a huge AST visitor! And VisitStatements is the top level visitor function that would dispatch different part of code to visitors like FullCodeGenerator::VisitSwitchStatementFullCodeGenerator::VisitForInStatement or FullCodeGenerator::VisitAssignment to name a few.

Last but not the least, some people say Ignition, the interpreter mentioned before, is not like SpiderMonkey or JavascriptCore(Safari). I'm really curious about that, maybe I would wrote another article about that : )


2016年1月7日 星期四

[Quick Note] Type instances in LLVM IR

The story starts while I was writing a LLVM module pass few days ago.
I was trying to add null pointer initializer for a global variable, the ConstantPointerNull class need a PointerType as parameter, but I accidentally assigned a wrong address space (we're target on CL language) while creating PointerType instance. Then I found the error message pretty interesting:
opt: /my/llvm/path/lib/IR/Globals.cpp:209: void llvm::GlobalVariable::setInitializer(llvm::Constant*): Assertion `InitVal->getType() == getType()->getElementType() && "Initializer type must match GlobalVariable type"' failed.
The question is: Why comparing two pointers?  Because when we're talking about representation for Type, the straightforward way would be a simple Type(or its derived class) class instance. So why it compares two pointers instead of instances?

The first possibility came to my mind was operator overloading on "==" for Type*. But sadly, there isn't such overloading for Type and its derived classes. Finally, it turns out that LLVM "caches" all of the Type instances in the LLVMContext class(actually LLVMContextImpl, which is not exported). One of the evidences is the static factory methods for creating Type instances, for example, getInt8PtrTy:
static PointerType * getInt8PtrTy (LLVMContext &C, unsigned AS=0)
We need to pass a LLVMContext instance as a parameter. So when creating Type instances, it would grab from LLVMContext instead of creating a new one. For example, the FunctionType factory method(/lib/IR/Type.cpp: 360~379):
 
// FunctionType::get - The factory function for the FunctionType class.
FunctionType *FunctionType::get(Type *ReturnType,
                                ArrayRef<Type*> Params, bool isVarArg) {
  LLVMContextImpl *pImpl = ReturnType->getContext().pImpl;
  FunctionTypeKeyInfo::KeyTy Key(ReturnType, Params, isVarArg);
  auto I = pImpl->FunctionTypes.find_as(Key);
  FunctionType *FT;

  if (I == pImpl->FunctionTypes.end()) {
    FT = (FunctionType*) pImpl->TypeAllocator.
      Allocate(sizeof(FunctionType) + sizeof(Type*) * (Params.size() + 1),
               AlignOf<FunctionType>::Alignment);
    new (FT) FunctionType(ReturnType, Params, isVarArg);
    pImpl->FunctionTypes.insert(FT);
  } else {
    FT = *I;
  }

  return FT;
}
It first get the LLVMContextImpl instance, pImpl, and retrieves the storage for FunctionType, FunctionTypes, which declaration is here(/lib/IR/LLVMContextImpl.h: 1003~1004):
 
typedef DenseSet<FunctionType *, FunctionTypeKeyInfo> FunctionTypeSet;
FunctionTypeSet FunctionTypes;
I'm not pretty sure why LLVM want to do this. I guess they want to reduce the memory footprint due to the heavy usages of Type class in IR operations.


2016年1月2日 星期六

[Quick Note] OrcJIT in LLVM

JIT(Just in Time) compilation becomes more and more popular in recent years due to the rising demand on higher dynamic type language performance, Javascript and Python, to name a few.

However, LLVM, which is developed mainly for static type language like C/C++, also joined JIT's battlefield few years ago. Although applying LLVM on dynamic compilation has some drawbacks, for example, long running optimizations increase latencies and inadequate type system which is originally developed for static type rather than dynamic type. It doesn't stop the community from working on this subsystem. More and more enhancements come up like native support for patch points and stack maps, which debut in version 3.8 as experiment features.

There are mainly two kinds of compilation strategies: method base and trace base. Difference between them falls on the compilation scope they adopt. As their name suggests, method-based approach take functions or methods as minimum compilation units. Google's V8 Javascript engine is a well-known example. Trace-based JIT use traces, where we can treat them as a range of control flow, as the compilation units. Firefox's TraceMonkey and Lua's JIT use this kind of approach. Which of them is better may required several academic papers to explain and the debate still goes on now.

Back to LLVM, the main JIT implementation is called MCJIT. It adopted the popular MC- framework within LLVM, which is kind of LLVM's own assembler, as its backend to gain more flexibility. MCJIT use non of strategies mentioned above, instead, it use LLVM's Module as the compilation unit. You can use the llvm::EngineBuilder class to build a llvm::ExecutionEngine, which is the main JIT interface, then you can either retrieve compiled symbols or run compiled functions directly by its rich APIs like getPointerToFunction or runFunction.

There are several drawbacks on using module as compilation unit, in short, module scope is too big in comparison with method and trace, which may spend too much time on compilation procedure and increase latency. And that's what OrcJIT tries to fix.

Orc is the shorthand of on-request-compiling. As the name suggests, it lazily performs the compilation process of each functions within the module until it is used. OrcJIT also build from scratch instead of building on MCJIT. Even more, it introduced a concept called Layer. Layer is like LLVM's IR pass, but handles compilation procedure rather than IR. It provides flexibilities for building your own JIT compilation flow. Nevertheless, OrcJIT still lacks of central layer manager similar to llvm::PassManager and even common layer interface! In another word, we can only construct our own compilation "stack" by passing the base layer instance to the first argument of another layer constructor and non of the layer class inherit a common parent class as interface. For example, the lli program construct its compilation stack as below:
OrcLazyJIT(std::unique_ptr<TargetMachine> TM,
             std::unique_ptr<CompileCallbackMgr> CCMgr,
             IndirectStubsManagerBuilder IndirectStubsMgrBuilder,
             bool InlineStubs)
             : TM(std::move(TM)), DL(this->TM->createDataLayout()),
             CCMgr(std::move(CCMgr)),
             ObjectLayer(),
             CompileLayer(ObjectLayer, orc::SimpleCompiler(*this->TM)),
             IRDumpLayer(CompileLayer, createDebugDumper()),
             CODLayer(IRDumpLayer, extractSingleFunction, *this->CCMgr,
                 std::move(IndirectStubsMgrBuilder), InlineStubs),
             CXXRuntimeOverrides(
                 [this](const std::string &S) { return mangle(S); }) {}

The compilation flow would run as the order: ObjectLayer, CompileLayer, IRDumpLayer, CODLayer. Where CODLayer is the instance of CompileOnDemandLayer and treated as the highest level class in OrcJIT. We would use CODLayer to add module by addModuleSet and invoke findSymbol to retrieve compiled function symbols. OrcJIT would differ the compilation of each function until we called findSymbol.

Pros of OrcJIT would be the decreasing compilation region, brings potential to add profiler for detecting hot method which is used in merely every dynamic compilation framework nowadays. On the other hand, OrcJIT does not use the llvm::ExecutionEngine interface. Although it provides llvm::orc::OrcMCJITReplacement as an wrapper,  the main author tends to use its own interface, this raises lots of questions and arguments on the mailing list from the debut of OrcJIT. What's more, as previous mentioned, OrcJIT's layer still has no common interface and layer manger.

I was pretty excited when I saw OrcJIT, because smaller compilation unit is what trace-based JIT overwhelms method-based JIT. It looks that LLVM gets another armor in the battlefield of dynamic compilation.