Saturday, June 24, 2017

Device Interfaces in TensorFlow

TensorFlow has taken the developer world by storm in recent years. It enables a developer, even a newbie to the machine learning world, to build a neural network in just a couple of minutes. It is also designed to run on a variety of devices and platforms, such as CPUs, GPUs, and distributed systems. In this article we're going to focus on that latter feature, to see how TensorFlow interacts with different hardware to perform those heavy computations.

Scenario

We want to find out how to add new devices to TensorFlow so that kernels can execute on top of them.

In an ideal world, every kernel implementation is independent of the underlying device. That is, one can execute a kernel on a variety of devices without, or with minimal, modification to the kernel code. That way, if we want to use an alternative device, say an FPGA, to power the computations, all we need to do is implement some sort of device interface (maybe a C++ class?) rather than rewrite all of the kernels. Kernels would leverage the device interface to perform critical, more primitive calculations (in comparison with the ML algorithms built on top of them), such as matrix multiplications.
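
Just to make that idea concrete, here is a rough, hypothetical sketch of what such an interface could look like. Nothing below is an actual TensorFlow API; the names are invented for illustration.

#include <cstddef>

// Hypothetical device interface (NOT a real TensorFlow class): kernels would
// call these primitives and stay agnostic of the hardware behind them.
class ComputeDeviceInterface {
 public:
  virtual ~ComputeDeviceInterface() = default;

  // Memory management in the device's address space.
  virtual void* Allocate(std::size_t num_bytes) = 0;
  virtual void Deallocate(void* ptr) = 0;

  // A primitive a kernel could delegate to: C = A * B for row-major
  // matrices, where A is m x k and B is k x n.
  virtual void MatMul(const float* a, const float* b, float* c,
                      int m, int k, int n) = 0;
};

// An FPGA backend would then subclass this interface, and kernels written
// against it would not need to change.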

Take 1: The tensorflow::Device class

Let's start with this SO question. The answer indicates that if we want to add a new device, we need to implement the tensorflow::Device class and register it with some macro. In that class (tensorflow/core/common_runtime/device.h) there is a noteworthy virtual function that is most likely the place where the target-specific computation logic is implemented: Device::Compute.

  // Performs the actual compute function.
  //
  // Subclasses may override this function if they wish to perform
  // some initialization before each compute.
  virtual void Compute(OpKernel* op_kernel, OpKernelContext* context) {
    op_kernel->Compute(context);
  }

Unlike the Compute function in the OpKernel class, which performs the real computation, the Compute function here acts more like a wrapper around OpKernel::Compute. As the comment says, this function is responsible for setting up or initializing the device context before each kernel computation. Let's look at a concrete example: BaseGPUDevice (tensorflow/core/common_runtime/gpu/gpu_device.h).

The thing we care about most is the BaseGPUDevice::Compute function, which delegates basically all of its primary computation logic to BaseGPUDevice::ComputeHelper, shown below (some verbose code has been trimmed).

void BaseGPUDevice::ComputeHelper(OpKernel* op_kernel,
                                  OpKernelContext* context) {
  GPUDeviceContext* gpu_device_context = device_contexts_[0];
  if (context->op_device_context() != nullptr) {
    gpu_device_context =
        static_cast<GPUDeviceContext*>(context->op_device_context());
  }
  gpu::Stream* stream = gpu_device_context->stream();
  //const auto stream_id = gpu_device_context->stream_id();

  const auto num_streams = streams_.size();
  if (num_streams > 1) {
    // If this op's device context is different from the other contexts,
    // we must wait on the stream.
    for (int i = 0; i < context->num_inputs(); ++i) {
      const GPUDeviceContext* idc =
          static_cast<GPUDeviceContext*>(context->input_device_context(i));

      if (idc->stream() != stream) stream->ThenWaitFor(idc->stream());
    }
  }
  gpu::cuda::ScopedActivateExecutorContext scoped_activation{stream->parent()};
  op_kernel->Compute(context);
  if (context->status().ok()) {
    if (sync_every_op_) {
      // Note: GPUUtil::Sync() only syncs the default stream.
      // We need to either sync the stream used by this op, or
      // all streams.  Given that this flag is typically used for
      // debugging it makes more sense to sync all GPU activity.
      context->SetStatus(GPUUtil::SyncAll(this));
    }
  }
}

Before calling the kernel's Compute function near the end of ComputeHelper, this function waits for the input arguments to finish if their (CUDA) streams are different from the kernel's stream.

Now it's pretty clear that this interface is NOT the one we desire in the scenario mentioned previously. It seems that the target-specific logic is implemented in the kernels themselves. For example, the bias_op kernel is separated into two classes, Bias and BiasGPU, which are located in bias_op.cc and bias_op_gpu.cu.cc under tensorflow/core/kernels, respectively. CUDA code is hard-coded into the implementation of BiasGPU::compute (lines 78-88), as shown below.

  CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
  if (data_format == FORMAT_NHWC) {
    BiasNHWCKernel<
        T><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
        config.virtual_thread_count, input, bias, output, bias_size);
  } else {
    BiasNCHWKernel<
        T><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
        config.virtual_thread_count, input, bias, output, bias_size,
        image_size);
  }

Then, at runtime, the framework picks the corresponding kernel variant depending on the launch configuration.
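
For reference, this is roughly how a kernel variant gets tied to a device type at registration time. The snippet below is a simplified sketch modeled on the REGISTER_KERNEL_BUILDER calls in bias_op.cc; the exact class and attribute names may differ between TensorFlow versions.

// Simplified sketch of per-device kernel registration: the same op name is
// bound to different kernel classes depending on the Device() constraint.
REGISTER_KERNEL_BUILDER(
    Name("BiasAdd").Device(DEVICE_CPU).TypeConstraint<float>("T"),
    BiasOp<CPUDevice, float>);

REGISTER_KERNEL_BUILDER(
    Name("BiasAdd").Device(DEVICE_GPU).TypeConstraint<float>("T"),
    BiasOp<GPUDevice, float>);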

Take 2: The Eigen::TensorDevice class

While inspecting tensorflow/core/kernels/bias_op_gpu.cu.cc, we found an interesting thing at line 29:
typedef Eigen::GpuDevice GPUDevice;
Though instances of this type are only used for accessing the CUDA stream in this file, we're still curious about the relationship between Eigen and TensorFlow here.

Eigen is a famous linear algebra library, and TensorFlow uses it heavily in its codebase, since ML algorithms involve many linear algebra calculations (matrix multiplications again, for example). The question is: how does TensorFlow take advantage of Eigen?

Before moving forward, you should know that, for reasons related to the build system, part of the Eigen library code introduced here is not present in the TensorFlow codebase; it's stored here and here. I recommend learning where these two URLs can be found from note [1] below, since they may vary from version to version.

Let's look at QuantizeAndDequantizeOneScaleImpl::Compute (tensorflow/core/kernels/quantize_and_dequantize_op.h). First, note that this class is not divided into separate CPU and GPU variants. The Device template type, which eventually resolves to types based on Eigen::TensorDevice, and the argument Device& d play the main roles here. Here is an example of how the latter is used (lines 71-76):

        out.device(d) =
            ((input.cwiseMin(max_range).cwiseMax(min_range) - min_range) *
                 scale +
             T(0.5)).floor() *
                inverse_scale +
            min_range;

The out variable above is a Tensor, and the RHS expression also resolves to a tensor expression. Nevertheless, instead of assigning the result of the RHS to out directly, evaluation of the RHS is postponed and the entire RHS expression is delegated to Eigen::TensorDevice::operator=. Implementations of Eigen::TensorDevice, Eigen::GpuDevice [2] for example, are responsible for executing the expression passed in.
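
To see this deferred-evaluation pattern outside of TensorFlow, here is a minimal, self-contained sketch using Eigen's Tensor module with the default CPU device. Swapping in an Eigen::ThreadPoolDevice or Eigen::GpuDevice would change where the expression runs, not how it is written.

#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  Eigen::Tensor<float, 1> a(8), b(8), out(8);
  a.setConstant(1.0f);
  b.setConstant(2.0f);

  // The RHS only builds an expression tree; nothing is evaluated until
  // TensorDevice::operator= executes the expression on the chosen device.
  Eigen::DefaultDevice device;
  out.device(device) = a * 0.5f + b;
  return 0;
}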

Summary

Now we know there are two ways to enable kernel execution on new devices:
  1. Modify the kernel source with device-specific code, or add another variant of that kernel (e.g. Bias and BiasGPU).
  2. Implement another Eigen::TensorDevice (e.g. Eigen::GpuDevice).
These methods are complementary; adopt one or the other depending on the properties of the device and the kernels. For example, if the operations are strongly related to linear algebra, the second method is more adequate; otherwise, the first one might be more expressive, although it may require lots of kernel modifications (though not for every kernel, since some kernels just can't be executed on devices other than the CPU).

[1]: TensorFlow chooses to use part of the Eigen library without any modification, so the build system does not fetch the archived library files from Eigen's official repository until the first build. This behavior is defined in tensorflow/workspace.bzl, line 148.

[2]: Eigen/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceCuda.h

Friday, April 28, 2017

[Quick Tutorial] How to USE LLVM with CMake

Perhaps I've hung around the LLVM developer world for too long, so long that I almost forgot that LLVM is a framework consisting of libraries - libraries for application developers to USE.

Recently I found that there are developers who use LLVM (and related) tools like clang and llc, have some experience surfing the LLVM codebase, and have even played around with some small hacks - but they're not sure how to USE LLVM as a normal set of libraries during application development.

Of course, you can just use a command similar to the one below to build your application with LLVM.
g++ -I<LLVM Include Path> -L<LLVM Library Path> foo.cc -o foo -lLLVMSomeLib
But well, it's the 21st century; we have CMake to make the C++ build process neater and easier. And since LLVM migrated to CMake as its default and only build system as of version 3.7, it provides some great CMake modules for application developers to integrate LLVM into their projects faster and error-free. Here are two examples, for small and large projects.

Small Projects

Actually, I prefer to say "small executable" rather than "small project", because this example is suitable for the scenario where you just need to embed LLVM in a single executable - maybe a scratch program to illustrate how certain components work, or an executable that uses the utility libraries within LLVM, the command line library for example.

First, in your CMakeLists.txt, define the LLVM_DIR variable with the path to the cmake module folder within the LLVM install dir (yeah, I know the variable name is pretty confusing). Usually the path is <LLVM Install Folder>/lib/cmake/llvm. After that, find the LLVM package:
find_package(LLVM REQUIRED CONFIG) 
Several variables are then defined for us to use, including the include folder path:
include_directories(${LLVM_INCLUDE_DIRS}) 
And some compiler definitions:
add_definitions(${LLVM_DEFINITIONS})
What about libraries? Here comes the magic: a convenient cmake function is provided to take care of the burden of library names and paths:
llvm_map_components_to_libnames(_LLVM_LIBS support core irreader) 
The _LLVM_LIBS variable is used as the output variable, so it will store the library names for the given components - Support, Core, and IRReader in this case. We just need to give the component names in lowercase, and llvm_map_components_to_libnames translates them into LLVMSupport, LLVMCore, etc.

Finally, we link those libraries to our target:
add_executable(fooProgram foo.cpp bar.cpp)
target_link_libraries(fooProgram ${_LLVM_LIBS})
Here is the full code listing:
cmake_minimum_required(VERSION 3.6)
project(LLVMExperiments CXX)

set(CMAKE_CXX_STANDARD 11)

set(LLVM_DIR /usr/local/opt/llvm-clang/4.0/lib/cmake/llvm)
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Use LLVM ${LLVM_PACKAGE_VERSION}")
include_directories(${LLVM_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})

# LLVM libraries are static linking by default
# So pick only the component libraries you need
# in case your final executable becomes large in size
llvm_map_components_to_libnames(_LLVM_LIBS support core analysis irreader)

set(SOURCE_FILES
        main.cpp)
add_executable(LLVMExperiments ${SOURCE_FILES})
target_link_libraries(LLVMExperiments ${_LLVM_LIBS})

Large Projects / Loadable Components

As the section title indicates, this method is the only way I've figured out to build LLVM loadable modules, especially LLVM Passes, with CMake. Also, this method is strongly recommended if you want to develop tools or components that may be merged into the upstream LLVM codebase in the future.

The first thing you may need to change is the project's folder structure. We're going to create a subdirectory to hold the main code for your tool, and create another CMakeLists.txt within it. So there will be two cmake build scripts: <Project Folder>/CMakeLists.txt and <Project Folder>/<Tool Folder>/CMakeLists.txt.

Let's look at the first build script, the one in the upper directory. In the previous section, "Small Projects", some of you may have bumped into the problem that the LLVM libraries you use aren't built with RTTI, so you need to add the "-fno-rtti" flag to the compiler options in your cmake build script. But you want to make the build script more general, so you add the following code:
if (NOT ${LLVM_ENABLE_RTTI})
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-rtti")
endif()

Well, that handles the RTTI case... what about other similar flags? It's pretty cumbersome to add a bunch of if-else statements for them. Here comes the solution:
list(APPEND CMAKE_MODULE_PATH ${LLVM_CMAKE_DIR})

set(LLVM_RUNTIME_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/bin)
set(LLVM_LIBRARY_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/lib)

include(HandleLLVMOptions)

Add the lines above after the LLVM package-finding statement. The LLVM_RUNTIME_OUTPUT_INTDIR and LLVM_LIBRARY_OUTPUT_INTDIR variables are required by the HandleLLVMOptions module, but feel free to set them to whatever paths you want the resulting binaries and libraries to be output to.

Finally, in our upper-directory build script, add two additional statements:
include(AddLLVM)
add_subdirectory(MyToolSubDirectory)

MyToolSubDirectory is the subdirectory name for your tool, library or loadable component.
Here is the full code listing for the build script in the upper directory:

cmake_minimum_required(VERSION 3.7)
project(LLVMExperiments2 CXX)

set(CMAKE_CXX_STANDARD 11)

set(LLVM_DIR /usr/local/opt/llvm-clang/current/lib/cmake/llvm)
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Using LLVM version ${LLVM_PACKAGE_VERSION}")

list(APPEND CMAKE_MODULE_PATH ${LLVM_CMAKE_DIR})

set(LLVM_RUNTIME_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/bin)
set(LLVM_LIBRARY_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/lib)

include(HandleLLVMOptions)
include(AddLLVM)

add_definitions(${LLVM_DEFINITIONS})
include_directories(${LLVM_INCLUDE_DIRS})

add_subdirectory(MyToolSubDirectory)

Let's move on to the subdirectory, which is where your code resides. The cmake build script here, however, is pretty simple. If you're building an executable, these are the only statements you need to add:

set(LLVM_LINK_COMPONENTS
        Support
        Core
        IRReader
        )

add_llvm_executable(MyExecutable
        main.cpp
        foo.cpp
        )

MyExecutable is the name of your executable. Defining the LLVM_LINK_COMPONENTS variable obviously adds the component libraries to link against your executable.

For loadable components, the build script is even simpler:

add_llvm_loadable_module(MyLLVMPass
        pass.cpp
        foo.cpp
        )

That's it!

Here is a simple LLVM pass example I wrote for your reference:
https://github.com/mshockwave/LLVM-Sample-Pass 

Tuesday, June 14, 2016

Facebook Android SDK in 10 Minutes - Login Part

I appreciate Facebook providing a fancy Android SDK that keeps us from having to touch the messy OAuth2 authorization flow. However, the official documentation page is not fancy at all: fragmented code snippets and unclear descriptions just make beginners even more confused. So here is a quick start for beginners.
(From the official documentation page) "If the access token is available already assign it" ??!

Scenario

Due to the internal design of the SDK, integrating Facebook Login may depend on your app structure. In this article, I'm going to show one of those scenarios.
  1. The user sees a splash Activity (an entry Activity) right after launching the app.
  2. The splash Activity checks the login information.
  3. If the user has logged in, jump to the main Activity; otherwise, jump to the login/register Activity. (Of course, the main Activity uses Facebook APIs.)
This login flow is slightly different from the one in the official documentation page, which seems to leverage Fragments. Also, I simply use the LoginButton widget from the SDK.
(Note: the Facebook SDK version here is v4.12.1. I skip the setup steps, which, unlike the development documents, are written pretty well in the official tutorial.)

Implementation

In the splash Activity (here called "SplashActivity"), we have the following code snippet.

public class SplashActivity extends AppCompatActivity {

    private Intent mLoginIntent;
    private Intent mMainIntent;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        mMainIntent = new Intent(this, MainActivity.class);
        mLoginIntent = new Intent(this, LoginActivity.class);

        FacebookSdk.sdkInitialize(getApplicationContext(), new FacebookSdk.InitializeCallback() {
             @Override
             public void onInitialized() {
                 checkFbLogin(AccessToken.getCurrentAccessToken());
             }
        });
        AppEventsLogger.activateApp(getApplication());

    }

    private void checkFbLogin(AccessToken currentToken){
        if(currentToken != null){
            startActivity(mMainIntent);
        }else{
            startActivity(mLoginIntent);
        }
        finish();
    }
}

FacebookSdk.sdkInitialize() must be called as early as you can. SplashActivity, as previously mentioned, needs to check the login information. Lots of people have complained that AccessToken.getCurrentAccessToken() doesn't behave as the documentation says, i.e. it should return null if the user hasn't logged in. The key is that the real "token loader" is FacebookSdk.sdkInitialize(), which is asynchronous, so you need to wait a little for SDK initialization to finish. Nevertheless, you should leverage the "on complete" callback of FacebookSdk.sdkInitialize() instead of pausing for a while with Thread.sleep() or a Handler, because the proper pausing interval may vary from device to device.

Here is the LoginActivity.

@Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        if(!FacebookSdk.isInitialized()){
            FacebookSdk.sdkInitialize(getApplicationContext());
            AppEventsLogger.activateApp(getApplication());
        }

        setContentView(R.layout.activity_login);

        // Login button provided by SDK
        LoginButton butFbLogin = (LoginButton)findViewById(R.id.but_fb_login);
        assert butFbLogin != null;

        mLoginCallbackManager = CallbackManager.Factory.create();

        butFbLogin.setReadPermissions("email");
        butFbLogin.registerCallback(mLoginCallbackManager, new FacebookCallback<LoginResult>() {
            @Override
            public void onSuccess(LoginResult loginResult) {
                if(loginResult.getAccessToken() != null){
                    Intent intent = new Intent(LoginActivity.this, MainActivity.class);
                    intent.addFlags(Intent.FLAG_ACTIVITY_CLEAR_TASK | Intent.FLAG_ACTIVITY_NEW_TASK);
                    startActivity(intent);
                }else{
                    Log.e(TAG, "Returned FB access token null");
                }
            }

            @Override
            public void onCancel() {
                //TODO
            }

            @Override
            public void onError(FacebookException error) {
                error.printStackTrace();
                Toast.makeText(LoginActivity.this, "Error Login to Facebook", Toast.LENGTH_LONG).show();
            }
        });
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data){
        super.onActivityResult(requestCode, resultCode, data);
        mLoginCallbackManager.onActivityResult(requestCode, resultCode, data);
    }

The login page is rather straightforward. A few things need care: FacebookSdk.sdkInitialize() must be called before setContentView(), since it seems that parts of the XML for the widgets provided by the SDK are modified or generated dynamically. Also, if a user revokes your app's permissions from her account, AccessToken.getCurrentAccessToken() will still return a non-null result, so you may need extra care when making API calls.

Monday, March 21, 2016

Ignition - The Interpreter in V8 Javascript Engine

As mentioned in my previous article, V8 has started to use a new interpreter, Ignition, to replace the baseline compiler on some low-memory devices. Let's check it out!

Interpreters are notorious for being a performance bottleneck, and they have also proven to be slower than most JIT runtimes. I've explained some of the bottlenecks of interpreters, along with background knowledge, in one of my blog posts (sections 2 and 3). So I think the folks on the Chromium team must have spent a lot of effort on Ignition to bring its performance close to the old JIT solution.

There is already a document written by the Chromium team about Ignition. To summarize its content, Ignition uses a streaming execution flow that executes one bytecode handler after another. That is, a bytecode handler invokes the dispatcher at the end of its function body, and the dispatcher jumps straight to the next handler. Of course, tail call optimization is required to avoid smashing the stack. The sketch below illustrates the general shape of this execution flow.
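
Here is a minimal C++ sketch of that handler-to-handler dispatch style. This is not V8 code and the names are invented; it only shows the shape of the control flow: each handler finishes by calling the dispatcher, which jumps to the next handler through a table, so control never returns to a central loop. A production interpreter must guarantee that the final call really is a tail call, which is what Ignition's generated code takes care of.

#include <cstddef>
#include <cstdint>
#include <vector>

struct State {
  std::vector<uint8_t> code;
  std::size_t pc = 0;
  int64_t acc = 0;
};

using Handler = void (*)(State&);
extern const Handler kDispatchTable[256];

// Decode the next opcode and jump to its handler.
inline void Dispatch(State& s) {
  kDispatchTable[s.code[s.pc]](s);  // must be compiled as a tail call in practice
}

void DoAddOne(State& s) {
  s.acc += 1;   // the handler's actual work
  ++s.pc;       // advance to the next bytecode
  Dispatch(s);  // last action: hand control to the next handler
}

void DoReturn(State& s) {
  // Terminal handler: simply do not dispatch again.
}

const Handler kDispatchTable[256] = {DoAddOne, DoReturn /* remaining opcodes omitted */};
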
So let's start our code journey. Everything starts from Interpreter::Initialize() (src/interpreter/interpreter.cc), where we immediately encounter a strange macro: GENERATE_CODE(Name, ...).
#define GENERATE_CODE(Name, ...)                                        \
  {                                                                     \
    InterpreterAssembler assembler(isolate_, &zone, Bytecode::k##Name); \
    Do##Name(&assembler);                                               \
    Handle<Code> code = assembler.GenerateCode();                       \
    dispatch_table_[Bytecodes::ToByte(Bytecode::k##Name)] = *code;      \
    TraceCodegen(code);                                                 \
    LOG_CODE_EVENT(isolate_,                                            \
                   CodeCreateEvent(Logger::BYTECODE_HANDLER_TAG,        \
                                   AbstractCode::cast(*code), #Name));  \
  }
  BYTECODE_LIST(GENERATE_CODE)
#undef GENERATE_CODE

It is undefined right away, so we can infer that it is only used by BYTECODE_LIST. As its name suggests, BYTECODE_LIST declares all of the bytecodes, and GENERATE_CODE here just helps it define their handlers and some dispatch information. The Do##Name(&assembler) statement seems to have something to do with bytecode handlers, so let's grep one of them and see what's going on.
// Mov <src> <dst>
//
// Stores the value of register <src> to register <dst>.
void Interpreter::DoMov(InterpreterAssembler* assembler) {
  Node* src_index = __ BytecodeOperandReg(0);
  Node* src_value = __ LoadRegister(src_index);
  Node* dst_index = __ BytecodeOperandReg(1);
  __ StoreRegister(src_value, dst_index);
  __ Dispatch();
}

This is the bytecode handler for mov. It looks just like a normal bytecode handler, right? Take a closer look, though: Do##Name(&assembler) is a function call rather than the callback object we usually see in many interpreters. In other words, DoMov acts more like a routine responsible for configuring the bytecode's actions. After scrolling to the top of interpreter.cc, you'll find that the inconspicuous "__" above turns out to be pretty important.
#define __ assembler->

InterpreterAssembler is responsible for encapsulating a small group of assembly instructions as a bytecode handler, and also for abstracting the underlying platform to provide a uniform assembly-emitting interface. What's more, we will see later how much power this kind of design has in comparison with the normal byte-code-handler-function fashion.

Next, we'll focus on the InterpreterAssembler::Dispatch() function. Its work is delegated to several routines.
void InterpreterAssembler::Dispatch() {
  DispatchTo(Advance(Bytecodes::Size(bytecode_)));
}

void InterpreterAssembler::DispatchTo(Node* new_bytecode_offset) {
  Node* target_bytecode = Load(
      MachineType::Uint8(), BytecodeArrayTaggedPointer(), new_bytecode_offset);
  if (kPointerSize == 8) {
    target_bytecode = ChangeUint32ToUint64(target_bytecode);
  }

  // TODO(rmcilroy): Create a code target dispatch table to avoid conversion
  // from code object on every dispatch.
  Node* target_code_object =
      Load(MachineType::Pointer(), DispatchTableRawPointer(),
           WordShl(target_bytecode, IntPtrConstant(kPointerSizeLog2)));

  DispatchToBytecodeHandler(target_code_object, new_bytecode_offset);
}

void InterpreterAssembler::DispatchToBytecodeHandler(Node* handler,
                                                     Node* bytecode_offset) {
  if (FLAG_trace_ignition) {
    TraceBytecode(Runtime::kInterpreterTraceBytecodeExit);
  }

  InterpreterDispatchDescriptor descriptor(isolate());
  Node* args[] = {GetAccumulator(),          RegisterFileRawPointer(),
                  bytecode_offset,           BytecodeArrayTaggedPointer(),
                  DispatchTableRawPointer(), GetContext()};
  TailCall(descriptor, handler, args, 0);
}

The Advance() call in Dispatch() clearly steps to the next bytecode, and the Load() call in DispatchTo() is responsible for loading the bytecode handler's code object. Everything looks normal, except the TailCall() at the bottom of DispatchToBytecodeHandler().

As mentioned in the sections above, Ignition invokes bytecode handlers one after another in conjunction with dispatchers. Another key point is that tail call optimization is required for this concatenation approach, or a stack overflow may occur. The TailCall() in the above snippet resides in the CodeStubAssembler class, which is also the parent class of InterpreterAssembler; it then delegates the flow to RawMachineAssembler::TailCallN().
Node* RawMachineAssembler::TailCallN(CallDescriptor* desc, Node* function,
                                     Node** args) {
  int param_count =
      static_cast<int>(desc->GetMachineSignature()->parameter_count());
  int input_count = param_count + 1;
  Node** buffer = zone()->NewArray<Node*>(input_count);
  int index = 0;
  buffer[index++] = function;
  for (int i = 0; i < param_count; i++) {
    buffer[index++] = args[i];
  }
  Node* tail_call = MakeNode(common()->TailCall(desc), input_count, buffer);
  NodeProperties::MergeControlToEnd(graph(), common(), tail_call);
  schedule()->AddTailCall(CurrentBlock(), tail_call);
  current_block_ = nullptr;
  return tail_call;
}
There are many of V8's abstraction types, like Node and CallDescriptor, in the above code; let's not dig into those objects just yet. What we can infer is that a Node object seems to represent a piece of execution code, and TailCallN() creates a special "tail call node" that is appended at the bottom of the current execution block.

This approach sounds reasonable, and it also highlights the advantage we mentioned before of adopting InterpreterAssembler rather than plain bytecode handler functions. If we used ordinary functions as handlers and still wanted the same cascading execution fashion, tail call optimization would depend entirely on the compiler. I'm not saying that modern compilers can't do this job well, but we want more explicit control over the behavior of our interpreter. In contrast, InterpreterAssembler abstracts handlers into pieces of generated code that can be freely arranged anywhere in the execution flow, so getting the same effect as tail call optimization is just a piece of code (cake).

Last but not least, we want to know more about the Node class. It turns out that Node is just an interface representing a node in a graph, and there are lots of implementations across the V8 codebase, like AstGraph, BytecodeGraph, and RawMachineAssembler in our case.

There are two common traits shared by Ignition and Android ART's interpreter, which I covered in my previous article:

  1. They call bytecode handlers and dispatchers (bytecode decoders) one after another, without returning to a parent function or using a loop statement to iterate through the bytecodes.
  2. They directly emit assembly code as the main execution core, for both performance and design concerns (like achieving tail call optimization).

In my opinion, both of the traits above represent the state of modern interpreter design. We can use them in many projects, as long as they have a similar interpreting execution flow.

Wednesday, March 16, 2016

Interpreter in Android ART

Previously, we discussed the new JIT execution engine in ART that debuted in the Marshmallow codebase. At the end of that article, I argued that the design of the interpreter is more important than the JIT compiler, since the interpreter is often the performance bottleneck. Thus, ART's interpreter is the very thing we're going to talk about in this article.

Over the decades, people working on interpreters have come to one conclusion: interpreting will never be faster than compiled code execution. ART's interpreter is no exception. But don't be too sad: it's just a tradeoff between performance and other factors like portability. There is no silver bullet at this time. One of the notorious factors degrading interpreter performance is CPU pipeline stalling. (Feel free to skip the next two paragraphs if you're familiar with pipeline stalling.)

The CPU's pipeline needs to be as full as possible to reach peak performance and maximum instruction-level parallelism. But branch operations (e.g. if, switch) interrupt this model: before the condition expression (e.g. the expression between the parentheses of if(...)) has been evaluated, the CPU can't fill the pipeline with instructions because it doesn't know which branch to take. Thus, the CPU tries to "guess" which branch will execute and loads that branch's instructions ahead of the evaluation of the condition expression. The rationale is that if the CPU "guesses wrong", the cost of reloading the correct instructions is about the same as the cost of waiting for the condition expression to complete (i.e. "stalling" the pipeline). This technique is called branch prediction.

However, a few things make branch prediction even harder: indirect addressing and a large number of conditional branches, to name a couple. Indirect addressing is used in a variety of situations, for example an indirect function call, where the address of the invoked function is stored in a pointer variable. Prediction of indirect branches has improved in modern CPUs. What we're interested in is the second issue, a large number of conditional branches. Most branch prediction techniques use history-based profiling: in short, they record how often each branch is taken and make predictions based on those statistics. Thus, too many conditional branches, a large switch block for example, may not predict well, since the branch history storage is limited and a large number of branches may frequently flush it.

According to the above explanation, an interpreter that uses a large switch to decode opcodes is not very efficient. However, the switch implementation is simpler to build and has more freedom to do higher-level tasks in comparison with the assembly approach introduced later. Thus, many famous projects still use this approach; for example, Firefox's javascript engine, SpiderMonkey, used this sort of technique in its interpreter a few years ago. ART's interpreter, in our case, also uses a switch implementation as a fallback, which we'll see in the following sections. A rough sketch of this style of dispatch loop is shown below.
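
For contrast, here is a minimal sketch (not ART code) of the switch-style dispatch loop described above; every opcode is funneled through one big switch, whose indirect branch is hard for the CPU to predict.

#include <cstddef>
#include <cstdint>
#include <vector>

enum Op : uint8_t { kAddOne, kSubOne, kReturn };

int64_t Run(const std::vector<uint8_t>& code) {
  int64_t acc = 0;
  for (std::size_t pc = 0; pc < code.size(); ++pc) {
    switch (code[pc]) {  // one central decode point for every opcode
      case kAddOne: acc += 1; break;
      case kSubOne: acc -= 1; break;
      case kReturn: return acc;
    }
  }
  return acc;
}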

So let's start our journey into ART's interpreter from Execute() in runtime/interpreter/interpreter.cc. First, if we're at the beginning of the method, it checks whether a JIT-compiled version of the method exists.
  if (LIKELY(shadow_frame.GetDexPC() == 0)) {  // Entering the method, but not via deoptimization.
    if (kIsDebugBuild) {
      self->AssertNoPendingException();
    }
    instrumentation::Instrumentation* instrumentation = Runtime::Current()->GetInstrumentation();
    ArtMethod *method = shadow_frame.GetMethod();

    if (UNLIKELY(instrumentation->HasMethodEntryListeners())) {
      instrumentation->MethodEnterEvent(self, shadow_frame.GetThisObject(code_item->ins_size_),
                                        method, 0);
    }

    jit::Jit* jit = Runtime::Current()->GetJit();
    if (jit != nullptr && jit->CanInvokeCompiledCode(method)) {
      JValue result;

      // Pop the shadow frame before calling into compiled code.
      self->PopShadowFrame();
      ArtInterpreterToCompiledCodeBridge(self, code_item, &shadow_frame, &result);
      // Push the shadow frame back as the caller will expect it.
      self->PushShadowFrame(&shadow_frame);

      return result;
    }
  }

Then, it selects the proper interpreter implementation.
    if (kInterpreterImplKind == kMterpImplKind) {
      if (transaction_active) {
        // No Mterp variant - just use the switch interpreter.
        return ExecuteSwitchImpl<false, true>(self, code_item, shadow_frame, result_register,
                                              false);
      } else if (UNLIKELY(!Runtime::Current()->IsStarted())) {
        return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                               false);
      } else {
        while (true) {
          // Mterp does not support all instrumentation/debugging.
          if (MterpShouldSwitchInterpreters()) {
            return ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame, result_register,
                                                   false);
          }
          bool returned = ExecuteMterpImpl(self, code_item, &shadow_frame, &result_register);
          if (returned) {
            return result_register;
          } else {
            // Mterp didn't like that instruction.  Single-step it with the reference interpreter.
            result_register = ExecuteSwitchImpl<false, false>(self, code_item, shadow_frame,
                                                               result_register, true);
            if (shadow_frame.GetDexPC() == DexFile::kDexNoIndex) {
              // Single-stepped a return or an exception not handled locally.  Return to caller.
              return result_register;
            }
          }
        }
      }
    } else if (kInterpreterImplKind == kSwitchImplKind) {

The target interpreter we're going to research is called Mterp, which probably stands for "Macro inTerpreter". It uses hand-written assembly code to decode opcodes and dispatch their handlers efficiently. It seems cool, but this is actually the exact way Dalvik interprets dex bytecode; it just took some time to port Dalvik's interpreter to ART.

In addition to the kInterpreterImplKind variable, there are other cases that make the interpreter fall back to an older implementation like switch or goto. The first is MterpShouldSwitchInterpreters() in the listing above, but it is actually more of a debugging aid. In runtime/interpreter/mterp/mterp.cc:
extern "C" bool MterpShouldSwitchInterpreters()
    SHARED_REQUIRES(Locks::mutator_lock_) {
  const instrumentation::Instrumentation* const instrumentation =
      Runtime::Current()->GetInstrumentation();
  return instrumentation->NonJitProfilingActive() || Dbg::IsDebuggerActive();
}

The main entry point of Mterp is ExecuteMterpImpl(). We'll postpone analyzing that function and look at the next few lines of code first. The if statement that takes the return value of ExecuteMterpImpl() as its condition ensures that if Mterp can't handle an instruction, there is a fallback, since, as the comment says, Mterp doesn't support all instrumentation and debugging. But the question is: why is there an infinite while loop around this code? It turns out that ExecuteMterpImpl() executes the whole method, whereas ExecuteSwitchImpl() is asked here to execute only one instruction at a time ("single-stepped") before the loop hands control back to Mterp.

So where is ExecuteMterpImpl()? And how does it use assembly code to speed up interpretation?

Since this part is assembly code, it is architecture dependent and lives in separate folders under runtime/interpreter/mterp. Let's take arm as an example. Under the arm folder there are lots of files whose names start with "op_", each handling a single instruction. These scattered files are combined into a single assembly file, mterp_arm.S, generated into the out folder by a script, rebuild.sh. There are also files whose names start with "config-" that allow one to configure this code-generation process. Finally, in mterp_arm.S, we find our lovely ExecuteMterpImpl:
.text
    .align  2
    .global ExecuteMterpImpl
    .type   ExecuteMterpImpl, %function

/*
 * On entry:
 *  r0  Thread* self/
 *  r1  code_item
 *  r2  ShadowFrame
 *  r3  JValue* result_register
 *
 */

ExecuteMterpImpl:
    .fnstart
    .save {r4-r10,fp,lr}
    stmfd   sp!, {r4-r10,fp,lr}         @ save 9 regs
    .pad    #4
    sub     sp, sp, #4                  @ align 64

    /* Remember the return register */
    str     r3, [r2, #SHADOWFRAME_RESULT_REGISTER_OFFSET]

    /* Remember the code_item */
    str     r1, [r2, #SHADOWFRAME_CODE_ITEM_OFFSET]

    /* set up "named" registers */
    mov     rSELF, r0
    ldr     r0, [r2, #SHADOWFRAME_NUMBER_OF_VREGS_OFFSET]
    add     rFP, r2, #SHADOWFRAME_VREGS_OFFSET     @ point to vregs.
    VREG_INDEX_TO_ADDR rREFS, r0 @ point to reference array in shadow frame
    ldr     r0, [r2, #SHADOWFRAME_DEX_PC_OFFSET]   @ Get starting dex_pc.
    add     rPC, r1, #CODEITEM_INSNS_OFFSET     @ Point to base of insns[]
    add     rPC, rPC, r0, lsl #1 @ Create direct pointer to 1st dex opcode
    EXPORT_PC

    /* Starting ibase */
    ldr     rIBASE, [rSELF, #THREAD_CURRENT_IBASE_OFFSET]

    /* start executing the instruction at rPC */
    FETCH_INST                          @ load rINST from rPC
    GET_INST_OPCODE ip                  @ extract opcode from rINST
    GOTO_OPCODE ip                      @ jump to next instruction
    /* NOTE: no fallthrough */

The above is just part of the code; let's take a look at one of the operations, op_move.
/* ------------------------------ */
    .balign 128
.L_op_move: /* 0x01 */
/* File: arm/op_move.S */
    /* for move, move-object, long-to-int */
    /* op vA, vB */
    mov     r1, rINST, lsr #12          @ r1<- B from 15:12
    ubfx    r0, rINST, #8, #4           @ r0<- A from 11:8
    FETCH_ADVANCE_INST 1                @ advance rPC, load rINST
    GET_VREG r2, r1                     @ r2<- fp[B]
    GET_INST_OPCODE ip                  @ ip<- opcode from rINST
    .if 0
    SET_VREG_OBJECT r2, r0              @ fp[A]<- r2
    .else
    SET_VREG r2, r0                     @ fp[A]<- r2
    .endif
    GOTO_OPCODE ip                      @ execute next instruction

/* ------------------------------ */

Now we have a clearer picture of the execution flow: ExecuteMterpImpl is the entry point for one interpretation unit, a method in this case. It uses the GOTO_OPCODE macro to look up and execute the first opcode. At the end of each op handler, it fetches the next opcode, looks it up, and starts another op execution. The lookup procedure is documented in README.txt under the mterp folder: in short, every op handler's entry point is located at the base of the handler table + (opcode * 128), that is, the opcode shifted left by 7 bits. This approach creates a linear, streaming execution that contains almost no conditional branches. By the way, although it's not commonly noticed, there are actually lots of places in AOSP that use hand-written assembly to boost execution performance, for example the memory allocator for ART.
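
In C-like pseudocode, the lookup performed by GOTO_OPCODE boils down to something like the sketch below (the real thing is a couple of arm instructions behind the macro):

#include <cstdint>

// Handlers are laid out back to back at fixed 128-byte intervals (note the
// .balign 128 above), so the dispatcher computes the target address with a
// shift and an add instead of loading a pointer from a table.
using Handler = void (*)();

inline Handler HandlerFor(std::uintptr_t ibase, std::uint8_t opcode) {
  return reinterpret_cast<Handler>(ibase + (static_cast<std::uintptr_t>(opcode) << 7));
}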

ART's interpreter tries really hard to avoid branches and pipeline stalls. Although this technique already appeared in Dalvik, I think this sort of interpreter design is probably the fastest available at this time.


Wednesday, March 9, 2016

JIT In Android ART

Interestingly, although Android claims that its latest runtime (ART) adopts Ahead-Of-Time (AOT) compilation, a jit folder was silently shipped into the art/compiler folder within AOSP around the early era of Marshmallow.

The fact is that the installation procedure takes a long time on some devices running ART - Facebook sometimes takes 2 minutes to install! Perhaps that's the reason Android wants to move back to JIT.

Projects usually pair an interpreter with a JIT engine. That is, they interpret the code or bytecode first while collecting profile information, including how often a method is executed, a.k.a. how "hot" the method is, plus type information if you're working with a dynamically typed language. After a while, if a method is "hot" enough, the execution engine uses the JIT compiler to compile it into native code and delegates every later invocation of that method to the compiled version. Dalvik VM's JIT engine works this way, for example.
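
A minimal sketch of that counter-based tiering idea (this is not ART's actual code; the names and the threshold value are made up for illustration):

#include <cstdint>
#include <unordered_map>

struct Method;  // stand-in for a runtime's method object

void EnqueueForJit(Method* m) { /* hand the method to a compiler thread */ }

constexpr uint32_t kHotThreshold = 10000;  // illustrative value only
std::unordered_map<Method*, uint32_t> g_counters;

// Called by the interpreter every time a method is entered.
void OnMethodEntered(Method* m) {
  if (++g_counters[m] == kHotThreshold) {
    EnqueueForJit(m);  // compile once, then call native code afterward
  }
}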

Nevertheless, there are also projects that don't use an interpreter at all, and instead use an extremely fast baseline compiler to compile each method right before it executes, then use another optimizing compiler to recompile the "hot" methods. Google's V8 javascript engine is one example.

The latter approach is usually faster, but to my surprise, the new ART JIT adopts the former: the interpreter combo.

The great journey of ART's JIT starts from art/runtime/jit/jit_instrumentation.cc. Instrumentation in ART acts like a listener that listens for various interpretation and compilation events, e.g. method invocations, branches, and OSR (On-Stack Replacement). JitInstrumentationCache::CreateThreadPool() adds the JitInstrumentationListener instance to the runtime's instrumentation set.

JitInstrumentationListener listens for three events: method entry (JitInstrumentationListener::MethodEntered()), branches (JitInstrumentationListener::Branch()), and virtual or interface method invocation (JitInstrumentationListener::InvokeVirtualOrInterface()). The compilation triggers, instrumentation_cache_->AddSamples(...), reside within the method-entry and branch callbacks.

JitInstrumentationCache::AddSamples() shows that the ART JIT uses a slightly modified counter approach to profile execution flows. Usually a JIT compiler simply sets one counter threshold and triggers a compilation task once it is exceeded, but there seem to be THREE counter thresholds in this case: warm_method_threshold_, hot_method_threshold_, and osr_method_threshold_, which makes for a JIT system with more tiers. The values are passed in from the JVM arguments (the JVM is an interface, not a unique instance; ART is one of its implementations), but I can't find those arguments at this time. From the code arrangement, though, we can infer that warm_method_threshold < hot_method_threshold < osr_method_threshold. I'm also wondering how osr_method_threshold works.

If one of the thresholds is reached, it schedules a JitCompileTask. The following flow is pretty interesting: Jit::CompileMethod() is invoked, but it is actually just a stub around jit_compile_method(). What's special about jit_compile_method()? It's a C symbol loaded from the dynamic library libart-compiler.so. There's nothing special about libart-compiler.so itself; its source files live side by side with the source files mentioned above. I think modularization is the main reason they adopted this kind of ad-hoc approach.

After going into jit_compile_method(), OptimizingCompiler::TryCompile() is called. A few months ago there were two compilation levels in TryCompile(): CompileBaseline and CompileOptimized. But now those levels have been replaced by a neater, single-level approach:
  // Try compiling a method and return the code generator used for
  // compiling it.
  // This method:
  // 1) Builds the graph. Returns null if it failed to build it.
  // 2) Transforms the graph to SSA. Returns null if it failed.
  // 3) Runs optimizations on the graph, including register allocator.
  // 4) Generates code with the `code_allocator` provided.
  CodeGenerator* TryCompile(ArenaAllocator* arena,
                            CodeVectorAllocator* code_allocator,
                            const DexFile::CodeItem* code_item,
                            uint32_t access_flags,
                            InvokeType invoke_type,
                            uint16_t class_def_idx,
                            uint32_t method_idx,
                            jobject class_loader,
                            const DexFile& dex_file,
                            Handle<mirror::DexCache> dex_cache,
                            bool osr) const;

The comments explain almost everything. The graph is an instance of the HGraph class, on which it is easy to perform various compiler optimizations. The ART JIT uses a method-based JIT compiler, in contrast to the old Dalvik VM JIT, which used a trace-based compiler and switched to a method-based compiler only while the device was charging.

In summary, the JIT in ART doesn't seem to use any special techniques, so in my opinion the key to performance falls on the interpreter; I'll take some time to research that part.

Monday, January 11, 2016

[Quick Note] V8 Javascript Engine's First Stage Compiler

V8 is probably, in my opinion, the fastest javascript runtime at this time (no offense, SpiderMonkey; although benchmarks differ from one to another, V8 gets the best grades on average). There are some well-known properties that make it run so fast:

  • No interpreter; a baseline compiler is used instead. (This characteristic has just been subverted on the bleeding edge of the development tree, though.)
  • No intermediate representation (IR) in the baseline compiler. The optimizing (second-stage) compiler does use its own IR, though.
  • A method-based JIT strategy. This is fairly controversial, but as we'll see later, using the function as the compilation scope makes lots of work easier in comparison with a trace-based strategy, which spends lots of effort determining trace boundaries.
  • Hidden classes. In short, if an object modifies its layout, by adding a new field for example, V8 creates a new layout (class) instead of mutating the original one (e.g. using a linked list to "chain" object fields, as appears in many implementations of early-era javascript runtimes). A rough sketch of this idea follows this list.
  • Code patching and inline caches. These two features are less novel, since they have already been used in lots of dynamically typed runtimes for over twenty years.
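
Here is a rough sketch of the hidden-class idea. This is not V8's implementation; it only shows the mechanism: objects that gain the same properties in the same order end up sharing one shape, so a property access becomes "load slot N" instead of a dictionary lookup, and that slot index is what inline caches can remember.

#include <string>
#include <unordered_map>
#include <vector>

struct Shape {
  std::unordered_map<std::string, int> slot;        // property name -> slot index
  std::unordered_map<std::string, Shape*> transitions;

  Shape* Extend(const std::string& name) {
    auto it = transitions.find(name);
    if (it != transitions.end()) return it->second;  // reused: other objects share it
    Shape* child = new Shape{slot, {}};              // leaked on purpose in this sketch
    child->slot[name] = static_cast<int>(slot.size());
    transitions[name] = child;
    return child;
  }
};

Shape* RootShape() {
  static Shape root;
  return &root;
}

struct JSObject {
  Shape* shape = RootShape();
  std::vector<double> slots;

  void Set(const std::string& name, double value) {
    auto it = shape->slot.find(name);
    if (it != shape->slot.end()) { slots[it->second] = value; return; }
    shape = shape->Extend(name);                     // adding a field -> new hidden class
    slots.push_back(value);
  }
};
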
In this post I'm going to talk about the baseline compiler. The code journey starts from the Script::Compile API, which, as the name suggests, compiles the script. V8 has the fabulous characteristic that it can be used standalone, which is the very thing that gave birth to Node.js. You can check out V8's embedder guide for related resources.

After several assertion checks, the main compilation flow reaches Compiler::CompileScript (compiler.cc:1427). One of the important things done here is checking the code cache (compiler.cc:1465):

    // First check per-isolate compilation cache.
    maybe_result = compilation_cache->LookupScript(
        source, script_name, line_offset, column_offset, resource_options,
        context, language_mode);
    if (maybe_result.is_null() && FLAG_serialize_toplevel &&
        compile_options == ScriptCompiler::kConsumeCodeCache &&
        !isolate->debug()->is_loaded()) {
      // Then check cached code provided by embedder.
      HistogramTimerScope timer(isolate->counters()->compile_deserialize());
      Handle<SharedFunctionInfo> result;
      if (CodeSerializer::Deserialize(isolate, *cached_data, source)
              .ToHandle(&result)) {
        // Promote to per-isolate compilation cache.
        compilation_cache->PutScript(source, context, language_mode, result);
        return result;
      }
      // Deserializer failed. Fall through to compile.
    }

If there is previously compiled code (or a snapshot, in V8's terms), it is deserialized from disk. Otherwise, the real compilation procedure is triggered, namely CompileToplevel (compiler.cc:1242). In this function, parsing information and compilation information are set up and passed to CompileBaselineCode (compiler.cc:817). Let's ignore Compiler::Analyze for now and jump to GenerateBaselineCode.
Here come some interesting things (compiler.cc:808):
static bool GenerateBaselineCode(CompilationInfo* info) {
  if (FLAG_ignition && UseIgnition(info)) {
    return interpreter::Interpreter::MakeBytecode(info);
  } else {
    return FullCodeGenerator::MakeCode(info);
  }
}

Earlier I said that one of V8's advantages is entering the compilation process directly, without interpreting first. So why is there code related to an interpreter? It turns out that V8 is also going to have an interpreter! Here is a brief introduction to Ignition, the new interpreter engine. But let's stay focused on the compilation part first.

So now we come to FullCodeGenerator::MakeCode (full-codegen/full-codegen.cc:26). There are a few important things here that we'll come back to; for now, let's focus on FullCodeGenerator::Generate.
Now here's the exciting part. FullCodeGenerator::Generate is an architecture-specific function, which resides in source files in separate folders like arm, x64, etc. Let's pick a few lines of code from the x64 version:
      if (locals_count >= 128) {
        Label ok;
        __ movp(rcx, rsp);
        __ subp(rcx, Immediate(locals_count * kPointerSize));
        __ CompareRoot(rcx, Heap::kRealStackLimitRootIndex);
        __ j(above_equal, &ok, Label::kNear);
        __ CallRuntime(Runtime::kThrowStackOverflow);
        __ bind(&ok);
      }

Pretty familiar, right? It seems that it directly generates the assembly code!
Normally this sort of approach appears in educational projects, like homework in college compiler courses; mature compiler frameworks usually go through many more stages to generate native code. Directly generating native assembly is thus one of the most important factors that speeds V8 up.
Another thing worth mentioning is the "__" prefix at the start of several lines above. It's a macro defined in full-codegen/full-codegen.cc:
#define __ ACCESS_MASM(masm())
Here MASM does not mean Microsoft's MASM, whose first 'M' stands for Microsoft; our MASM stands for macro assembler, which is responsible for emitting the assembly code.

Back to FullCodeGenerator::Generate: VisitStatements seems to be the code emitter for the js function body. Let's stop the code tracing here and step back to see the declaration of FullCodeGenerator (full-codegen/full-codegen.h):

class FullCodeGenerator: public AstVisitor
It turns out that the code generator itself is a huge AST visitor! And VisitStatements is the top-level visitor function that dispatches different parts of the code to visitors like FullCodeGenerator::VisitSwitchStatement, FullCodeGenerator::VisitForInStatement, or FullCodeGenerator::VisitAssignment, to name a few.
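
Stripped of all the V8 machinery, the pattern looks roughly like the sketch below; the node types and the emitted operations are invented for illustration.

// Minimal visitor-style code generator sketch: each Visit* method emits
// machine code for its AST node directly, with no IR in between.
struct IfStatement;
struct Assignment;

class AstVisitor {
 public:
  virtual ~AstVisitor() = default;
  virtual void VisitIfStatement(IfStatement* node) = 0;
  virtual void VisitAssignment(Assignment* node) = 0;
};

class BaselineCodeGen : public AstVisitor {
 public:
  void VisitIfStatement(IfStatement* node) override {
    // emit: evaluate the condition, compare, conditional jump, visit both branches
  }
  void VisitAssignment(Assignment* node) override {
    // emit: evaluate the RHS, store it into the variable's slot
  }
};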

Last but not least, some people say Ignition, the interpreter mentioned before, is not like the ones in SpiderMonkey or JavaScriptCore (Safari). I'm really curious about that; maybe I'll write another article about it : )