Saturday, June 24, 2017

Device Interfaces in Tensorflow

Tensorflow has taken the developer world by storm in recent years. It enables a developer, even a newbie to the machine learning world, to build a neural network in just a couple of minutes. It is also designed to run on a variety of devices and platforms, like CPUs, GPUs, and distributed systems. In this article we're going to focus on that device support, to see how Tensorflow interacts with different hardware to perform those heavy computations.

Scenario

We want to find out how to add new devices to Tensorflow so that kernels can be executed on top of them.

In an ideal world, every kernel implementation is independent of the underlying device. That is, one can execute a kernel on various devices with little or no modification to the kernel code. In that case, if we wanted to use an alternative device, say an FPGA, to power the computations, all we would need to do is implement some sort of device interface (maybe a C++ class?), rather than rewrite all of the kernels. Kernels would leverage the device interface to perform critical and more primitive (in comparison with the ML algorithm on top of them) calculations, for example matrix multiplications.

Take 1: The tensorflow::Device class

Let's start with this SO question. The answer indicates that if we want to add a new device, we need to implement the tensorflow::Device class and register it with some macro. In that class (tensorflow/core/common_runtime/device.h) there is a noteworthy virtual function that is most likely the place where the target-specific computation logic is implemented: Device::Compute.

  // Performs the actual compute function.
  //
  // Subclasses may override this function if they wish to perform
  // some initialization before each compute.
  virtual void Compute(OpKernel* op_kernel, OpKernelContext* context) {
    op_kernel->Compute(context);
  }

Unlike the Compute function in the OpKernel class, which performs the real computation, the Compute function here acts more like a wrapper around OpKernel::Compute. Just as the comment says, this function is responsible for setting up or initializing the device context before each kernel computation. Let's look at a concrete example: BaseGPUDevice (tensorflow/core/common_runtime/gpu/gpu_device.h).

The thing we care about most is the BaseGPUDevice::Compute function, and basically all of its primary computation logic is delegated to BaseGPUDevice::ComputeHelper, which is shown below (some verbose code has been trimmed).

void BaseGPUDevice::ComputeHelper(OpKernel* op_kernel,
                                  OpKernelContext* context) {
  GPUDeviceContext* gpu_device_context = device_contexts_[0];
  if (context->op_device_context() != nullptr) {
    gpu_device_context =
        static_cast<GPUDeviceContext*>(context->op_device_context());
  }
  gpu::Stream* stream = gpu_device_context->stream();
  //const auto stream_id = gpu_device_context->stream_id();

  const auto num_streams = streams_.size();
  if (num_streams > 1) {
    // If this op's device context is different from the other contexts,
    // we must wait on the stream.
    for (int i = 0; i < context->num_inputs(); ++i) {
      const GPUDeviceContext* idc =
          static_cast<GPUDeviceContext*>(context->input_device_context(i));

      if (idc->stream() != stream) stream->ThenWaitFor(idc->stream());
    }
  }
  gpu::cuda::ScopedActivateExecutorContext scoped_activation{stream->parent()};
  op_kernel->Compute(context);
  if (context->status().ok()) {
    if (sync_every_op_) {
      // Note: GPUUtil::Sync() only syncs the default stream.
      // We need to either sync the stream used by this op, or
      // all streams.  Given that this flag is typically used for
      // debugging it makes more sense to sync all GPU activity.
      context->SetStatus(GPUUtil::SyncAll(this));
    }
  }
}

Before calling the kernel's Compute function (the op_kernel->Compute(context) call near the end), this function waits for the input arguments to finish if their (CUDA) streams differ from the one this kernel will run on (the stream->ThenWaitFor(idc->stream()) call).

Now it's pretty clear that this interface is NOT the one we desire in the scenario mentioned previously. The target-specific logic is instead implemented in the kernels. For example, the bias_op kernel is separated into two classes: Bias and BiasGPU, which are located in bias_op.cc and bias_op_gpu.cu.cc under tensorflow/core/kernels, respectively. CUDA code is hard-coded into the implementation of BiasGPU::compute (lines 78 ~ 88):

  CudaLaunchConfig config = GetCudaLaunchConfig(total_count, d);
  if (data_format == FORMAT_NHWC) {
    BiasNHWCKernel<
        T><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
        config.virtual_thread_count, input, bias, output, bias_size);
  } else {
    BiasNCHWKernel<
        T><<<config.block_count, config.thread_per_block, 0, d.stream()>>>(
        config.virtual_thread_count, input, bias, output, bias_size,
        image_size);
  }

Then, at runtime, the framework picks the corresponding kernel variant depending on which device the op is placed on.
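
For reference, this is roughly how those variants get registered, shown as a simplified sketch of the registration code at the bottom of bias_op.cc (the real code registers many more data types, and the exact names may differ slightly):

// Simplified sketch of per-device kernel registration.  The framework
// selects the variant whose Device() constraint matches the device the
// op was placed on.
REGISTER_KERNEL_BUILDER(
    Name("BiasAdd").Device(DEVICE_CPU).TypeConstraint<float>("T"),
    BiasOp<CPUDevice, float>);

#if GOOGLE_CUDA
// Only compiled when CUDA support is enabled.
REGISTER_KERNEL_BUILDER(
    Name("BiasAdd").Device(DEVICE_GPU).TypeConstraint<float>("T"),
    BiasOp<GPUDevice, float>);
#endif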

Take 2: The Eigen::TensorDevice class

While inspecting tensorflow/core/kernels/bias_op_gpu.cu.cc, we found an interesting thing on line 29:
typedef Eigen::GpuDevice GPUDevice;
Though instances of this type are only used to access the CUDA stream in this file, we're still curious about the relationship between Eigen and Tensorflow here.

Eigen is a famous linear algebra library, and Tensorflow uses it heavily in its codebase, since ML algorithms involve many linear algebra calculations, for example, again, matrix multiplications. The question is: how does Tensorflow take advantage of Eigen?

Before moving forward, you should know that, for reasons related to the build system, part of the Eigen library code introduced here is not present in the Tensorflow codebase; it's stored here and here. I recommend learning where these two URLs can be found from note [1], since they may vary from version to version.

Let's look at QuantizeAndDequantizeOneScaleImpl::Compute (tensorflow/core/kernels/quantize_and_dequantize_op.h). First, notice that this class is not divided into separate CPU and GPU variants. The Device template type, which eventually resolves to types based on Eigen::TensorDevice, and the argument Device& d play the main roles here. Here is an example of how the latter is used (lines 71 ~ 76):

        out.device(d) =
            ((input.cwiseMin(max_range).cwiseMax(min_range) - min_range) *
                 scale +
             T(0.5)).floor() *
                inverse_scale +
            min_range;

The out variable above is a Tensor, and the RHS expression also resolves to a tensor expression. Nevertheless, instead of assigning the result of the RHS to out directly, evaluation of the RHS expression is postponed, and the entire expression is handed to Eigen::TensorDevice::operator=. Implementations of Eigen::TensorDevice, for example Eigen::GpuDevice [2], are responsible for executing the RHS expression passed in.
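
To make that mechanism more concrete, here is a minimal, self-contained sketch of the same lazy-assignment pattern. Every name in it is made up for illustration; Eigen's real TensorDevice evaluates full expression trees, and the real devices (thread pools, CUDA streams) are far more sophisticated:

#include <cstddef>
#include <vector>

// A minimal, made-up sketch of the lazy-assignment pattern behind
// Eigen::TensorDevice.  None of these names are Eigen's.
template <typename Device>
struct TensorDeviceProxy;

struct SimpleTensor {
  std::vector<float> data;
  explicit SimpleTensor(std::size_t n) : data(n) {}

  // Like Eigen's Tensor::device(d): return a proxy that remembers which
  // device should evaluate whatever expression gets assigned to it.
  template <typename Device>
  TensorDeviceProxy<Device> device(Device& d);
};

template <typename Device>
struct TensorDeviceProxy {
  Device& dev;
  SimpleTensor& target;

  // operator= is where the (still unevaluated) RHS expression is handed
  // to the device, which decides *how* to execute it.
  template <typename Expr>
  void operator=(const Expr& expr) {
    dev.evaluate(expr, target);
  }
};

template <typename Device>
TensorDeviceProxy<Device> SimpleTensor::device(Device& d) {
  return TensorDeviceProxy<Device>{d, *this};
}

// A trivial "device" that evaluates the expression element by element.
// A GPU-backed device would instead launch a kernel on its stream.
struct SequentialDevice {
  template <typename Expr>
  void evaluate(const Expr& expr, SimpleTensor& out) {
    for (std::size_t i = 0; i < out.data.size(); ++i)
      out.data[i] = expr(i);
  }
};

int main() {
  SimpleTensor out(8);
  SequentialDevice d;
  // Analogous to "out.device(d) = <expression>;" in the Tensorflow kernel.
  out.device(d) = [](std::size_t i) { return 0.5f * static_cast<float>(i) + 1.0f; };
  return 0;
}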

Summary

Now we know there are two ways to enable kernel execution on new devices:
  1. Modify the kernel source with device-specific code, or add another variant of that kernel (e.g. Bias and BiasGPU).
  2. Implement another Eigen::TensorDevice (e.g. Eigen::GpuDevice).
These two methods are complementary; adopt one or the other depending on the properties of the device and the kernels. For example, if the operations are strongly related to linear algebra, the second method is more adequate; otherwise, the first one might be more expressive, although it might require lots of kernel modifications (not every kernel, I think, since there are kernels that just can't be executed on devices other than the CPU).

[1]: Tensorflow chooses to use part of the Eigen library without any modification, so the build system fetches the archived library files from the official Eigen repository at the first build. This behavior is defined in tensorflow/workspace.bzl, line 148.

[2]: Eigen/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceCuda.h

Friday, April 28, 2017

[Quick Tutorial] How to USE LLVM with CMake

Perhaps I have been hanging around the LLVM developer world for too long, because I almost forgot that LLVM is a framework consisting of libraries - libraries for application developers to USE.

Recently I found that there are some developers who use LLVM (and related) tools like clang and llc, have some experience surfing the LLVM codebase, and have even played around with some small hacks - but they're not sure how to USE LLVM as normal libraries during application development.

Of course, you can just use a command like the one below to build your application with LLVM.
g++ -I<LLVM Include Path> -L<LLVM Library Path> foo.cc -o foo -lLLVMSomeLib
But well, it's the 21st century; we have CMake to make the C++ build process neater and easier. And since LLVM migrated to CMake as its default and only build system in version 3.7, it provides some great CMake modules that help application developers integrate LLVM into their projects faster and error-free. Here are two examples, for small and large projects.

Small Projects

Actually, I prefer to say "small executable" rather than small project, because just like what I described, this example is suitable for the scenario where you just need to embed LLVM in a single executable. Maybe a scratch program to illustrate how certain components work, or an executable that uses one of the utility libraries within LLVM, the command line library for example.

First, in your CMakeLists.txt, define the LLVM_DIR variable with the path to the cmake module folder within the LLVM install directory (yeah, I know the variable name is pretty confusing). Usually the path is <LLVM Install Folder>/lib/cmake/llvm. After that, find the LLVM package:
find_package(LLVM REQUIRED CONFIG) 
Then several variables are defined for us to use, including the include folder path:
include_directories(${LLVM_INCLUDE_DIRS}) 
And some compiler definitions:
add_definitions(${LLVM_DEFINITIONS})
What about libraries? Here comes the magic: a convenient cmake function is provided to take care of the burden of library names and paths:
llvm_map_components_to_libnames(_LLVM_LIBS support core irreader) 
The _LLVM_LIBS variable is used as the output variable, so it stores the library names for the given components - Support, Core and IRReader in this case. We just need to give the component names in lowercase, and llvm_map_components_to_libnames translates them into LLVMSupport, LLVMCore, etc.

Finally, we link those libraries to our target:
add_executable(fooProgram foo.cpp bar.cpp)
target_link_libraries(fooProgram ${_LLVM_LIBS})
Here is the full code list:
cmake_minimum_required(VERSION 3.6)
project(LLVMExperiments CXX)

set(CMAKE_CXX_STANDARD 11)

set(LLVM_DIR /usr/local/opt/llvm-clang/4.0/lib/cmake/llvm)
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Use LLVM ${LLVM_PACKAGE_VERSION}")
include_directories(${LLVM_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})

# LLVM libraries are statically linked by default,
# so pick only the component libraries you need,
# otherwise your final executable may become large in size.
llvm_map_components_to_libnames(_LLVM_LIBS support core analysis irreader)

set(SOURCE_FILES
        main.cpp)
add_executable(LLVMExperiments ${SOURCE_FILES})
target_link_libraries(LLVMExperiments ${_LLVM_LIBS})
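
For completeness, here is a minimal sketch of what main.cpp could look like in such a project. It only touches the Support (command line) and Core components, so it links fine against the libraries requested above; the option name and the default module name are just illustrative:

// main.cpp - a minimal sketch of a program that only uses the Support
// (command line) and Core components requested above.
#include <string>

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/raw_ostream.h"

static llvm::cl::opt<std::string>
    ModuleName("module-name",
               llvm::cl::desc("Name of the (empty) module to emit"),
               llvm::cl::init("my_module"));

int main(int argc, char **argv) {
  // Parse the command line with LLVM's command line library (Support).
  llvm::cl::ParseCommandLineOptions(argc, argv, "LLVM + CMake example\n");

  // Build an empty IR module and print it (Core).
  llvm::LLVMContext Context;
  llvm::Module M(ModuleName.getValue(), Context);
  M.print(llvm::outs(), /*AAW=*/nullptr);
  return 0;
}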

Large Projects / Loadable Components

As the section title indicates, this method is the only way I've figured out to build LLVM loadable components, especially LLVM passes, with CMake. It is also strongly recommended if you want to develop tools or components that may be merged into the upstream LLVM codebase in the future.

The first thing you may need to change is the project's folder structure. We're going to create a subdirectory to accommodate the main code of your tool, and create another CMakeLists.txt within it. So there will be two cmake build scripts: <Project Folder>/CMakeLists.txt and <Project Folder>/<Tool Folder>/CMakeLists.txt.

Let's look at the first build script, the one in the upper directory. In the previous section, "Small Projects", some of you may have bumped into the problem that the LLVM libraries you use aren't built with RTTI, so you need to add the "-fno-rtti" flag to the compiler options in your cmake build script. But you want to make the cmake build script more general, so you add the following code:
if (NOT ${LLVM_ENABLE_RTTI})
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-rtti")
endif()

Well, that's the RTTI case... what about other similar flags? It's pretty cumbersome to add a bunch of if-else statements for them. Here comes the solution:
list(APPEND CMAKE_MODULE_PATH ${LLVM_CMAKE_DIR})

set(LLVM_RUNTIME_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/bin)
set(LLVM_LIBRARY_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/lib)

include(HandleLLVMOptions)

Add the lines above after the LLVM package-finding statement. The LLVM_RUNTIME_OUTPUT_INTDIR and LLVM_LIBRARY_OUTPUT_INTDIR variables are required by the HandleLLVMOptions module, but feel free to set them to whatever paths you want the resulting binaries and libraries to go.

Finally, in our upper-directory build script, add two additional statements:
include(AddLLVM)
add_subdirectory(MyToolSubDirectory)

MyToolSubDirectory is the subdirectory name for your tool, library or loadable component.
Here is the full code listing for the build script in the upper directory:

cmake_minimum_required(VERSION 3.7)
project(LLVMExperiments2 CXX)

set(CMAKE_CXX_STANDARD 11)

set(LLVM_DIR /usr/local/opt/llvm-clang/current/lib/cmake/llvm)
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Using LLVM version ${LLVM_PACKAGE_VERSION}")

list(APPEND CMAKE_MODULE_PATH ${LLVM_CMAKE_DIR})

set(LLVM_RUNTIME_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/bin)
set(LLVM_LIBRARY_OUTPUT_INTDIR ${CMAKE_BINARY_DIR}/${CMAKE_CFG_INTDIR}/lib)

include(HandleLLVMOptions)
include(AddLLVM)

add_definitions(${LLVM_DEFINITIONS})
include_directories(${LLVM_INCLUDE_DIRS})

add_subdirectory(MyToolSubDirectory)

Let's move on to the subdirectory. This subdirectory is where your code resides. The cmake build script here, however, is pretty simple. If you're building an executable, these are the only statements you need to add:

set(LLVM_LINK_COMPONENTS
        Support
        Core
        IRReader
        )

add_llvm_executable(MyExecutable
        main.cpp
        foo.cpp
        )

MyExecutable is the name of your executable, and defining the LLVM_LINK_COMPONENTS variable obviously adds the component libraries to link against your executable.

For loadable components, the build script is even simpler:

add_llvm_loadable_module(MyLLVMPass
        pass.cpp
        foo.cpp
        )

That's it!
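
In case you're wondering what goes into pass.cpp, here is a minimal sketch modeled after the classic Hello pass from the official "Writing an LLVM Pass" documentation. It uses the legacy pass manager, which is what opt loads plugins into as of LLVM 4.0, and the "hello-pass" name is arbitrary:

#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
// A function pass that just prints the name of every function it visits.
struct HelloPass : public FunctionPass {
  static char ID;
  HelloPass() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    errs() << "Visiting function: " << F.getName() << "\n";
    // Return false because we did not modify the IR.
    return false;
  }
};
} // end anonymous namespace

char HelloPass::ID = 0;

// Register the pass so it can be loaded and run with opt, e.g.:
//   opt -load <path to the built module> -hello-pass -disable-output input.ll
static RegisterPass<HelloPass> X("hello-pass", "A minimal example pass");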

Here is a simple LLVM pass example I wrote for your reference:
https://github.com/mshockwave/LLVM-Sample-Pass