Monday, January 11, 2016

[Quick Note] V8 Javascript Engine's First Stage Compiler

V8 is probably, in my opinion, the fastest JavaScript runtime at the time of writing (no offense, SpiderMonkey; benchmarks differ from one another, but V8 gets the best grades on average). There are some well-known properties that make it run so fast:

  • No interpreter; a baseline compiler is used instead. (Though this characteristic has just been subverted on the bleeding edge of the development tree, as we'll see below.)
  • No intermediate representation (IR) in the baseline compiler. The optimizing (second-stage) compiler, however, does use its own IR.
  • A method-based JIT strategy. This is fairly controversial, but as we'll see later, using the function as the compilation scope makes a lot of work easier compared with a trace-based strategy, which spends a lot of effort determining trace boundaries.
  • Hidden classes. In short, when code modifies an object's layout, by adding a new field for example, V8 creates a new layout (class) instead of mutating the original object layout (e.g. using a linked list to "chain" object fields, as many early-era JavaScript runtime implementations did). A toy sketch of this idea follows the list.
  • Code patching and inline caches. These two features are less novel, since they have already been used in many dynamically typed runtimes for over twenty years.
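As a rough mental model of the hidden-class idea, here is a toy sketch in C++ (my own illustration, not V8's implementation): each layout is an immutable descriptor, and adding a property follows a transition edge to a new descriptor instead of mutating objects in place.

    #include <map>
    #include <string>

    // Toy model: a "hidden class" maps property names to slot offsets and
    // records transition edges to the class reached after adding a property.
    struct HiddenClass {
      std::map<std::string, int> offsets;               // property -> slot
      std::map<std::string, HiddenClass*> transitions;  // "add prop" edges
    };

    // Adding a property never mutates the current class; it follows (or
    // creates) a transition to a new class, so objects built the same way
    // end up sharing the same class. Classes are leaked for brevity here,
    // much as real hidden classes live as long as the heap needs them.
    HiddenClass* AddProperty(HiddenClass* klass, const std::string& name) {
      auto it = klass->transitions.find(name);
      if (it != klass->transitions.end()) return it->second;
      HiddenClass* next = new HiddenClass(*klass);  // copy the existing layout
      next->transitions.clear();
      next->offsets[name] = static_cast<int>(next->offsets.size());
      klass->transitions[name] = next;
      return next;
    }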
I'm going to talk about the baseline compiler. The code journey starts from the Script::Compile API, which, as the name suggests, compiles the script. V8 has a fabulous characteristic: it can be used standalone, which is the very key that bore Node.js. You can check out V8's embedder guide for related resources.
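As a quick illustration, here is a minimal embedder-style sketch of driving that entry point, written against roughly the V8 API of this era (the CompileAndRun helper and the bare-bones setup are my own; a real embedder also has to create the platform and the Isolate first):

    #include "v8.h"

    // Compile and run a script through Script::Compile, the same public entry
    // point whose internals are traced below. Error handling is minimal.
    v8::MaybeLocal<v8::Value> CompileAndRun(v8::Isolate* isolate,
                                            const char* js_source) {
      v8::HandleScope handle_scope(isolate);
      v8::Local<v8::Context> context = v8::Context::New(isolate);
      v8::Context::Scope context_scope(context);

      v8::Local<v8::String> source =
          v8::String::NewFromUtf8(isolate, js_source,
                                  v8::NewStringType::kNormal).ToLocalChecked();

      v8::Local<v8::Script> script;
      if (!v8::Script::Compile(context, source).ToLocal(&script))
        return v8::MaybeLocal<v8::Value>();  // compilation failed

      return script->Run(context);
    }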

After several assertion checks, the main compilation flow comes to Compiler::CompileScript (compiler.cc:1427). One of the important things done here is checking the code cache (compiler.cc:1465):

    // First check per-isolate compilation cache.
    maybe_result = compilation_cache->LookupScript(
        source, script_name, line_offset, column_offset, resource_options,
        context, language_mode);
    if (maybe_result.is_null() && FLAG_serialize_toplevel &&
        compile_options == ScriptCompiler::kConsumeCodeCache &&
        !isolate->debug()->is_loaded()) {
      // Then check cached code provided by embedder.
      HistogramTimerScope timer(isolate->counters()->compile_deserialize());
      Handle<SharedFunctionInfo> result;
      if (CodeSerializer::Deserialize(isolate, *cached_data, source)
              .ToHandle(&result)) {
        // Promote to per-isolate compilation cache.
        compilation_cache->PutScript(source, context, language_mode, result);
        return result;
      }
      // Deserializer failed. Fall through to compile.
    }

If there is previously compiled code (or a snapshot, in V8's terms), it is deserialized from disk. Otherwise, the real compilation procedure is triggered, namely CompileToplevel (compiler.cc:1242). In this function, parsing information and compilation information are set up and passed to CompileBaselineCode (compiler.cc:817). Let's ignore Compiler::Analyze for now and jump to GenerateBaselineCode.
Here comes something interesting (compiler.cc:808):
static bool GenerateBaselineCode(CompilationInfo* info) {
  if (FLAG_ignition && UseIgnition(info)) {
    return interpreter::Interpreter::MakeBytecode(info);
  } else {
    return FullCodeGenerator::MakeCode(info);
  }
}

Earlier I said that one of V8's advantages is entering the compilation process directly without interpreting first. So why is there code related to an interpreter? It turns out that V8 is also going to get an interpreter! Here is a brief introduction to Ignition, the new interpreter engine. But let's stay focused on the compilation part first.

So now we come to FullCodeGenerator::MakeCode (full-codegen/full-codegen.cc:26). There are a few important things here that we'll come back to; for now, let's focus on FullCodeGenerator::Generate.
Now here's the exciting part. FullCodeGenerator::Generate is an architecture-specific function, which resides in source files under separate folders like arm, x64, etc. Let's pick a few lines of code from the x64 version:
      // For functions with many locals, Generate emits a stack-limit check up
      // front: if reserving locals_count slots would cross the real stack
      // limit, the generated code calls into the runtime to throw a stack
      // overflow instead of silently overrunning the stack.
      if (locals_count >= 128) {
        Label ok;
        __ movp(rcx, rsp);
        __ subp(rcx, Immediate(locals_count * kPointerSize));
        __ CompareRoot(rcx, Heap::kRealStackLimitRootIndex);
        __ j(above_equal, &ok, Label::kNear);
        __ CallRuntime(Runtime::kThrowStackOverflow);
        __ bind(&ok);
      }

Pretty familiar, right? It directly generates the assembly code!
Normally this sort of approach appears in educational projects, like homework in college compiler courses; mature compiler frameworks usually go through many more steps to generate native code. Generating native assembly directly is thus one of the most important factors that speeds up V8.
Another thing worth mentioning is the "__" prefix on several of the lines above. It's a macro defined in full-codegen/full-codegen.cc:
#define __ ACCESS_MASM(masm())
Here MASM does not mean Microsoft's MASM, where the first 'M' stands for Microsoft. Our MASM stands for macro assembler, which is responsible for generating the assembly code.
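For reference, ACCESS_MASM itself is just a thin forwarding macro; I'm quoting its definition from memory here, so treat the snippet as a sketch:

    // ACCESS_MASM lives in the per-architecture macro-assembler header
    // (e.g. x64/macro-assembler-x64.h) and simply forwards to the assembler:
    #define ACCESS_MASM(masm) masm->

    // So a generator line such as
    //   __ movp(rcx, rsp);
    // expands to
    //   masm()->movp(rcx, rsp);
    // i.e. a call on the architecture-specific MacroAssembler that appends
    // the encoded instruction to the code buffer for this function.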

Back to FullCodeGenerator::Generate. VisitStatements seems to be the code emitter for the JS function body. Let's stop the code tracing here and step back to look at the declaration of FullCodeGenerator (full-codegen/full-codegen.h):

class FullCodeGenerator: public AstVisitor
It turns out that the code generator itself is a huge AST visitor! And VisitStatements is the top-level visitor function that dispatches different parts of the code to visitors like FullCodeGenerator::VisitSwitchStatement, FullCodeGenerator::VisitForInStatement, or FullCodeGenerator::VisitAssignment, to name a few.
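To make the shape of that design concrete, here is a heavily simplified sketch of the visitor structure; the class and method names mirror V8's, but the bodies and the trimmed-down interface are illustrative, not V8's actual code:

    // One Visit* hook per AST node kind; the code generator overrides each
    // hook and emits machine code for that node through the macro assembler.
    class SwitchStatement;
    class Assignment;

    class AstVisitor {
     public:
      virtual ~AstVisitor() = default;
      virtual void VisitSwitchStatement(SwitchStatement* stmt) = 0;
      virtual void VisitAssignment(Assignment* expr) = 0;
      // ... dozens more Visit* methods in the real class
    };

    class FullCodeGenerator : public AstVisitor {
     public:
      void VisitSwitchStatement(SwitchStatement* stmt) override {
        // __ emit the compare-and-branch sequence for each case label ...
      }
      void VisitAssignment(Assignment* expr) override {
        // __ emit the store into the assignment target's slot ...
      }
    };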

Last but not least, some people say Ignition, the interpreter mentioned before, is not like the ones in SpiderMonkey or JavaScriptCore (Safari). I'm really curious about that; maybe I'll write another article about it :)


Thursday, January 7, 2016

[Quick Note] Type instances in LLVM IR

The story starts with an LLVM module pass I was writing a few days ago.
I was trying to add a null pointer initializer to a global variable. The ConstantPointerNull class needs a PointerType as a parameter, but I accidentally assigned the wrong address space (we're targeting the CL language) while creating the PointerType instance. Then I found the error message pretty interesting:
opt: /my/llvm/path/lib/IR/Globals.cpp:209: void llvm::GlobalVariable::setInitializer(llvm::Constant*): Assertion `InitVal->getType() == getType()->getElementType() && "Initializer type must match GlobalVariable type"' failed.
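For context, here is roughly what the pass was doing; this is a minimal sketch, and the element type and address spaces are illustrative rather than our real ones:

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/GlobalVariable.h"

    using namespace llvm;

    void addNullInitializer(GlobalVariable &GV) {
      // Suppose GV's type is a pointer in a non-zero address space (as is
      // common for OpenCL globals) ...
      // ... but the PointerType below is built with address space 0.
      PointerType *WrongTy =
          PointerType::get(Type::getInt8Ty(GV.getContext()),
                           /*AddressSpace=*/0);
      // Pointer types in different address spaces are *different* Type
      // instances, so the pointer-equality assertion in setInitializer fires.
      GV.setInitializer(ConstantPointerNull::get(WrongTy));
    }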
The question is: why compare two pointers? When we talk about representing a Type, the straightforward way would be a plain Type (or derived class) instance. So why does it compare two pointers instead of instances?

The first possibility that came to my mind was operator overloading of "==" for Type. But sadly, there is no such overload for Type or its derived classes. It turns out that LLVM "caches" all of the Type instances in the LLVMContext class (actually LLVMContextImpl, which is not exported). One piece of evidence is the static factory methods for creating Type instances, for example getInt8PtrTy:
static PointerType * getInt8PtrTy (LLVMContext &C, unsigned AS=0)
We need to pass an LLVMContext instance as a parameter. So when a Type instance is "created", it is actually grabbed from the LLVMContext instead of being constructed anew each time. For example, the FunctionType factory method (/lib/IR/Type.cpp:360~379):
 
// FunctionType::get - The factory function for the FunctionType class.
FunctionType *FunctionType::get(Type *ReturnType,
                                ArrayRef<Type*> Params, bool isVarArg) {
  LLVMContextImpl *pImpl = ReturnType->getContext().pImpl;
  FunctionTypeKeyInfo::KeyTy Key(ReturnType, Params, isVarArg);
  auto I = pImpl->FunctionTypes.find_as(Key);
  FunctionType *FT;

  if (I == pImpl->FunctionTypes.end()) {
    FT = (FunctionType*) pImpl->TypeAllocator.
      Allocate(sizeof(FunctionType) + sizeof(Type*) * (Params.size() + 1),
               AlignOf<FunctionType>::Alignment);
    new (FT) FunctionType(ReturnType, Params, isVarArg);
    pImpl->FunctionTypes.insert(FT);
  } else {
    FT = *I;
  }

  return FT;
}
It first gets the LLVMContextImpl instance, pImpl, then retrieves the storage for FunctionType instances, FunctionTypes, whose declaration is here (/lib/IR/LLVMContextImpl.h:1003~1004):
 
typedef DenseSet<FunctionType *, FunctionTypeKeyInfo> FunctionTypeSet;
FunctionTypeSet FunctionTypes;
I'm not entirely sure why LLVM wants to do this. I guess the intent is to reduce the memory footprint given the heavy use of Type in IR operations, and uniquing the instances also means that checking whether two types are equal reduces to a cheap pointer comparison, which is exactly what that assertion does.
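A small demonstration of that uniquing (a standalone sketch, assuming the usual LLVM headers are available): asking for the "same" type twice from one context hands back the same instance.

    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/LLVMContext.h"
    #include <cassert>

    using namespace llvm;

    int main() {
      LLVMContext Ctx;
      Type *I8 = Type::getInt8Ty(Ctx);
      PointerType *A = PointerType::get(I8, /*AddressSpace=*/0);
      PointerType *B = PointerType::get(I8, /*AddressSpace=*/0);
      PointerType *C = PointerType::get(I8, /*AddressSpace=*/1);
      assert(A == B);  // same context, same parameters: same cached instance
      assert(A != C);  // different address space: different instance
      return 0;
    }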


Saturday, January 2, 2016

[Quick Note] OrcJIT in LLVM

JIT (just-in-time) compilation has become more and more popular in recent years due to the rising demand for better performance from dynamically typed languages, JavaScript and Python to name a few.

However, LLVM, which was developed mainly for statically typed languages like C/C++, also joined the JIT battlefield a few years ago. Applying LLVM to dynamic compilation has some drawbacks: for example, long-running optimizations increase latency, and its type system was originally designed for static rather than dynamic typing. That hasn't stopped the community from working on this subsystem, and more and more enhancements keep coming, such as native support for patchpoints and stack maps, which debuted in version 3.8 as experimental features.

There are mainly two kinds of compilation strategies: method-based and trace-based. The difference between them lies in the compilation scope they adopt. As the name suggests, the method-based approach takes functions or methods as the minimum compilation units; Google's V8 JavaScript engine is a well-known example. Trace-based JITs use traces, which we can think of as ranges of control flow, as the compilation units; Firefox's TraceMonkey and LuaJIT use this kind of approach. Which of them is better might require several academic papers to explain, and the debate still goes on today.

Back to LLVM: the main JIT implementation is called MCJIT. It adopted the popular MC framework within LLVM, which is kind of LLVM's own assembler, as its backend to gain more flexibility. MCJIT uses neither of the strategies mentioned above; instead, it uses LLVM's Module as the compilation unit. You can use the llvm::EngineBuilder class to build a llvm::ExecutionEngine, which is the main JIT interface, then either retrieve compiled symbols or run compiled functions directly through its rich API, like getPointerToFunction or runFunction.
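A minimal sketch of that flow (assuming LLVM headers from roughly this era, and an IR module that already defines an int(int) function named "foo"; the runFoo helper is mine, not part of LLVM):

    #include "llvm/ExecutionEngine/ExecutionEngine.h"
    #include "llvm/ExecutionEngine/MCJIT.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/TargetSelect.h"
    #include <memory>
    #include <string>

    using namespace llvm;

    int runFoo(std::unique_ptr<Module> M, int Arg) {
      InitializeNativeTarget();
      InitializeNativeTargetAsmPrinter();

      std::string Err;
      ExecutionEngine *EE = EngineBuilder(std::move(M))
                                .setErrorStr(&Err)
                                .setEngineKind(EngineKind::JIT)
                                .create();
      if (!EE) return -1;  // creation failed; Err holds the reason

      // The whole module gets compiled; we then look up one symbol from it.
      auto Foo = (int (*)(int))EE->getFunctionAddress("foo");
      return Foo(Arg);
    }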

There are several drawbacks to using the module as the compilation unit. In short, module scope is too big in comparison with a method or a trace, so compilation may take too long and increase latency. That's what OrcJIT tries to fix.

Orc is short for on-request compiling. As the name suggests, it lazily defers the compilation of each function within the module until that function is used. OrcJIT is also built from scratch rather than on top of MCJIT. Even more, it introduces a concept called a Layer. A Layer is like an LLVM IR pass, but it handles a step of the compilation procedure rather than an IR transformation, which provides flexibility for building your own JIT compilation flow. Nevertheless, OrcJIT still lacks a central layer manager similar to llvm::PassManager, and even a common layer interface! In other words, we can only construct our own compilation "stack" by passing a base layer instance as the first argument of another layer's constructor, and none of the layer classes inherit from a common parent class as an interface. For example, the lli program constructs its compilation stack as below:
OrcLazyJIT(std::unique_ptr<TargetMachine> TM,
             std::unique_ptr<CompileCallbackMgr> CCMgr,
             IndirectStubsManagerBuilder IndirectStubsMgrBuilder,
             bool InlineStubs)
             : TM(std::move(TM)), DL(this->TM->createDataLayout()),
             CCMgr(std::move(CCMgr)),
             ObjectLayer(),
             CompileLayer(ObjectLayer, orc::SimpleCompiler(*this->TM)),
             IRDumpLayer(CompileLayer, createDebugDumper()),
             CODLayer(IRDumpLayer, extractSingleFunction, *this->CCMgr,
                 std::move(IndirectStubsMgrBuilder), InlineStubs),
             CXXRuntimeOverrides(
                 [this](const std::string &S) { return mangle(S); }) {}

The compilation flow runs in this order: ObjectLayer, CompileLayer, IRDumpLayer, CODLayer, where CODLayer is an instance of CompileOnDemandLayer and is treated as the highest-level class in OrcJIT. We use CODLayer to add modules via addModuleSet and invoke findSymbol to retrieve compiled function symbols. OrcJIT defers the compilation of each function until we call findSymbol.

The pro of OrcJIT is the smaller compilation region, which brings the potential to add a profiler for detecting hot methods, something used in nearly every dynamic compilation framework nowadays. On the other hand, OrcJIT does not use the llvm::ExecutionEngine interface. Although it provides llvm::orc::OrcMCJITReplacement as a wrapper, the main author tends to prefer its own interface, which has raised lots of questions and arguments on the mailing list since the debut of OrcJIT. What's more, as mentioned previously, OrcJIT's layers still have no common interface or layer manager.

I was pretty excited when I saw OrcJIT, because a smaller compilation unit is exactly where trace-based JITs overwhelm method-based ones. It looks like LLVM has gained another piece of armor on the battlefield of dynamic compilation.