Contact
Search
Essays

Entries in Better Know a Virtual Machine (26)

Monday
Aug092010

Sea of Nodes Compilation Approach

Cliff Click's thesis entitled Combining Analysis, Combining Optimizations, represents a program not as a control flow graph, but as a data flow analysis graph. I've been reading it, and have written a summary. If you're interested, download it here. This compilation method is used in the Java Hotspot Server Compiler.

 

Thanks to Christian Wimmer for proofreading and feedback.

Friday
Mar122010

Have tracing JIT compilers won - notes

There is a really awesome discussion on whether or not trace compilers have "won" at Lambda the Ultimate. It's pretty dense so here's some background information and synopsis to help follow.

All the comments center on how to pair trace compilation with other execution techniques (e.g. method at a time compilation, interpretation). There's a bit of tracing background required to fully understand everything. The basic problem with tracing is that it fails when you have really branchy code. If you try to "stay on a trace" (stay in JIT compiled code instead of switching to a different execution technique), you'll lose overall because you'll be spending time compiling code instead of making forward progress on the actual application. You'll also explode your code cache and there will be a ton of useless traces laying in memory.

Thomas Lord's analysis is spot on. His basic premise is that tracing is great in specific cases. However, a VM needs multiple compilation strategies. Also, it's really difficult to create the "best" code. There are lots of ways to profile the running code and trying to find out some algorithm that can detect what the most optimal code is, but it's hard. I agree with him when he says that it will be very difficult to find the most optimal code. There is a pretty cool paper at CGO 2010 about solving this problem.

Ben Titzer and Andreas Gal discuss traditional compilation vs trace compilation as a whole, and in what situations a trace compiler will be relevant. Brendan Eich backs up the idea that trace compilation is a viable compilation strategy. Android uses trace trees which unlike dynamo, connects multiple traces together at a singe point. The bottom line is when traces work, they work really well. 

That's when Mike Pall, the creator of LuaJIT, chimes in. Generally, LuaJIT is considered to be one of the fastest, if not the fastest, dynamic language VM. The links in that specific post are very interesting and relevant. LuaJIT has a really fast interpreter with a trace compiler on top. The resulting subthread is the most interesting. 

Peter proposes tracing native code. Mike says that instead of tracing native code, you should just make a really fast interpreter and add a tracing JIT on top. Mike thinks that having three execution engines (interpreter, method JIT, and a tracing JIT) is too complicated. Brendan agrees and says all you really need is a generic method JIT and a tracing JIT on top. Also, PICs are polymorphic inline caches and are pretty much used by everyone. At this point, the subthread moves onto having a method + tracing JIT and how can you trace the generic method JIT code. In my opinion, a generic method JIT + tracing JIT on top is the way to go. LuaJIT takes the approach of the fast interpreter + trace compiler route

Another interesting note Brendan brings up is to get a simpler language. Brendan's main argument about LuaJIT is that it traces a lot less code than TraceMonkey does because Lua is simpler. This is where a fundamental issue with tracing comes up. What kind of heuristics do you need to decide when you should trace something versus staying in generic slower code? If you keep jumping out of traces you lose. Brendan seems to be taking the position that you should try to trace more code. 

Mike Pall thinks it's better to have a really fast baseline and trace when you can guarantee a performance improvement. He thinks trace trees are too complicated without any real performance gain. LuaJIT takes a bunch of traces and sticks them together wherever they appear. Trace trees must be linked at the same program location. Mike argues that trace trees only win if you recompile all the traces together to optimize them together. However, in the context of the web, the delay of recompiling everything is too severe and so nobody does it. I really like his idea of cross-trace register coalescing - putting values into registers that are known across traces. This makes switching between traces a lot faster and solves a real fundamental issue UC Irvine had with traces. Trace nesting is having a trace tree as part of another trace tree (inner loops). Brendan seems to agree that sticking traces together whenever they connect is the way to go, but trace trees are useful somtimes. This is also where Brendan comments about the complexity of JavaScript compared to Lua comes in

As a random tangent, Lius Gonzalez talks about PyPy. PyPy tries to be a VM for all languages. It has an interpreter that executes a target application programming language. They then trace the internal PyPy interpreter loop to optimize the application programming language. This is a lot like Scott Peterson's (from Adobe) work. Alexander Yermolovich interned with Scott in 2008 and wrote a paper on it (Optimization of dynamic languages using hierarchical layering of virtual machines). They took the Lua VM (not LuaJIT I think), ran it through Alchemy, and then ran it on top of Tamarin-Tracing. 

Mike Pall responds saying that the layers add up. The most interesting is point #3, where PyPy loses all the high level information, limiting their ability to optimize the application code. I think this is generally true. ABC and LIR in NanoJIT suffer from the same problem. 

Some of the small notes come with tracing through the browser. I think Andreas Gal was talking about this a while ago. Large swaths of Firefox's UI is written in JavaScript so it's actually a very important issue for them. I can't quite follow the whole Membranes discussion. 

Another thought is whether or not it's possible to write a metacircular tracing JIT. Michael Bebenita is doing it with Maxine. He's hitting the same problem as everyone else which is what you should trace instead of doing another compilation technique. The basic premise is that when you have a method JIT, you really have to restrict your traces because you're going to win a lot less often than you would if you only have an interpreter. In fact if you trace too much, you're going to lose really fast. 

Finally, a few random notes that date back to Tamarin-Tracing. When I first interned at Adobe, Edwin Smith actually tasked me to trace native code. At the time I didn't know it, but I implemented a call threaded interpreter. What we found out was that switching from one machine code frame to another is really expensive. I'm sure we could've solved the problem, but Tamarin-Tracing was already canceled. At the moment, my mind is blown realizing that a lot of the problems with Tamarin-Tracing are coming back up on this thread. 

The bottom line is that everyone agrees you need some other kind of execution mechanism and a trace compiler on top. The unsettled questions are:

1) Should you pair a trace compiler with a really fast interpreter or a generic method at a time JIT. 
2) What and how much should you trace. 

 

Whew, I hope that helped. There are lots of interesting tangents, but I tried to focus on the tracing aspects of the post. Feel free to ask more questions if something is unclear. 

 

Mason
* I'll post updates as the thread progresses.
Tuesday
Dec082009

Constructing SSA the Easy Way

My good colleague of mine, Michael Bebenita, has written a short paper on how to construct Static Single Assignment, a common way to represent programs in optimizing compilers. Check it out here.

Abstract:

Static Single Assignment (SSA) form has become ubiquitous in compilers as an intermediate program representation. The use of SSA simplifies many compiler optimizations and makes the life of compiler writers easier. Many techniques exist to convert programs into SSA form, many of which I find unnecessarily difficult. In this paper I will present a simple technique to convert programs in and out of SSA form.

Tuesday
Dec012009

Calling C++ in LIR

JIT compiled code needs to call C++ methods to do the heavy lifting. Metadata about C++ methods such as the parameters, the return type, etc is needed to create LIR that can call into those methods. Tamarin does this through the use of the CallInfo data structure:

struct CallInfo
{
    uintptr_t   _address;        // Address of the method
    uint32_t    _argtypes:27;    // 9 3-bit fields indicating arg type
    AbiKind     _abi:3;
 
    verbose_only ( const char* _name; )
};

The _address field is the address of the C++ method.

The _argtypes field is a bit encoding of the number of parameters, the types of each parameter, and the return type. For example, consider a C++ method that has the declaration:

void someFunction(int someParameter, double otherParameter);

 

Void types are represented by the decimal number 0 (binary 000). Integer types are represented by the decimal number 2 (binary 010). Double types are represented by the decimal number 1 (binary 001). The _argtypes, in binary would look like:

        010 | 001 | 000

The parameters are laid out in reverse order, with the return type being the right most 3 bits. Since there are only 27 bits, LIR can only make calls to a C++ method that have at most 8 parameters.

The AbiKind represents how the C++ method is called.

enum AbiKind {
    ABI_FASTCALL,
    ABI_THISCALL,
    ABI_STDCALL,
    ABI_CDECL
};

 

FastCall stores a few parameters into registers instead of pushing them onto the stack. A THISCALL means that the C++ method is part of a class, and requires that the instance of an object be passed in. I haven't seen STDCALL be used at all. A CDECL call stands for C declaration, where all parameters are pushed onto the stack prior to calling the method.

All these CallInfo structures are manually created and maintained in Tamarin in the file core/jit-calls.h. Consider the C++ method declaration:

static void AvmCore::atomWriteBarrier(MMgc::GC *gc, const void *container, 
Atom *address, Atom atomNew);

 

The macro declaration to create the CallInfo structure for atomWriteBarrier is:

FUNCTION(FUNCADDR(AvmCore::atomWriteBarrier), SIG4(V,P,P,P,A), atomWriteBarrier)

 

Each of these funky words is another macro:

  • FUNCTION - AvmCore::atomWriteBarrier uses the ABI_CDECL calling convention.
  • FUNCADDR - This is a static method.
  • SIG4(V,P,P,P,A) - Represents the CallInfo::_argtypes field. V is a void method. The next 3 parameters are (P)ointer types. The last parameter is an (A)tom type.
  • atomWriteBarrier - The name of the method, which is used for debugging purposes.

The current list of C++ methods all manually created and maintained, which is a really annoying hassle. With the LLVM bitcode to LIR translator, the number of CallInfos grows dramatically because C++ methods usually call lots of other C++ methods. Consider the atomWriteBarrier code:

void AvmCore::atomWriteBarrier(MMgc::GC *gc, const void *container, Atom *address, Atom atomNew)
{ 
    decr_atom(*address);
    incr_atom(gc, container, atomNew);
    *address = atomNew;
}

 

There are no CallInfo structures for decr_atom() and incr_atom(), but they have to be created. If a targeted inline C++ method calls a C++ method that doesn't already have a CallInfo structure, a new CallInfo for the newly called C++ method must be created.

While the list can manually be created, it would be horrendously annoying. Instead, we automatically create a new file called jit-calls-generated.h that contains all the new generated CallInfo structures.

Automatically creating a CallInfo structure:

Consider the C++ method declaration for decr_atom:

static void decr_atom(Atom const a);

 

The declaration in LLVMbitcode:

define void @AvmCore::decr_atomEi(i32 %a); 

 

When translating bitcode to LIR, the bitcode contains the parameter and return types. It also provides the name of the method. The last missing piece of the AbiKind.

LLVM bitcode has an explicit FASTCALL modifier for methods that use the FASTCALL calling convention. However, bitcode contains no explicit distinction between a CDECL and THISCALL calling convention. The call site in bitcode only says that a pointer is being passed into a method. The distinction between a CDECL/THISCALL is found at the function definition, NOT declaration.

The LLVM function definition for a THISCALL looks like:

define i32 @Toplevel::add2(%"struct.avmplus::Toplevel"* %this, i32 %leftOperand, i32 %rightOperand); 

 

The first parameter is explicitly named "this". If the first parameter of a function definition is named "this", then the C++ method uses the THISCALL AbiKind. This detection scheme is safe because "this" is a keyword in C++, and you can't name a value "this". The function declaration would show the name of the instance being passed in, rather than the name "this", therefore being incorrect.

What can't be called:

While the majority of C++ methods can be called from LIR, there are some limitations. A C++ object's constructor cannot be called as it is against the C++ specification to get the address of a constructor (C++ Standard, section 12.1.12). At the moment, we create a C++ wrapper that calls the constructor. LIR then calls the wrapper. A polymorphic call cannot be called as it's a runtime lookup based on a virtual method table.

Wednesday
Nov112009

Example C++ to LIR Translation

There are a lot of translation steps to go from C++ to bitcode to LIR. This post hopefully solidifies everything with a concrete example. Consider ActionScript source that adds two variables and assigns the sum to another variable.

var a;
var b;
var sum = a + b; 

 

The LIR that is normally generated is:

left = load left[0]
right = load right[0]
add = icall #add ( left, right )
 
store vars[32] = add

 

First, the two variables left and right, which are a and b in the AS3 source code, are loaded. Next, a call to the VM C++ method add is generated. The result of add is stored into LIR vars[32], which is a location in memory and represents the AS3 variable sum.

Instead of calling Add, the JIT should inline the method. To do that, C++ has to be converted to LIR. The VM C++ method that does the actual add has the source:

Atom Toplevel::add(Atom left, Atom right)
{
    BITCODE_INLINEABLE  // Indicate that we want to translate this method to LIR
    if (areNumbers(left, right)) {        
return
addNumbers(left, right);
    }
    else {
        return addUnknown(left,right);
    }
}

 

The add method is part of a C++ object named Toplevel. ActionScript values are modeled as Atoms in C++. An Atom is a 32 bit integer with the bottom 3 bits used for type information. The C++ source checks to see what the "+" operator does depending on the types of the two values. Once Tamarin is compiled with llvm, it produces bitcode that looks like:

define i32 @add2(%"struct.avmplus::Toplevel"* %this, i32 %left, i32 %right) {
entry:
    call void @enableBitcodeInlining() // The macro expansion of BITCODE_INLINEABLE
 
    %0 = call i8 @areNumbers(%"struct.avmplus::Toplevel"* %this, i32 %left, i32 %right) 
    %toBool = icmp eq i8 %0, 0      
 
    // if true go to addNumbers, otherwise addUnknown
    br %toBool, label %addNumbers, label %addUnknown    
 
addNumbers:     
    %1 = call i32 @addNumbers(%"struct.avmplus::Toplevel"* %this, i32 %left, i32 %right) 
    ret i32 %1
 
addUnknown:       
    %2 = call i32 @addUnknown(%"struct.avmplus::Toplevel"* %this, i32 %left, i32 %right) 
    ret i32 %2
}

 

The LLVM bitcode is in SSA form and retains all of the type information and control flow. The call to enableBitcodeInlining is the C++ source macro BITCODE_INLINEABLE. The static translator looks for a call to enableBitcodeInlining as an indicator to translate the method to LIR. The llvm type i32 represents an integer, which is what a C++ Atom really is. Finally, the resulting LIR once the C++ add method is inlined into the LIR instead of being called is:

left = load leftp[0]
right = load rightp[0]
 
inline( left, right )
    retVal = stackAlloc 4
    isNumberAdd = icall #areNumbers (left, right)
 
    eq1 = eq isNumberAdd, 0
    jump true eq1 -> addNumbersLabel
    jump false eq1 -> addUnknownLabel
 
addNumbersLabel:
    addNumbers = icall #addNumbers (left, right)
    store retVal[0] = addNumbers
    jump -> endInline 
 
addUnknownLabel:
    addUnknown = icall #addUnknown (left,right)
    store retVal[0] = addUnknown
    jump -> endInline 
 
endInline:
 
ld5 = load retVal[0]
sti vars[32] = ld5

 

The LIR still needs to load the left and right operands prior to inlining the method. Space is then allocated on the stack for the return value of the method. The C++ return statements become LIR store values at the allocated stack location. The C++ call statements remain LIR call statements, unless they too are explicitly inlined. In the original LIR, the value returned from call add was stored into the location vars[32]. Now, the return value is loaded from retVal and stored again into vars[32], completing the inlining of C++ add.

Although not in this example, the translator takes the same approach used by return statements for Phi functions. LIR doesn't have a Phi opcode. Instead, stores are pushed up to the basic block where the value is flowing into the Phi function. The actual Phi turns into a load from the store location.

Also, the translator always follows LLVM semantics and creates both the true and false branches in LIR. Future work is to optimize away one of the branches and add a LIR_phi instruction.