Twitter
Essays
Random

Entries in Better Know a Virtual Machine (32)

Monday
Nov072011

A Look at Go Locks

Let's take a look at how GO implements locks in the actual runtime. After Go's 6g compiler generates x86 code (in file.6), it calls into Go's runtime, which is linked in once a 6l to finally generate a binary executable. For example, any Go routine finally calls the scheduler, which locks the scheduler object, and executes the Go routine. How are basic locks implemented in Go? Let's look at the lock() routine:

   1:  runtime::lock(Lock *l)
   2:  {
   3:      if(m->locks < 0)
   4:          runtime::throw("lock count");
   5:      m->locks++;
   6:   
   7:      if(runtime::xadd(&l->key, 1) > 1) {    // someone else has it; wait
   8:          // Allocate semaphore if needed.
   9:          if(l->sema == 0)
  10:              initsema(&l->sema);
  11:          runtime::mach_semacquire(l->sema);
  12:      }
  13:  }

First off, Go uses semaphores as the global locking scheme. The number of semaphores depends on the number of CPUs on the host machine. Let's take a look at the Lock data structure:

   1:  struct    Lock
   2:  {
   3:      uint32    key;
   4:      uint32    sema;    
   5:  };

The Lock data structure just keeps count of the number of available resources. Next Lock() attempts to grab the semaphore via a call to xadd(). Let's see what xadd does:

   1:  asm.s
   2:  // uint32 xadd(uint32 volatile *val, int32 delta)
   3:  // Atomically:
   4:  //    *val += delta;
   5:  //    return *val;
   6:  TEXT runtime::xadd(SB), 7, $0
   7:      MOVQ    8(SP), BX
   8:      MOVL    16(SP), AX
   9:      MOVL    AX, CX
  10:      LOCK
  11:      XADDL    AX, 0(BX)
  12:      ADDL    CX, AX
  13:      RET

Great, now we can finally trace the call stack to see what's actually happening down at the machine level. Amazing that they hand wrote the assembly. There are two interesting instructions here: LOCK and XADDL. How do these things work? From the Intel assembly manual:

LOCK - Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multi-processor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

So the LOCK instruction will atomically lock the next instruction, which in this case is the XADDL. Semaphores have counters based on the number of available resources. LOCK will make the XADDL atomic. Conceptually, we are locking the semaphore counter, then atomically adds a number to it.

XADDL - Exchanges the first operand (destination operand) with the second operand (source operand), then loads the sum of the two values into the destination operand. The destination operand can be a register or a memory location; the source operand is a register.

XADDL increments the semaphore. Notice that the hand written assembly code actually ensures that the add occurs on the AX register, which is where the return value must be. The only question is why are we adding a number instead of subtracting? If nobody has the lock, the value in l->key should be 0. Adding 1 to it means we can grab the key (1 > 1 test fails). If the value is greater than 1, it means multiple processes are waiting for the resource. It doesn't seem to be asking the question, how many resources are left? Instead it's asking how many processes are waiting on me?

   1:  thread.c
   2:  void
   3:  runtime::mach_semacquire(uint32 sem)
   4:  {
   5:      int32 r;
   6:   
   7:      while((r = runtime::mach_semaphore_wait(sem)) != 0) {
   8:          if(r == KERN_ABORTED)    // interrupted
   9:              continue;
  10:          macherror(r, "semaphore_wait");
  11:      }
  12:  }

Go finally calls the OS X semaphore wait() routine to grab the lock on the semaphore. So it looks like we first grab a Lock on the lock object, which finally locks the semaphore from the OS by calling semaphore_wait_trap(). This makes sense because Go locks go routines, which are mapped onto multiple OS threads. What I think is happening is that the Go runtime locks an application object first (e.g. the scheduler). Then, it tries to grab the semaphore for the currently running thread (I think), at which point the code requesting the lock is free to execute.

   1:  sys.s
   2:  // uint32 mach_semaphore_wait(uint32)
   3:  TEXT runtime::mach_semaphore_wait(SB),7,$0
   4:      MOVL    8(SP), DI
   5:      MOVL    $(0x1000000+36), AX    // semaphore_wait_trap
   6:      SYSCALL
   7:      RET

Unlocking

Now let's see how GO unlocks. It's pretty much the same thing in reverse. The unlock procedure first tries to grab the Lock object and subtract one from the lock count.

   1:  void
   2:  runtime::unlock(Lock *l)
   3:  {
   4:      m->locks--;
   5:      if(m->locks < 0)
   6:          runtime::throw("lock count");
   7:   
   8:      if(runtime::xadd(&l->key, -1) > 0) {    // someone else is waiting
   9:          // Allocate semaphore if needed.
  10:          if(l->sema == 0)
  11:              initsema(&l->sema);
  12:          runtime::mach_semrelease(l->sema);
  13:      }
  14:  }

Now let's checkout mach_semrelease() , which performs a syscall to semaphore_signal()

   1:  thread.c
   2:  void
   3:  runtime::mach_semrelease(uint32 sem)
   4:  {
   5:      int32 r;
   6:   
   7:      while((r = runtime::mach_semaphore_signal(sem)) != 0) {
   8:          if(r == KERN_ABORTED)    // interrupted
   9:              continue;
  10:          macherror(r, "semaphore_signal");
  11:      }
  12:  }

And the final source for mac_semaphore_signal():

   1:  sys.s
   2:  // uint32 mach_semaphore_signal(uint32)
   3:  TEXT runtime::mach_semaphore_signal(SB),7,$0
   4:      MOVL    8(SP), DI
   5:      MOVL    $(0x1000000+33), AX    // semaphore_signal_trap
   6:      SYSCALL
   7:      RET

Now we can lock and unlock all we want.

Tuesday
Nov012011

Inspecting GO's Compiled Code

I wanted to learn a few things about concurrency before I graduate so I'm on a mission to look at Go and Rust. First stop is Go. Go has a lot of cool features, but the most interesting are related to goroutines and channels. I also want to see how they make goroutines and channel operations fast. And of course their compiler.

Their compiler first parses GO source and compiles x86 machine code as the VM traverses the AST. It's a surprisingly simple compiler that does not seem to perform a whole lot of optimizations. It very much has the flavor of a subroutine threaded interpreter. Lots of calls to the C runtime to do most of the heavy lifting. Consider the following GO source code:

   1:  func sum(a []int, c chan int) {
   2:      sum := 0
   3:       for _, v := range a {
   4:               sum += v
   5:       }
   6:       c <- sum  // send sum to c
   7:  }

Given an array of integers, iterate over the array and sum up all the elements in the array. Instead of returning the value of sum, we send it over a channel. Now let's take a look at the generated machine code:

   1:  JMP     3 // jump into loop
   2:  JMP     18 // Jump out of loop
   3:   
   4:  // Compare if we're done with loop
   5:  INCL    AX
   6:  CMPL    AX,DI
   7:  JGE     2
   8:   
   9:  // Loop body
  10:  MOVL    (CX),BP
  11:  ADDQ    $4,CX
  12:  ADDL    BP,SI
  13:   
  14:  // Jump back to loop header
  15:  JMP     5
  16:   
  17:  // Past loop
  18:  MOVQ    $type.chan int+0(SB),(SP)
  19:  ...
  20:  CALL    ,runtime.chansend1+0(SB)

It's a very weird control flow graph. Most loops look like this:

But go adds another block:

Go's loop exit condition actually jumps to another jump instruction which finally jumps to the point after the loop. What's going on? Why would there be a second jump unless it's just bad optimization or a bug. Maybe it's because the for loop is built to iterate over both the key and value of the array, not just the value? There's another block to test something on the keys? Let's try changing the GO source code to:

   1:  for i, v := range a {
   2:      sum += v
   3:      sum += i
   4:  }

And the result:

   1:  JMP     3
   2:  JMP     18

We still have the double jump.

Sunday
Oct302011

GO's 6g Front End Compiler

I've been looking a bit through GO's 6g compiler. Their compiler starts by parsing the AST and generating machine code at each node which is later linked by 6l. The 6g / 6l names come from the Plan 9 compiler. Once they start parsing the AST, they use the Sethi-Ullman algorithm to minimize the number of registers used to generate code. 

I'm really struck by three things when looking at the GO compiler. First, why are there so many single letter variable names? It must be an old school C hacker thing. Second, I wish Eclipse + CDT had a better debugger. Finally, I know the language is called go, but why so many gotos? I've never seen a program make such judicious use of goto.

Thursday
Apr212011

Tamarin on LLVM - More Numbers

After much tweaking, profiling, and hacking, we finally have a decent range of benchmarks and performance numbers of running Tamarin with LLVM's JIT. The initial performance of LLVM's JIT, which showed a neutral result, stay the same across both the V8 and Sunspider benchmark suites across fully typed, partially typed, and untyped ActionScript code.

Methodology

Each benchmark is compared against NanoJIT on fully type annotated ActionScript code. First, we fully type annotated the ActionScript source and looped the benchmark so that execution takes 30 seconds with NanoJIT. We then modified each of the partially typed and untyped benchmarks to execute the same number of loop iterations. All evaluations are performed on a 32 bit version of Windows 7, using an Intel Core 2 Duo E8400 3.0 Ghz Processor and 4 gigs of ram.

We have three test configurations:

  • An unmodified tamarin-redux branch with NanoJIT.
  • A base TESSA configuration which performs no high level optimizations and is a straightforward translation from TESSA IR to LLVM IR.
  • An optimized TESSA Configuration which does primitive method inlining, type inference, and finally dead code elimination. The optimized TESSA IR is then translated to LLVM IR. 

Every benchmark graph in the following sections are always compared to NanoJIT on typed code. A value of 100% indicates that the configuration is equal to NanoJIT on typed code. A value less than 100% means that the configuration is slower than NanoJIT and a value greater than 100% means that the configuration is faster than NanoJIT. All images are thumbnails and can be clicked on for a bigger version.

Full Disclosure: The v8/earley-boyer, and the sunspider benchmarks regexp-dna, date-format-xparb, string-base64, string-tagcloud, and s3d-raytrace are not represented in the following charts. V8 earley-boyer, regexp-dna, and string-base64 had known verifier bugs and are fixed in an updated tamarin-redux branch. The sunspider benchmarks date-format-xparb, and string-tagcloud use eval, which is not supported in ActionScript. The s3d-raytrace benchmark has a known JIT bug in the Tamarin-LLVM branch.

Fully Typed Code

When we look at the geometric mean across all test cases, we see that the performance numbers aren't really moving that much. However, looking at the individual test cases show a very wide sway either way. Overall, LLVM's JIT is roughly equal to NanoJIT. Let's first look at the big wins.

 Sunspider bitwise-and is 2.6X times faster with LLVM than with NanoJIT because of register allocation. The test case is a simple loop that performs one bitwise operation on one value. LLVM is able to keep both the loop counter and the bitwise and value in registers while NanoJIT must load and store each value across each loop iteration. v8/Crypto's performance gains originate from inlining the number to integer, which is detailed here. We see that type inference buys us a little more performance because we keep some temporary values in numbers rather than downcasting them to integers, buying a little more performance.

The biggest loss comes in the sunspider controlflow-recursive benchmark, which calculates the Fibonacci number. At each call to fib(), the property definition for fib() has to be resolved then called. However, the property lookup is a PURE method, which means it has no side effects, and can be common sub expression eliminated. LLVM's current common subexpression eliminator cannot eliminate the lookup call and must execute twice as many lookups as NanoJIT, resulting in the performance loss.

The other big hit is with v8 / splay, where type inference actually makes us lose performance. This originates from a fair number of poor type inference decisions in our type inference algorithm, creating more expensive type conversions than necessary. For example, two variables are declared an integer, but we determine one value is an integer and the other an unsigned integer. To actually compare the two variables, we have to convert both to floating point numbers and perform a floating compare.

Partially Typed Code

Partially typed code has all global variables and method signatures fully type annotated. All local variables in a method remain untyped. Here are the results:

 The most striking thing is that you instantly lose 50% of your performance just because of the lack of type information. Thankfully, some basic type inference is able to recover ~20% of the performance loss. However, the same story arises. NanoJIT is able to hold it's own against LLVM's JIT as both have roughly equal performance. We also start to see the payoffs with doing high level optimizations such as type inference.

The big wins again come in the bitops benchmarks for the same reason LLVM won in the typed benchmark. LLVM has a better register allocator and can keep values in registers rather than loading / storing values at each loop iteration. TESSA with type inference is able to win big on the sunspider/s3d and sunspider/math benchmarks for two key reasons. First, we're able to discern that some variables are arrays. Next, each array variable is being indexed by an integer rather than an any value. We can then compile array access code rather than generic get/set property code, enabling the performance gain. We also win quite a bit in v8/deltablue because we inline deltablue's many small methods and infer types through the inlined method.

However, we suffer big performance losses in cases such as access-nbody because the benchmark is dominated by property access code. Once we cannot statically bind any property names, property lookup and resolution become the major limiting factor in the benchmark. Finally, the biggest performance loss due to TESSA optimizations occurs in math-cordic for an unfortunate reason. We are able to determine one variable is a floating point number type, while another variable remains the any type. To compare the two values, we have to convert the number type to the any type at each loop iteration. The conversion also requires the GC to allocate 8 bytes of memory. So we suffer performance both from the actual conversion itself and the extra GC work required to hold the double.

Finally, we see that some benchmarks don't suffer at all from the lack of type information such as the string and regexp benchmarks. These benchmarks do not stress JIT compiled code and instead only stress VM C++ library code.

Untyped Code

Untyped code is essentially JavaScript code with all variables untyped. And the results are:

 Overall, you lose another 5-10% going from partially typed code to untyped code, and our high level optimizations aren't able to recover as much as we hoped. The general trend however is the same: LLVM's JIT and NanoJIT have essentially the same performance without high level optimizations. The big wins occur on the same benchmarks, and the performance wins/losses occur on the same tests for the same reasons, just all the performance numbers are lower.

LLVM Optimization Levels

When I first saw the results, I thought I must really be doing something incredibly wrong. I read the LLVM docs, Google FUed my way around to see what knobs LLVM had that could change performance. While most gcc flags work with LLVM, the most important one was the optimization level. LLVM has four optimization levels that generally correspond to GCC's -O flag: NONE (-O0), LESS (-O1), DEFAULT (-O2), and AGGRESSIVE (-O3). Here are the results of tweaking LLVM's optimization level with the V8 suite:

All results are against the baseline NONE level. We see that LLVM's optimizations are improving performance quite a bit, especially in the crytpo benchmark. However, after the LESS optimizations, we unfortunately, aren't seeing any performance gains. Thus, all benchmarks shown previously used the LESS optimization levels.

What next?

Overall, we see that LLVM's JIT isn't buying us much performance in the context of the Tamarin VM. Our experience leads us to the same conclusion as Reid Kleckner with his reflections on unladen-swallow. You really need a JIT that can take advantage of high level optimizations. We'll probably leave LLVM's JIT and convert TESSA IR into NanoJIT LIR and improve NanoJIT as much we can.

Wednesday
Nov102010

Tamarin on LLVM

LLVM is a great and mature piece of software with lots of support from numerous avenues. Naturally, one of the many ideas that floats around is how would Flash perform if the backend was LLVM instead of NanoJIT? [1]. LLVM performs many more optimizations than NanoJIT and has proven to be rock solid. NanoJIT was built to compile code fast [2], not necessarily the best performing. In theory, LLVM should easily beat NanoJIT right? My summer internship was to find out.

Architecture

There are a couple of ways to replace the backend with LLVM. The first would be a direct one to one translation from an input ABC file to LLVM IR. From there, LLVM could take care of the rest. This would be the most straightforward method of comparing NanoJIT with LLVM as NanoJIT translates ABC to LIR, then compiles x86 machine code. However, we would also waste an opportunity to do a nice and clean architecture that would enable us to plugin multiple backends.

Instead, we convert an ABC file into a new intermediate representation in SSA. This representation is code named Type Enriched SSA (TESSA), which is a new object oriented IR that will retain and use all the high level information about an ActionScript program. The TESSA IR itself is very much a work in progress. So far, there isn't any type enriching occurring, and you can think of TESSA as an object oriented instruction hierarchy in SSA for the rest of this post. Once the ABC is translated to TESSA, we "lower" the TESSA IR into LLVM IR, at which point LLVM generates native machine code and can perform all of it's optimizations. The semantics of the LLVM IR we create are almost identical to the NanoJIT LIR that is created in the normal Tamarin branch.

Results

LLVM is an object oriented approach to compiler construction and thus takes a bit longer to generate machine code than NanoJIT. LLVM has virtually the same amount of time to startup as NanoJIT, but compiling each method takes longer, which adds up over the course of a benchmark. This is a problem since the V8 suite runs in a few hundred milliseconds. Since we really are interested in the performance of the generated machine code rather than compilation time, we need to minimize the time spent compiling methods. We can do this by looping each V8 benchmark a few thousand times so that each benchmark takes about 30 seconds to execute. All test cases have manual type annotations as well. All tests were run on a mid 2010 Macbook Pro 2.66 Ghz Core i7, running on Windows 7 64 bit with LLVM 2.8. The results are:

We see a rather neutral result. Across the six benchmarks, LLVM wins or loses by a very small percentage, some of which is system noise. LLVM wins big on Richards but loses about the same amount on Crypto. What's really going on?

After some runs on VTune, we find that both Splay and RegExp spend most of their time in the VM and spend less than 2% of total execution time in jitted code. Raytrace spends 10% of it's time in jitted code and we see that LLVM is a little bit faster than NanoJIT. This means that LLVM can indeed produce better performing jit compiled code than NanoJIT. DeltaBlue, Richards, and Crypto spend around 30-50% of total execution time in jit compiled code and are the only test cases that are interesting to study.

DeltaBlue shows barely any difference because the benchmark contains many small method calls. Each method only performs a small calculation such as a comparison on a field or one branch. The execution time is dominated by the overhead of the method calls and therefore the types of optimizations LLVM performs won't help much because there isn't much to optimize. 

Richards is an interesting test because this is where LLVM wins by a decent margin. Here LLVM's optimization passes do quite a bit, with the biggest win coming from the fact that LLVM IR is in SSA while NanoJIT isn't. NanoJIT has to spill values out of registers in loops while LLVM is smart enough to keep them in registers, reducing the number of loads and stores in the hottest method by 25%. The whole +10% can't be attributed to one loop, but LLVM makes small gains across all the methods in Richards. 

Crypto is dominated by one method that performs numerous arithmetic operations in a tight loop in one method that dominates 90% of the execution time in jit compiled code. LLVM is losing here because of an unfortunate instruction scheduling decision which causes 65% more CPU pipeline stalls. However, after inlining the number to integer conversion, and applying Werner Sharp's patch to inline the vector get/set methods, LLVM's code no longer has the pipeline stall. NanoJIT and LLVM hit parity in this case.

Overall, the results are great for NanoJIT. It also means that we have to work on other parts of the VM until Tamarin sees big improvements on the V8 suite. Be forewarned, this isn't a conclusive experiment in any way shape or form, and evaluating a whole back end on 4 test cases isn't overwhelming evidence. As we build up TESSA to support more benchmarks, we may find LLVM pull ahead by a wide margin. Until then, it looks like a rather neutral result.

  1. LightSpark by Alessandro Pignotti is an open source Flash Player implementation that uses LLVM.
  2. Below is a comparison of compilation time of LLVM versus NanoJIT. This chart is measured in multiples, not percentage. (Eg Crypt compiles 6x slower with LLVM than NanoJIT). Note that the compilation time cannot be completely attributed to LLVM. The TESSA IR heavily uses the visitor pattern to traverse the object graph which contributes significantly to the compilation time. I don't have any hard numbers on how much time is spent in LLVM versus the TESSA IR.

    I interpret this finding as LLVM isn't necessarily slow versus NanoJIT compiles extremely fast. For example, the Richards benchmark takes 0.035 seconds to execute one iteration with NanoJIT. LLVM takes 0.110.

Updated to include compilation time.