Stopping All Threads in a VM

Why would you ever want to stop all the threads in a VM? If you've ever read anything about garbage collection, you might have heard of stop-the-world garbage collectors: pause execution of the running program, do some VM-specific work, and then continue execution. Another instance is when you want to JIT compile some code and deoptimize [paper] back to a normal execution state. In order to invalidate or JIT compile any code, you have to ensure that no threads are currently executing in the block of memory you will overwrite. How do we do this? In this post, I'm going to focus on the use case where you want to overwrite a block of memory and patch in new JIT-compiled code.

Conceptually, there are two ways to stop all threads in a running system. Both require signals: you mprotect some area of memory and catch the resulting signal. You can either (a) mark the target code page as no-execute, or (b) instrument the code to load from a dedicated page of memory and read-protect that page when you want to stop the threads. Once all the threads are stopped, you are free to change the page however you want.

Why does this work? Essentially, every thread has its own registers and stack, and each thread is executing at a different program counter (PC) location. All we really need is to ensure that no thread's current PC is inside the memory page we want to change. By making a thread catch a signal, the thread's PC jumps into a signal handler, evicting the thread from the target memory page. Now all that's left is implementing it!

Conceptually, we have five steps:

  1. Install the signal handler
  2. mprotect the target memory page 
  3. Catch the signal (stops all threads) 
  4. Alter the code in memory 
  5. Resume all threads

Installing the Signal Handler

For the following example, we're going to do what the HotSpot Java VM and the Maxine VM do: assume the compiled code always performs a load from a dedicated page of memory. When we want to raise the signal, we read-protect that page, and the next load raises it. More on why we use this technique is in the next section.

There are two functions you can use to install a signal handler: signal() or sigaction(). From what I've been reading on the internet, signal() is deprecated on Linux and it's better to use sigaction(). sigaction() is also more flexible than signal() because you have more control over what happens when a signal is delivered.

The other piece of information we need is which signal to catch. A load error on both OS X and Linux raises the SIGSEGV signal. If you've ever seen a segfault crash, it's because the program didn't catch a SIGSEGV signal. To install a signal handler for SIGSEGV, let's use the following code snippet:

    #include <signal.h>
    #include <string.h>

    void installSignalHandlers() {
        struct sigaction signalInfo;
        memset(&signalInfo, 0, sizeof(signalInfo)); // start from a clean struct
        signalInfo.sa_sigaction = catchSignal;      // our handler, defined below
        sigemptyset(&signalInfo.sa_mask);           // don't block any extra signals while handling
        signalInfo.sa_flags = SA_SIGINFO;           // use the three-argument sa_sigaction handler
        sigaction(SIGSEGV, &signalInfo, NULL);      // install the handler for SIGSEGV
    }

What is this code doing? First, sigaction is a struct provided by <signal.h>. The sa_sigaction field points to the function that will handle the signal when SIGSEGV is delivered. sa_mask lets you describe which signals should be blocked while the handler runs. Setting SA_SIGINFO in sa_flags tells sigaction to use the three-argument sa_sigaction handler, in this case our catchSignal function, rather than the plain sa_handler. Finally, the call to sigaction() installs the handler. You can read more in the manpage for sigaction.

Mprotecting the Target Page

Now that we can handle the signal, we can let the threads continue executing as is. However, suppose we want to change the currently executing code for a function foo. We want to stop all the threads, so we have to trigger the signal. We could just call raise(), but the load from a dedicated page gives VMs one nice trick: the load is placed at a safepoint, which is a point where the VM can safely stop the world and then restore it once its business is finished. A VM cannot stop the world at every PC location, only at safepoints. Safepoints are points in the program that carry metadata about the program state, so that we can restore the program to its correct state after the VM's work is done. That's why instrumented loads are used instead of just raising the signal. Now let's mprotect the target page:

    #include <assert.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void markPageUnreadable(void* location) {
        uintptr_t address = (uintptr_t) location;
        // mprotect only accepts page-aligned addresses, so round down to the page start.
        void* pageStart = (void*) (address - (address % getpagesize()));
        // PROT_NONE: no read, write, or execute access; any load will now fault.
        int result = mprotect(pageStart, getpagesize(), PROT_NONE);
        assert(result == 0);
    }

mprotect only works at page granularity, and the address you pass it must be page aligned. The instrumented load is not necessarily page aligned, which means you have to find the page boundary before you can mprotect the page. Most Unix-like systems provide getpagesize(), which tells you how big a page of memory is. So all you need to do is take the target location, round it down to the page boundary, and mprotect that page with no access. Now, once a thread tries to load from the page, the signal is raised automatically! There is a small nuance here: once we mprotect the page, the threads do not all stop immediately. Each thread only stops the next time it tries to load from the protected page, so the VM must wait and check that every thread has stopped, but this usually happens very quickly. Another note is that we can't actually guarantee that the instrumented load would be the only load from an arbitrary page, so VMs allocate a whole page and dedicate it solely to this safepoint/deopt purpose.
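
To make this concrete, here is a rough sketch, with made-up names rather than HotSpot's or Maxine's actual code, of how a VM might reserve a dedicated poll page and have JIT-compiled code load from it at each safepoint:

    #include <sys/mman.h>
    #include <unistd.h>

    // Hypothetical: one page reserved solely for safepoint polls.
    static volatile char* safepointPage;

    void allocateSafepointPage() {
        // mmap returns page-aligned memory, so the whole page is ours to protect.
        safepointPage = mmap(NULL, getpagesize(), PROT_READ,
                             MAP_PRIVATE | MAP_ANON, -1, 0);
    }

    // The JIT emits the equivalent of this load at every safepoint. While the
    // page is readable it is a cheap no-op; once the VM read-protects the page
    // (markPageUnreadable above), the next poll faults and the thread is
    // diverted into the SIGSEGV handler.
    void safepointPoll() {
        (void) *safepointPage;
    }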

Catching the Signal

Remember that when we installed the handler, we set sigaction.sa_sigaction = catchSignal? This means we need a catchSignal function that actually does some work. The nice thing is that each thread automatically ends up in this function, which guarantees that its PC is now outside the protected page. Inside the signal handler you can deopt, collect garbage, etc. Most threads can just wait while a single thread changes the world. The catchSignal function must conform to the signal handler signature, which is:

void handler(int signum, siginfo_t* info, void* context);   // the context argument points to a ucontext_t

Now let's see a sample implementation:

    void catchSignal(int signum, siginfo_t* sigInfo, void* context) {
        // si_addr is the faulting address, i.e. somewhere inside the protected page.
        void* faultAddress = sigInfo->si_addr;
        if (cleanup) {
            // One designated thread does the real work (recompile, patch, GC, ...).
            JIT::compile(faultAddress);
        } else {
            // Everyone else parks here until the world is restarted.
            wait();
        }
        // Jump back to a safe resume point instead of returning into invalidated code.
        longjmp(..)
    }

sigInfo->si_addr gives us the faulting address, that is, the location in the protected page that the thread tried to load from; the thread's actual PC at the time of the fault is available through the ucontext passed as the third argument. From that PC, the VM can work out which method the thread was executing and therefore which method to recompile. A flag directs a single thread to perform the work while every other thread waits for a sign. Finally, the handler should not simply return to the machine PC where the thread faulted, because that code has been invalidated. setjmp/longjmp is useful for this reason: longjmp restores a previously saved context, so execution can continue somewhere else entirely. So now we have one thread that will recompile the function while every other thread waits for the code to be patched.
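
For instance, on Linux/x86-64 with glibc, a handler can pull the faulting PC out of the ucontext roughly like this. This is only a sketch: the mcontext field names are platform specific (OS X, for example, exposes the same register through uc_mcontext->__ss.__rip), and the helper name is made up.

    #define _GNU_SOURCE   // for REG_RIP in <sys/ucontext.h>
    #include <signal.h>
    #include <ucontext.h>

    // Sketch: recover both the faulting address and the faulting PC (Linux/x86-64).
    void catchSignalWithPC(int signum, siginfo_t* sigInfo, void* context) {
        ucontext_t* uc = (ucontext_t*) context;
        void* faultAddress = sigInfo->si_addr;                      // address the load touched
        void* faultingPC = (void*) uc->uc_mcontext.gregs[REG_RIP];  // where the thread was executing
        // Map faultingPC back to the method that owns it, then deopt/recompile as above.
        (void) signum; (void) faultAddress; (void) faultingPC;
    }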

Alter the Code in Memory

Since the world is effectively stopped and only one thread is executing, the world looks pretty simple. We can now apply standard JIT compilation techniques and patch any area of memory that we want. Once we've recompiled the code, we just have to start the world again.
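
As a sketch of the patch step (illustrative names only, and assuming the code region was allocated page-aligned by the JIT), it can be as simple as flipping the page protections, copying in the new machine code, and flushing the instruction cache:

    #include <string.h>
    #include <sys/mman.h>

    // Sketch: overwrite a JIT-compiled code region with freshly compiled code.
    // Assumes codeStart is page aligned and length covers whole pages.
    void patchCode(void* codeStart, const void* newCode, size_t length) {
        // Make the region writable while we patch it.
        mprotect(codeStart, length, PROT_READ | PROT_WRITE);
        memcpy(codeStart, newCode, length);
        // GCC/Clang builtin: flush the instruction cache over the patched range
        // (a no-op on x86, required on ARM and friends).
        __builtin___clear_cache((char*) codeStart, (char*) codeStart + length);
        // Back to read + execute only.
        mprotect(codeStart, length, PROT_READ | PROT_EXEC);
    }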

Resume All Threads

Now that we've JIT compiled or deoptimized a piece of code, or cleaned up garbage, we want to resume actual program execution, which means telling each thread to stop waiting and resume. There are multiple ways to stop and start threads depending on your threading library. Pthread condition variables, for example, have wait and signal operations (pthread_cond_wait / pthread_cond_signal), which are different from system signals: signaling a condition variable just means "wake up and resume execution".
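
Here is a rough sketch of that handshake with pthread condition variables. The names are illustrative, and a production VM is more careful here, since pthread calls are technically not async-signal-safe to make from inside a signal handler:

    #include <pthread.h>
    #include <stdbool.h>

    // Sketch of the stop/resume handshake with a condition variable.
    static pthread_mutex_t worldLock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t worldResumed = PTHREAD_COND_INITIALIZER;
    static bool worldStopped = false;   // set to true by the VM right before it protects the poll page

    // What the wait() above could look like for every non-patching thread.
    void waitForWorldToResume() {
        pthread_mutex_lock(&worldLock);
        while (worldStopped)
            pthread_cond_wait(&worldResumed, &worldLock);   // sleep until broadcast
        pthread_mutex_unlock(&worldLock);
    }

    // Called once by the patching thread after the code has been rewritten
    // and the safepoint page has been made readable again.
    void resumeAllThreads() {
        pthread_mutex_lock(&worldLock);
        worldStopped = false;
        pthread_cond_broadcast(&worldResumed);              // wake every waiting thread
        pthread_mutex_unlock(&worldLock);
    }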

Conclusion

Today I Learned how to stop the world and patch some code. Done and done.

A Look at Go Locks

Let's take a look at how Go implements locks in the actual runtime. After Go's 6g compiler generates x86-64 code (in file.6), that code calls into Go's runtime, which is linked in by 6l to produce the final binary executable. For example, every goroutine eventually goes through the scheduler, which locks the scheduler object and then executes the goroutine. How are basic locks implemented in Go? Let's look at the lock() routine:

   1:  void runtime·lock(Lock *l)
   2:  {
   3:      if(m->locks < 0)
   4:          runtime·throw("lock count");
   5:      m->locks++;
   6:   
   7:      if(runtime·xadd(&l->key, 1) > 1) {    // someone else has it; wait
   8:          // Allocate semaphore if needed.
   9:          if(l->sema == 0)
  10:              initsema(&l->sema);
  11:              runtime·mach_semacquire(l->sema);
  12:      }
  13:  }

First off, Go uses semaphores as the global locking scheme. The number of semaphores depends on the number of CPUs on the host machine. Let's take a look at the Lock data structure:

   1:  struct    Lock
   2:  {
   3:      uint32    key;
   4:      uint32    sema;    
   5:  };

The Lock data structure keeps a counter (key) and a handle to the underlying semaphore (sema). Next, lock() attempts to grab the lock via a call to xadd(). Let's see what xadd does:

   1:  asm.s
   2:  // uint32 xadd(uint32 volatile *val, int32 delta)
   3:  // Atomically:
   4:  //    *val += delta;
   5:  //    return *val;
   6:  TEXT runtime·xadd(SB), 7, $0
   7:      MOVQ    8(SP), BX
   8:      MOVL    16(SP), AX
   9:      MOVL    AX, CX
  10:      LOCK
  11:      XADDL    AX, 0(BX)
  12:      ADDL    CX, AX
  13:      RET

Great, now we can finally trace the call stack to see what's actually happening down at the machine level. It's amazing that they hand-wrote the assembly. There are two interesting pieces here: the LOCK prefix and the XADDL instruction. How do they work? From the Intel assembly manual:

LOCK - Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multi-processor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

So the LOCK prefix makes the accompanying instruction atomic, which in this case is the XADDL. Semaphores keep counters based on the number of available resources; conceptually, we are locking the counter and then atomically adding a number to it.

XADDL - Exchanges the first operand (destination operand) with the second operand (source operand), then loads the sum of the two values into the destination operand. The destination operand can be a register or a memory location; the source operand is a register.

XADDL increments the counter. Notice that the hand-written assembly ensures that the sum ends up in the AX register, which is where the return value must be. The only question is why we are adding a number instead of subtracting. If nobody has the lock, the value in l->key should be 0; adding 1 yields 1, the > 1 test fails, and we have the lock. If the new value is greater than 1, it means other threads already hold or are waiting for the lock. So the counter isn't answering the question "how many resources are left?"; it's answering "how many threads are holding or waiting on this lock?"
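
To summarize the counting scheme before tracing into mach_semacquire (shown just below), here is a minimal, hypothetical C sketch of the same idea, with a GCC atomic builtin standing in for xadd and a POSIX semaphore standing in for the Mach semaphore:

    #include <semaphore.h>

    // Hypothetical sketch of Go's scheme: key counts the holder plus any waiters.
    typedef struct {
        unsigned int key;   // 0 = free, 1 = held, >1 = held with waiters queued
        sem_t sema;         // waiters sleep here (a Mach semaphore in the real runtime),
                            // initialized once with sem_init(&l->sema, 0, 0)
    } SketchLock;

    void sketch_lock(SketchLock* l) {
        // Like Go's xadd above: atomically add the delta and return the new value.
        if (__sync_add_and_fetch(&l->key, 1) > 1)
            sem_wait(&l->sema);     // someone else has it; go to sleep
    }

    void sketch_unlock(SketchLock* l) {
        // If anyone was queued behind us, wake exactly one waiter.
        if (__sync_sub_and_fetch(&l->key, 1) > 0)
            sem_post(&l->sema);
    }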

   1:  thread.c
   2:  void
   3:  runtime·mach_semacquire(uint32 sem)
   4:  {
   5:      int32 r;
   6:   
   7:      while((r = runtime·mach_semaphore_wait(sem)) != 0) {
   8:          if(r == KERN_ABORTED)    // interrupted
   9:              continue;
  10:          macherror(r, "semaphore_wait");
  11:      }
  12:  }

Go finally calls the OS X semaphore wait routine to block on the semaphore. So it looks like we first bump the counter on the Lock object, and only under contention do we fall back to the OS by calling semaphore_wait_trap(). This makes sense because Go maps goroutines onto multiple OS threads. What I think is happening is that the Go runtime locks an application object first (e.g. the scheduler); then, if the lock is contended, the current OS thread sleeps on the semaphore (I think), and once it is signaled the code requesting the lock is free to execute.

   1:  sys.s
   2:  // uint32 mach_semaphore_wait(uint32)
   3:  TEXT runtime·mach_semaphore_wait(SB),7,$0
   4:      MOVL    8(SP), DI
   5:      MOVL    $(0x1000000+36), AX    // semaphore_wait_trap
   6:      SYSCALL
   7:      RET

Unlocking

Now let's see how Go unlocks. It's pretty much the same thing in reverse: the unlock procedure decrements the lock count and, if anyone is still waiting, releases the semaphore.

   1:  void
   2:  runtime·unlock(Lock *l)
   3:  {
   4:      m->locks--;
   5:      if(m->locks < 0)
   6:          runtime·throw("lock count");
   7:   
   8:      if(runtime·xadd(&l->key, -1) > 0) {    // someone else is waiting
   9:          // Allocate semaphore if needed.
  10:          if(l->sema == 0)
  11:              initsema(&l->sema);
  12:              runtime·mach_semrelease(l->sema);
  13:      }
  14:  }

Now let's check out mach_semrelease(), which performs a syscall to semaphore_signal():

   1:  thread.c
   2:  void
   3:  runtime·mach_semrelease(uint32 sem)
   4:  {
   5:      int32 r;
   6:   
   7:      while((r = runtime·mach_semaphore_signal(sem)) != 0) {
   8:          if(r == KERN_ABORTED)    // interrupted
   9:              continue;
  10:          macherror(r, "semaphore_signal");
  11:      }
  12:  }

And the final source for mach_semaphore_signal():

   1:  sys.s
   2:  // uint32 mach_semaphore_signal(uint32)
   3:  TEXT runtime·mach_semaphore_signal(SB),7,$0
   4:      MOVL    8(SP), DI
   5:      MOVL    $(0x1000000+33), AX    // semaphore_signal_trap
   6:      SYSCALL
   7:      RET

Now we can lock and unlock all we want.

Inspecting Go's Compiled Code

I wanted to learn a few things about concurrency before I graduate so I'm on a mission to look at Go and Rust. First stop is Go. Go has a lot of cool features, but the most interesting are related to goroutines and channels. I also want to see how they make goroutines and channel operations fast. And of course their compiler.

Their compiler parses Go source and emits x86-64 machine code as it traverses the AST. It's a surprisingly simple compiler that does not seem to perform a whole lot of optimizations. It very much has the flavor of a subroutine-threaded interpreter, with lots of calls into the C runtime to do most of the heavy lifting. Consider the following Go source code:

   1:  func sum(a []int, c chan int) {
   2:      sum := 0
   3:      for _, v := range a {
   4:          sum += v
   5:      }
   6:      c <- sum  // send sum to c
   7:  }

Given an array of integers, iterate over the array and sum up all the elements in the array. Instead of returning the value of sum, we send it over a channel. Now let's take a look at the generated machine code:

   1:  JMP     3 // jump into loop
   2:  JMP     18 // Jump out of loop
   3:   
   4:  // Compare if we're done with loop
   5:  INCL    AX
   6:  CMPL    AX,DI
   7:  JGE     2
   8:   
   9:  // Loop body
  10:  MOVL    (CX),BP
  11:  ADDQ    $4,CX
  12:  ADDL    BP,SI
  13:   
  14:  // Jump back to loop header
  15:  JMP     5
  16:   
  17:  // Past loop
  18:  MOVQ    $type.chan int+0(SB),(SP)
  19:  ...
  20:  CALL    ,runtime.chansend1+0(SB)

It's a very weird control flow graph. Most loops have a single conditional branch that either continues the loop or exits past it, but Go adds another block: the loop exit condition jumps to another jump instruction (line 2), which finally jumps to the point after the loop (line 18). What's going on? Why would there be a second jump, unless it's just bad optimization or a bug? Maybe it's because the for loop is built to iterate over both the key and the value of the array, not just the value, and there's another block to test something on the keys? Let's try changing the Go source code to:

   1:  for i, v := range a {
   2:      sum += v
   3:      sum += i
   4:  }

And the result:

   1:  JMP     3
   2:  JMP     18

We still have the double jump.

Go's 6g Front End Compiler

I've been looking a bit through Go's 6g compiler. The compiler starts by parsing the source into an AST and generating machine code at each node, which is later linked by 6l. The 6g / 6l names come from the Plan 9 toolchain. While walking the AST to generate code, they use the Sethi-Ullman algorithm to minimize the number of registers needed, as sketched below.
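
The textbook version of Sethi-Ullman numbering is easy to sketch (this is the generic algorithm, not 6g's actual code): each node is labeled with the minimum number of registers needed to evaluate its subtree without spilling, and the code generator then evaluates the more register-hungry subtree first.

    // Sketch of textbook Sethi-Ullman numbering over a binary expression tree.
    // Assumes every internal node has exactly two children.
    typedef struct Node {
        struct Node *left, *right;   // both NULL for a leaf
    } Node;

    int sethiUllman(Node* n) {
        if (n->left == NULL && n->right == NULL)
            return 1;                              // a leaf fits in one register
        int l = sethiUllman(n->left);
        int r = sethiUllman(n->right);
        // Evaluate the costlier subtree first and reuse its freed registers;
        // if both sides need the same amount, one extra register is required.
        if (l == r)
            return l + 1;
        return l > r ? l : r;
    }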

I'm really struck by three things when looking at the Go compiler. First, why are there so many single-letter variable names? It must be an old-school C hacker thing. Second, I wish Eclipse + CDT had a better debugger. Finally, I know the language is called Go, but why so many gotos? I've never seen a program make such judicious use of goto.

Tamarin on LLVM - More Numbers

After much tweaking, profiling, and hacking, we finally have a decent range of benchmarks and performance numbers for running Tamarin with LLVM's JIT. The initial performance of LLVM's JIT, which showed a neutral result, stays the same across both the V8 and Sunspider benchmark suites, for fully typed, partially typed, and untyped ActionScript code.

Methodology

Each benchmark is compared against NanoJIT on fully type-annotated ActionScript code. First, we fully type annotated the ActionScript source and looped each benchmark so that execution takes 30 seconds with NanoJIT. We then modified the partially typed and untyped versions of each benchmark to execute the same number of loop iterations. All evaluations were performed on a 32-bit version of Windows 7, using an Intel Core 2 Duo E8400 3.0 GHz processor and 4 GB of RAM.

We have three test configurations:

  • An unmodified tamarin-redux branch with NanoJIT.
  • A base TESSA configuration which performs no high level optimizations and is a straightforward translation from TESSA IR to LLVM IR.
  • An optimized TESSA configuration which performs primitive method inlining, type inference, and finally dead code elimination. The optimized TESSA IR is then translated to LLVM IR.

Every benchmark graph in the following sections is compared to NanoJIT on typed code. A value of 100% indicates that the configuration is equal to NanoJIT on typed code, a value less than 100% means that the configuration is slower than NanoJIT, and a value greater than 100% means that the configuration is faster than NanoJIT. All images are thumbnails and can be clicked on for a bigger version.

Full Disclosure: The v8/earley-boyer benchmark and the sunspider benchmarks regexp-dna, date-format-xparb, string-base64, string-tagcloud, and s3d-raytrace are not represented in the following charts. V8 earley-boyer, regexp-dna, and string-base64 hit known verifier bugs that are fixed in an updated tamarin-redux branch. The sunspider benchmarks date-format-xparb and string-tagcloud use eval, which is not supported in ActionScript. The s3d-raytrace benchmark hits a known JIT bug in the Tamarin-LLVM branch.

Fully Typed Code

When we look at the geometric mean across all test cases, we see that the performance numbers aren't really moving that much. However, the individual test cases swing widely in either direction. Overall, LLVM's JIT is roughly equal to NanoJIT. Let's first look at the big wins.

Sunspider bitwise-and is 2.6x faster with LLVM than with NanoJIT because of register allocation. The test case is a simple loop that performs one bitwise operation on one value. LLVM is able to keep both the loop counter and the bitwise-and value in registers, while NanoJIT must load and store each value on every loop iteration. v8/crypto's performance gains originate from inlining the number-to-integer conversion, which is detailed here. Type inference buys us a little more performance on top of that because we keep some temporary values as numbers rather than downcasting them to integers.

The biggest loss comes in the sunspider controlflow-recursive benchmark, which computes Fibonacci numbers. At each call to fib(), the property definition for fib() has to be resolved and then called. The property lookup is a PURE method, which means it has no side effects and can be eliminated by common subexpression elimination. LLVM's current common subexpression eliminator cannot eliminate the lookup call, so it executes twice as many lookups as NanoJIT, resulting in the performance loss.

The other big hit is v8/splay, where type inference actually makes us lose performance. This originates from a fair number of poor decisions in our type inference algorithm, which create more expensive type conversions than necessary. For example, two variables are declared as integers, but we determine that one value is a signed integer and the other an unsigned integer. To actually compare the two variables, we have to convert both to floating-point numbers and perform a floating-point compare.

Partially Typed Code

Partially typed code has all global variables and method signatures fully type annotated. All local variables in a method remain untyped. Here are the results:

The most striking thing is that you instantly lose 50% of your performance just because of the lack of type information. Thankfully, some basic type inference is able to recover ~20% of that loss. However, the same story arises: NanoJIT is able to hold its own against LLVM's JIT, as both have roughly equal performance. We also start to see the payoff of high-level optimizations such as type inference.

The big wins again come in the bitops benchmarks, for the same reason LLVM won in the typed benchmarks: LLVM has a better register allocator and can keep values in registers rather than loading and storing them on each loop iteration. TESSA with type inference wins big on the sunspider/s3d and sunspider/math benchmarks for two key reasons. First, we're able to discern that some variables are arrays. Second, each array variable is indexed by an integer rather than an any value. We can then compile array-access code rather than generic get/set property code, which enables the performance gain. We also win quite a bit in v8/deltablue because we inline deltablue's many small methods and infer types through the inlined methods.

However, we suffer big performance losses in cases such as access-nbody, because the benchmark is dominated by property-access code. Since we cannot statically bind any property names, property lookup and resolution become the major limiting factor in the benchmark. Finally, the biggest performance loss due to TESSA optimizations occurs in math-cordic, for an unfortunate reason. We are able to determine that one variable is a floating-point number, while another variable remains the any type. To compare the two values, we have to convert the number type to the any type on each loop iteration. The conversion also requires the GC to allocate 8 bytes of memory to hold the double. So we suffer both from the actual conversion itself and from the extra GC work required to hold the double.

Finally, we see that some benchmarks, such as the string and regexp benchmarks, don't suffer at all from the lack of type information. These benchmarks do not stress JIT-compiled code and instead only stress the VM's C++ library code.

Untyped Code

Untyped code is essentially JavaScript code with all variables untyped. And the results are:

Overall, you lose another 5-10% going from partially typed code to untyped code, and our high-level optimizations aren't able to recover as much as we hoped. The general trend, however, is the same: LLVM's JIT and NanoJIT have essentially the same performance without high-level optimizations. The big wins occur on the same benchmarks, and the performance wins and losses occur on the same tests for the same reasons; all the numbers are just lower.

LLVM Optimization Levels

When I first saw the results, I thought I must really be doing something incredibly wrong. I read the LLVM docs and Google-fu'd my way around to see what knobs LLVM had that could change performance. While most gcc-style flags work with LLVM, the most important one was the optimization level. LLVM has four optimization levels that generally correspond to GCC's -O flags: NONE (-O0), LESS (-O1), DEFAULT (-O2), and AGGRESSIVE (-O3). Here are the results of tweaking LLVM's optimization level with the V8 suite:

All results are against the baseline NONE level. We see that LLVM's optimizations improve performance quite a bit, especially in the crypto benchmark. However, beyond the LESS level we unfortunately aren't seeing any further performance gains. Thus, all benchmarks shown previously used the LESS optimization level.

What next?

Overall, we see that LLVM's JIT isn't buying us much performance in the context of the Tamarin VM. Our experience leads us to the same conclusion as Reid Kleckner in his reflections on unladen-swallow: you really need a JIT that can take advantage of high-level optimizations. We'll probably leave LLVM's JIT, convert TESSA IR into NanoJIT LIR, and improve NanoJIT as much as we can.