A Look at Go Locks

Let's take a look at how Go implements locks in the actual runtime. Go's 6g compiler generates x86-64 code (in file.6), which calls into Go's runtime; the runtime is linked in by 6l to produce the final binary executable. For example, every goroutine ultimately goes through the scheduler, which locks the scheduler object before executing the goroutine. How are basic locks implemented in Go? Let's look at the lock() routine:

    void
    runtime·lock(Lock *l)
    {
        if(m->locks < 0)
            runtime·throw("lock count");
        m->locks++;

        if(runtime·xadd(&l->key, 1) > 1) {    // someone else has it; wait
            // Allocate semaphore if needed.
            if(l->sema == 0)
                initsema(&l->sema);
            runtime·mach_semacquire(l->sema);
        }
    }
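
Before going further, here's a minimal sketch of the same fast-path/slow-path idea written in Go itself, with sync/atomic standing in for xadd and a buffered channel standing in for the Mach semaphore. All the names (mutex, newMutex, and so on) are mine for illustration; this is not the runtime's actual code:

    package main

    import "sync/atomic"

    // mutex mirrors the runtime's Lock: key counts claimants, and sema
    // stands in for the lazily allocated Mach semaphore.
    type mutex struct {
        key  uint32
        sema chan struct{} // capacity 1: at most one wakeup is ever in flight
    }

    func newMutex() *mutex {
        // The runtime allocates the semaphore lazily in initsema();
        // for simplicity we allocate it up front.
        return &mutex{sema: make(chan struct{}, 1)}
    }

    // lock is the analogue of runtime·lock: the atomic add is the fast
    // path, and only contended callers fall through to the semaphore.
    func (m *mutex) lock() {
        if atomic.AddUint32(&m.key, 1) > 1 {
            <-m.sema // someone else has it; wait (mach_semacquire)
        }
    }

We'll fill in the matching unlock() once we get to the unlock path below.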

First off, Go builds its locking on top of OS semaphores: each Lock lazily allocates its own semaphore via initsema() the first time it sees contention. Let's take a look at the Lock data structure:

    struct    Lock
    {
        uint32    key;
        uint32    sema;
    };

The Lock structure holds a counter, key, and a handle to its lazily allocated semaphore, sema. Next, lock() atomically bumps the counter via a call to xadd(). Let's see what xadd does:

    asm.s
    // uint32 xadd(uint32 volatile *val, int32 delta)
    // Atomically:
    //    *val += delta;
    //    return *val;
    TEXT runtime·xadd(SB), 7, $0
        MOVQ    8(SP), BX
        MOVL    16(SP), AX
        MOVL    AX, CX
        LOCK
        XADDL    AX, 0(BX)
        ADDL    CX, AX
        RET

Great, now we can finally trace the call stack to see what's actually happening down at the machine level. Amazing that they hand-wrote the assembly. There are two interesting pieces here: the LOCK prefix and the XADDL instruction. How do they work? From the Intel assembly manual:

LOCK - Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multi-processor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.

So the LOCK prefix makes the instruction it accompanies, in this case XADDL, execute atomically: the processor holds exclusive use of the shared memory operand for the duration of the instruction. Conceptually, we lock the semaphore counter and add to it in one indivisible step.
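
To see why that atomicity matters, here's a small Go experiment (my own throwaway program, not from the runtime) comparing a plain, racy increment with sync/atomic's add, which compiles down to exactly this LOCK XADDL sequence on amd64:

    package main

    import (
        "fmt"
        "sync"
        "sync/atomic"
    )

    func main() {
        var plain, locked uint32
        var wg sync.WaitGroup
        for i := 0; i < 1000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                plain++                      // unsynchronized read-modify-write: updates can be lost
                atomic.AddUint32(&locked, 1) // LOCK XADDL: never loses an update
            }()
        }
        wg.Wait()
        fmt.Println(plain, locked) // plain often falls short of 1000; locked is always 1000
    }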

XADDL - Exchanges the first operand (destination operand) with the second operand (source operand), then loads the sum of the two values into the destination operand. The destination operand can be a register or a memory location; the source operand is a register.

XADDL increments the counter and leaves the *old* value in AX, which is why the hand-written assembly follows up with ADDL CX, AX: it folds the delta back in so that AX, the return-value register, holds the new value. The only question is why we add instead of subtract. If nobody has the lock, l->key is 0; adding 1 yields 1, the "> 1" test fails, and we own the lock. A result greater than 1 means other threads already hold or are waiting on the lock. So key isn't answering "how many resources are left?" but rather "how many threads want this lock?"
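
sync/atomic.AddUint32 has the same return-the-new-value semantics as xadd, so we can trace the counter through a contended lock in a few lines (again, a toy of mine, not runtime code):

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    func main() {
        var key uint32                                  // an unheld lock: key == 0
        fmt.Println(atomic.AddUint32(&key, 1))          // 1: "> 1" fails, we own the lock
        fmt.Println(atomic.AddUint32(&key, 1))          // 2: a contender sees > 1 and must wait
        fmt.Println(atomic.AddUint32(&key, ^uint32(0))) // 1: unlock's -1; > 0 means a waiter remains
    }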

    thread.c
    void
    runtime·mach_semacquire(uint32 sem)
    {
        int32 r;

        while((r = runtime·mach_semaphore_wait(sem)) != 0) {
            if(r == KERN_ABORTED)    // interrupted
                continue;
            macherror(r, "semaphore_wait");
        }
    }

Go finally calls OS X's semaphore wait routine to block on the semaphore. So we first try to take the Lock with the atomic increment, and only under contention do we go to the OS by calling semaphore_wait_trap(). This makes sense because Go multiplexes goroutines onto multiple OS threads. What I think is happening is that the Go runtime locks an application object first (e.g. the scheduler); if another thread already holds it, the current thread sleeps on the Mach semaphore until the holder signals it, at which point the code requesting the lock is free to execute.

    sys.s
    // uint32 mach_semaphore_wait(uint32)
    TEXT runtime·mach_semaphore_wait(SB),7,$0
        MOVL    8(SP), DI
        // 0x1000000 selects the Mach trap class on x86-64 OS X;
        // trap 36 is semaphore_wait_trap.
        MOVL    $(0x1000000+36), AX    // semaphore_wait_trap
        SYSCALL
        RET

Unlocking

Now let's see how Go unlocks. It's pretty much the same thing in reverse: the unlock procedure decrements the per-thread lock count, then atomically subtracts one from the Lock's key.

    void
    runtime·unlock(Lock *l)
    {
        m->locks--;
        if(m->locks < 0)
            runtime·throw("lock count");

        if(runtime·xadd(&l->key, -1) > 0) {    // someone else is waiting
            // Allocate semaphore if needed.
            if(l->sema == 0)
                initsema(&l->sema);
            runtime·mach_semrelease(l->sema);
        }
    }
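
That completes the picture, so we can finish the Go sketch started back in the lock section; as before, the names are mine and this is only an illustration:

    // unlock is the analogue of runtime·unlock. In sync/atomic,
    // adding ^uint32(0) (all ones, i.e. -1 in two's complement) is
    // the idiom for decrementing an unsigned counter, mirroring
    // xadd(&l->key, -1).
    func (m *mutex) unlock() {
        if atomic.AddUint32(&m.key, ^uint32(0)) > 0 {
            m.sema <- struct{}{} // someone else is waiting; wake one (mach_semrelease)
        }
    }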

Now let's check out mach_semrelease(), which performs a syscall to semaphore_signal():

    thread.c
    void
    runtime·mach_semrelease(uint32 sem)
    {
        int32 r;

        while((r = runtime·mach_semaphore_signal(sem)) != 0) {
            if(r == KERN_ABORTED)    // interrupted
                continue;
            macherror(r, "semaphore_signal");
        }
    }

And the final source for mach_semaphore_signal():

    sys.s
    // uint32 mach_semaphore_signal(uint32)
    TEXT runtime·mach_semaphore_signal(SB),7,$0
        MOVL    8(SP), DI
        MOVL    $(0x1000000+33), AX    // semaphore_signal_trap
        SYSCALL
        RET

Now we can lock and unlock all we want.
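
As a sanity check, here's the sketch mutex from above being exercised (with fmt and sync added to its imports). It's a toy, but it shows the lock/unlock pairing in action:

    func main() {
        m := newMutex()
        counter := 0
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                m.lock()
                counter++ // protected by m
                m.unlock()
            }()
        }
        wg.Wait()
        fmt.Println(counter) // always 100
    }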

Inspecting Go's Compiled Code

I wanted to learn a few things about concurrency before I graduate, so I'm on a mission to look at Go and Rust. First stop: Go. Go has a lot of cool features, but the most interesting ones relate to goroutines and channels. I also want to see how they make goroutine and channel operations fast. And, of course, their compiler.

Their compiler parses Go source and emits x86-64 machine code as it traverses the AST. It's a surprisingly simple compiler that does not seem to perform a whole lot of optimizations, and the output very much has the flavor of a subroutine-threaded interpreter: lots of calls into the C runtime to do most of the heavy lifting. Consider the following Go source code:

    func sum(a []int, c chan int) {
        sum := 0
        for _, v := range a {
            sum += v
        }
        c <- sum // send sum to c
    }

Given a slice of integers, we iterate over it and sum up the elements. Instead of returning the value of sum, we send it over a channel. Now let's take a look at the generated machine code:

   1:  JMP     3 // jump into loop
   2:  JMP     18 // Jump out of loop
   3:   
   4:  // Compare if we're done with loop
   5:  INCL    AX
   6:  CMPL    AX,DI
   7:  JGE     2
   8:   
   9:  // Loop body
  10:  MOVL    (CX),BP
  11:  ADDQ    $4,CX
  12:  ADDL    BP,SI
  13:   
  14:  // Jump back to loop header
  15:  JMP     5
  16:   
  17:  // Past loop
  18:  MOVQ    $type.chan int+0(SB),(SP)
  19:  ...
  20:  CALL    ,runtime.chansend1+0(SB)
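
For reference, here's the kind of main that drives sum, splitting a slice between two goroutines and receiving both partial sums over the channel. This driver is my own addition (it needs an fmt import), not part of the compiled listing:

    func main() {
        a := []int{7, 2, 8, -9, 4, 0}
        c := make(chan int)
        go sum(a[:len(a)/2], c)
        go sum(a[len(a)/2:], c)
        x, y := <-c, <-c // receive both partial sums from c
        fmt.Println(x, y, x+y)
    }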

Back to the listing: it's a very weird control flow graph. Most compiled loops branch from the exit test directly to the code past the loop. But Go adds another block: the loop's exit condition jumps to a second jump instruction (line 2 above), which finally jumps to the point past the loop (line 18). What's going on? Why would there be a second jump, unless it's just a missed optimization or a bug? Maybe it's because the for-range loop is built to iterate over both the index and the value of the slice, not just the value, and there's another block to test something on the indices? Let's try changing the Go source to:

    for i, v := range a {
        sum += v
        sum += i
    }

And the result:

   1:  JMP     3
   2:  JMP     18

We still have the double jump.

Go's 6g Front-End Compiler

I've been looking a bit through Go's 6g compiler. It parses Go source into an AST, then walks the tree, generating machine code at each node; the output is later linked by 6l. The 6g/6l names come from the Plan 9 toolchain, where 6 is the code for amd64. While walking the AST, they use the Sethi-Ullman algorithm to minimize the number of registers needed by the generated code.
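
As a refresher, Sethi-Ullman numbering labels each node of an expression tree with the minimum number of registers needed to evaluate it without spilling, then evaluates the needier subtree first. Here's a tiny Go fragment of the labeling rule (my own types, nothing like 6g's internals):

    // node is a binary expression-tree node; both children are nil
    // for a leaf (a variable or constant).
    type node struct {
        left, right *node
    }

    // label returns the Sethi-Ullman number of n.
    func label(n *node) int {
        if n.left == nil && n.right == nil {
            return 1 // a leaf needs a single register
        }
        l, r := label(n.left), label(n.right)
        if l == r {
            return l + 1 // must hold one child's result while computing the other
        }
        if l > r {
            return l // evaluate the needier child first and reuse its registers
        }
        return r
    }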

I'm really struck by three things when looking at the Go compiler. First, why are there so many single-letter variable names? It must be an old-school C hacker thing. Second, I wish Eclipse + CDT had a better debugger. Finally, I know the language is called Go, but why so many gotos? I've never seen a program make such liberal use of goto.