Tamarin on LLVM

LLVM is a great and mature piece of software with lots of support from numerous avenues. Naturally, one of the many ideas floating around is how Flash would perform if the backend were LLVM instead of NanoJIT [1]. LLVM performs many more optimizations than NanoJIT and has proven to be rock solid. NanoJIT was built to compile code fast [2], not necessarily to produce the best-performing code. In theory, LLVM should easily beat NanoJIT, right? My summer internship was to find out.

Architecture

There are a couple of ways to replace the backend with LLVM. The first would be a direct one-to-one translation from an input ABC file to LLVM IR; from there, LLVM could take care of the rest. This would be the most straightforward way to compare NanoJIT with LLVM, as NanoJIT translates ABC to LIR and then compiles the LIR to x86 machine code. However, we would also waste an opportunity to build a clean architecture that would let us plug in multiple backends.

Instead, we convert an ABC file into a new intermediate representation in SSA form. This representation is code-named Type Enriched SSA (TESSA), a new object-oriented IR that retains and uses all the high-level information about an ActionScript program. The TESSA IR itself is very much a work in progress. So far there isn't any type enriching occurring, so for the rest of this post you can think of TESSA as an object-oriented instruction hierarchy in SSA form. Once the ABC is translated to TESSA, we "lower" the TESSA IR into LLVM IR, at which point LLVM can perform all of its optimizations and generate native machine code. The semantics of the LLVM IR we create are almost identical to the NanoJIT LIR created in the normal Tamarin branch.
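To make the pipeline concrete, here is a minimal Python sketch of the idea; this is not actual Tamarin or TESSA code, and all class and method names are invented. It shows an object-oriented instruction hierarchy in which each value is defined exactly once (the SSA property), lowered to a flat, LLVM-like textual IR by a visitor:

```python
# Hypothetical sketch of a TESSA-style pipeline: an object-oriented
# instruction hierarchy in SSA form, lowered to a flat IR via a visitor.
# None of these names come from the actual Tamarin/TESSA source.

class Instruction:
    def accept(self, visitor):
        raise NotImplementedError

class ConstInt(Instruction):
    def __init__(self, name, value):
        self.name, self.value = name, value
    def accept(self, visitor):
        return visitor.visit_const(self)

class Add(Instruction):
    def __init__(self, name, lhs, rhs):
        self.name, self.lhs, self.rhs = name, lhs, rhs
    def accept(self, visitor):
        return visitor.visit_add(self)

class LowerToTextualIR:
    """Walks the instruction graph and emits a flat, LLVM-like text IR."""
    def __init__(self):
        self.lines = []
    def visit_const(self, inst):
        self.lines.append(f"%{inst.name} = const i32 {inst.value}")
        return f"%{inst.name}"
    def visit_add(self, inst):
        lhs = inst.lhs.accept(self)
        rhs = inst.rhs.accept(self)
        self.lines.append(f"%{inst.name} = add i32 {lhs}, {rhs}")
        return f"%{inst.name}"

# Lower "t2 = 1 + 2": each value is defined exactly once (SSA).
root = Add("t2", ConstInt("t0", 1), ConstInt("t1", 2))
lowering = LowerToTextualIR()
root.accept(lowering)
print("\n".join(lowering.lines))
```

A real lowering would of course emit LLVM IR through LLVM's builder APIs rather than strings, but the double-dispatch structure is the same one a visitor-based IR traversal uses.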

Results

LLVM takes an object-oriented approach to compiler construction and thus takes a bit longer to generate machine code than NanoJIT. LLVM takes virtually the same amount of time to start up as NanoJIT, but compiling each method takes longer, which adds up over the course of a benchmark. This is a problem since the V8 suite runs in a few hundred milliseconds. Since we are really interested in the performance of the generated machine code rather than compilation time, we need to minimize the fraction of time spent compiling methods. We can do this by looping each V8 benchmark a few thousand times so that each benchmark takes about 30 seconds to execute. All test cases have manual type annotations as well. All tests were run on a mid-2010 MacBook Pro with a 2.66 GHz Core i7, running Windows 7 64-bit with LLVM 2.8. The results are:

We see a rather neutral result. Across the six benchmarks, LLVM wins or loses by a very small percentage, some of which is system noise. LLVM wins big on Richards but loses about the same amount on Crypto. What's really going on?

After some runs on VTune, we find that both Splay and RegExp spend most of their time in the VM and less than 2% of total execution time in jitted code. Raytrace spends 10% of its time in jitted code, and we see that LLVM is a little bit faster than NanoJIT. This means that LLVM can indeed produce better-performing jit-compiled code than NanoJIT. DeltaBlue, Richards, and Crypto spend around 30-50% of total execution time in jit-compiled code and are the only test cases that are interesting to study.

DeltaBlue shows barely any difference because the benchmark contains many small method calls. Each method performs only a small calculation, such as a comparison on a field or a single branch. Execution time is dominated by the overhead of the method calls, so the kinds of optimizations LLVM performs won't help much because there isn't much to optimize.

Richards is an interesting test because this is where LLVM wins by a decent margin. Here LLVM's optimization passes accomplish quite a bit, with the biggest win coming from the fact that LLVM IR is in SSA form while NanoJIT's LIR isn't. NanoJIT has to spill values out of registers in loops, while LLVM is smart enough to keep them in registers, reducing the number of loads and stores in the hottest method by 25%. The whole +10% can't be attributed to one loop; rather, LLVM makes small gains across all the methods in Richards.
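The SSA property doing the work here is easy to illustrate: in SSA form every assignment defines a fresh value with exactly one definition, which gives the register allocator a precise view of each value's live range instead of a mutable variable whose lifetime is hard to bound. Here is a toy Python sketch of SSA renaming for straight-line code; it is purely illustrative and is not how Tamarin or LLVM actually implement SSA construction:

```python
# Toy illustration of SSA renaming for straight-line code: every
# assignment defines a fresh version of the variable, so each value has
# exactly one definition. With single definitions, a register allocator
# can see each value's full live range and keep it in a register
# instead of spilling. (Real SSA construction also needs phi nodes at
# control-flow joins, which this straight-line sketch omits.)

def to_ssa(stmts):
    """stmts: list of (target, used_vars) pairs for straight-line code."""
    version = {}           # variable -> latest SSA version number
    out = []
    for target, uses in stmts:
        # Uses refer to the latest version of each variable.
        renamed_uses = [f"{v}{version[v]}" for v in uses]
        # The definition creates a brand-new version of the target.
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", renamed_uses))
    return out

# "x = ...; y = x; x = x + y" becomes x1; y1 = f(x1); x2 = f(x1, y1)
program = [("x", []), ("y", ["x"]), ("x", ["x", "y"])]
print(to_ssa(program))
```

Because `x1` and `x2` are distinct values rather than one reused slot, nothing forces the allocator to treat `x` as live across the whole loop, which is exactly the spill behavior the Richards numbers reflect.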

Crypto is dominated by a single method that performs numerous arithmetic operations in a tight loop and accounts for 90% of the execution time in jit-compiled code. LLVM loses here because of an unfortunate instruction scheduling decision that causes 65% more CPU pipeline stalls. However, after inlining the number-to-integer conversion and applying Werner Sharp's patch to inline the vector get/set methods, LLVM's code no longer has the pipeline stall, and NanoJIT and LLVM reach parity in this case.

Overall, the results are great for NanoJIT. They also mean that we have to work on other parts of the VM before Tamarin sees big improvements on the V8 suite. Be forewarned: this isn't a conclusive experiment in any way, shape, or form, and evaluating a whole backend on four test cases isn't overwhelming evidence. As we build up TESSA to support more benchmarks, we may find LLVM pulling ahead by a wide margin. Until then, it looks like a rather neutral result.

  1. LightSpark by Alessandro Pignotti is an open source Flash Player implementation that uses LLVM.
  2. Below is a comparison of the compilation time of LLVM versus NanoJIT. This chart is measured in multiples, not percentages (e.g., Crypto compiles 6x slower with LLVM than with NanoJIT). Note that the compilation time cannot be attributed entirely to LLVM. The TESSA IR makes heavy use of the visitor pattern to traverse the object graph, which contributes significantly to the compilation time. I don't have any hard numbers on how much time is spent in LLVM versus the TESSA IR.

    I interpret this finding not as LLVM being slow, but as NanoJIT compiling extremely fast. For example, the Richards benchmark takes 0.035 seconds to execute one iteration with NanoJIT; LLVM takes 0.110 seconds.

Updated to include compilation time.