Tamarin on LLVM - More Numbers

After much tweaking, profiling, and hacking, we finally have a decent range of benchmarks and performance numbers of running Tamarin with LLVM's JIT. The initial performance of LLVM's JIT, which showed a neutral result, stay the same across both the V8 and Sunspider benchmark suites across fully typed, partially typed, and untyped ActionScript code.

Methodology

Each benchmark is compared against NanoJIT on fully type annotated ActionScript code. First, we fully type annotated the ActionScript source and looped the benchmark so that execution takes 30 seconds with NanoJIT. We then modified each of the partially typed and untyped benchmarks to execute the same number of loop iterations. All evaluations are performed on a 32 bit version of Windows 7, using an Intel Core 2 Duo E8400 3.0 Ghz Processor and 4 gigs of ram.

We have three test configurations:

  • An unmodified tamarin-redux branch with NanoJIT.
  • A base TESSA configuration which performs no high level optimizations and is a straightforward translation from TESSA IR to LLVM IR.
  • An optimized TESSA Configuration which does primitive method inlining, type inference, and finally dead code elimination. The optimized TESSA IR is then translated to LLVM IR. 

Every benchmark graph in the following sections are always compared to NanoJIT on typed code. A value of 100% indicates that the configuration is equal to NanoJIT on typed code. A value less than 100% means that the configuration is slower than NanoJIT and a value greater than 100% means that the configuration is faster than NanoJIT. All images are thumbnails and can be clicked on for a bigger version.

Full Disclosure: The v8/earley-boyer, and the sunspider benchmarks regexp-dna, date-format-xparb, string-base64, string-tagcloud, and s3d-raytrace are not represented in the following charts. V8 earley-boyer, regexp-dna, and string-base64 had known verifier bugs and are fixed in an updated tamarin-redux branch. The sunspider benchmarks date-format-xparb, and string-tagcloud use eval, which is not supported in ActionScript. The s3d-raytrace benchmark has a known JIT bug in the Tamarin-LLVM branch.

Fully Typed Code

When we look at the geometric mean across all test cases, we see that the performance numbers aren't really moving that much. However, looking at the individual test cases show a very wide sway either way. Overall, LLVM's JIT is roughly equal to NanoJIT. Let's first look at the big wins.

 Sunspider bitwise-and is 2.6X times faster with LLVM than with NanoJIT because of register allocation. The test case is a simple loop that performs one bitwise operation on one value. LLVM is able to keep both the loop counter and the bitwise and value in registers while NanoJIT must load and store each value across each loop iteration. v8/Crypto's performance gains originate from inlining the number to integer, which is detailed here. We see that type inference buys us a little more performance because we keep some temporary values in numbers rather than downcasting them to integers, buying a little more performance.

The biggest loss comes in the sunspider controlflow-recursive benchmark, which calculates the Fibonacci number. At each call to fib(), the property definition for fib() has to be resolved then called. However, the property lookup is a PURE method, which means it has no side effects, and can be common sub expression eliminated. LLVM's current common subexpression eliminator cannot eliminate the lookup call and must execute twice as many lookups as NanoJIT, resulting in the performance loss.

The other big hit is with v8 / splay, where type inference actually makes us lose performance. This originates from a fair number of poor type inference decisions in our type inference algorithm, creating more expensive type conversions than necessary. For example, two variables are declared an integer, but we determine one value is an integer and the other an unsigned integer. To actually compare the two variables, we have to convert both to floating point numbers and perform a floating compare.

Partially Typed Code

Partially typed code has all global variables and method signatures fully type annotated. All local variables in a method remain untyped. Here are the results:

 The most striking thing is that you instantly lose 50% of your performance just because of the lack of type information. Thankfully, some basic type inference is able to recover ~20% of the performance loss. However, the same story arises. NanoJIT is able to hold it's own against LLVM's JIT as both have roughly equal performance. We also start to see the payoffs with doing high level optimizations such as type inference.

The big wins again come in the bitops benchmarks for the same reason LLVM won in the typed benchmark. LLVM has a better register allocator and can keep values in registers rather than loading / storing values at each loop iteration. TESSA with type inference is able to win big on the sunspider/s3d and sunspider/math benchmarks for two key reasons. First, we're able to discern that some variables are arrays. Next, each array variable is being indexed by an integer rather than an any value. We can then compile array access code rather than generic get/set property code, enabling the performance gain. We also win quite a bit in v8/deltablue because we inline deltablue's many small methods and infer types through the inlined method.

However, we suffer big performance losses in cases such as access-nbody because the benchmark is dominated by property access code. Once we cannot statically bind any property names, property lookup and resolution become the major limiting factor in the benchmark. Finally, the biggest performance loss due to TESSA optimizations occurs in math-cordic for an unfortunate reason. We are able to determine one variable is a floating point number type, while another variable remains the any type. To compare the two values, we have to convert the number type to the any type at each loop iteration. The conversion also requires the GC to allocate 8 bytes of memory. So we suffer performance both from the actual conversion itself and the extra GC work required to hold the double.

Finally, we see that some benchmarks don't suffer at all from the lack of type information such as the string and regexp benchmarks. These benchmarks do not stress JIT compiled code and instead only stress VM C++ library code.

Untyped Code

Untyped code is essentially JavaScript code with all variables untyped. And the results are:

 Overall, you lose another 5-10% going from partially typed code to untyped code, and our high level optimizations aren't able to recover as much as we hoped. The general trend however is the same: LLVM's JIT and NanoJIT have essentially the same performance without high level optimizations. The big wins occur on the same benchmarks, and the performance wins/losses occur on the same tests for the same reasons, just all the performance numbers are lower.

LLVM Optimization Levels

When I first saw the results, I thought I must really be doing something incredibly wrong. I read the LLVM docs, Google FUed my way around to see what knobs LLVM had that could change performance. While most gcc flags work with LLVM, the most important one was the optimization level. LLVM has four optimization levels that generally correspond to GCC's -O flag: NONE (-O0), LESS (-O1), DEFAULT (-O2), and AGGRESSIVE (-O3). Here are the results of tweaking LLVM's optimization level with the V8 suite:

All results are against the baseline NONE level. We see that LLVM's optimizations are improving performance quite a bit, especially in the crytpo benchmark. However, after the LESS optimizations, we unfortunately, aren't seeing any performance gains. Thus, all benchmarks shown previously used the LESS optimization levels.

What next?

Overall, we see that LLVM's JIT isn't buying us much performance in the context of the Tamarin VM. Our experience leads us to the same conclusion as Reid Kleckner with his reflections on unladen-swallow. You really need a JIT that can take advantage of high level optimizations. We'll probably leave LLVM's JIT and convert TESSA IR into NanoJIT LIR and improve NanoJIT as much we can.