Tamarin on LLVM

LLVM is a great and mature piece of software with lots of support from numerous avenues. Naturally, one of the many ideas floating around is how Flash would perform if the backend were LLVM instead of NanoJIT [1]. LLVM performs many more optimizations than NanoJIT and has proven to be rock solid. NanoJIT was built to compile code fast [2], not necessarily to produce the best-performing code. In theory, LLVM should easily beat NanoJIT, right? My summer internship was to find out.

Architecture

There are a couple of ways to replace the backend with LLVM. The first would be a direct one-to-one translation from an input ABC file to LLVM IR. From there, LLVM could take care of the rest. This would be the most straightforward way of comparing NanoJIT with LLVM, as NanoJIT translates ABC to LIR and then compiles LIR to x86 machine code. However, we would also waste an opportunity to build a clean architecture that would enable us to plug in multiple backends.

Instead, we convert an ABC file into a new intermediate representation in SSA form. This representation is code-named Type Enriched SSA (TESSA), a new object-oriented IR that retains and uses all the high-level information about an ActionScript program. The TESSA IR itself is very much a work in progress. So far, there isn't any type enriching occurring, so for the rest of this post you can think of TESSA as an object-oriented instruction hierarchy in SSA form. Once the ABC is translated to TESSA, we "lower" the TESSA IR into LLVM IR, at which point LLVM can perform all of its optimizations and generate native machine code. The semantics of the LLVM IR we create are almost identical to the NanoJIT LIR created in the normal Tamarin branch.
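To make the "lowering" step concrete, here is a toy sketch in Python of translating a hypothetical TESSA-style node into LLVM-IR-flavored text. Every class and function name here is invented for illustration; the real TESSA IR and the real LLVM APIs look nothing like this.

```python
class TessaBinOp:
    """A hypothetical TESSA instruction: one SSA definition per node."""
    def __init__(self, name, op, left, right):
        self.name, self.op, self.left, self.right = name, op, left, right

def lower_to_llvm_text(node):
    """Emit an LLVM-IR-flavored string for a TESSA binary op."""
    llvm_ops = {"add": "add i32", "sub": "sub i32", "mul": "mul i32"}
    return "%{} = {} %{}, %{}".format(node.name, llvm_ops[node.op],
                                      node.left, node.right)

# Conceptually, an ABC 'add' on two locals becomes a single SSA
# definition in the lowered IR:
node = TessaBinOp("t1", "add", "x", "y")
print(lower_to_llvm_text(node))  # %t1 = add i32 %x, %y
```

The point of the extra IR layer is exactly this separation: the lowering function is the only piece that knows about LLVM, so a different backend could consume the same TESSA nodes.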

Results

LLVM takes an object-oriented approach to compiler construction and thus takes a bit longer to generate machine code than NanoJIT. LLVM takes virtually the same time to start up as NanoJIT, but compiling each method takes longer, which adds up over the course of a benchmark. This is a problem since the V8 suite runs in a few hundred milliseconds. Since we are really interested in the performance of the generated machine code rather than compilation time, we need to minimize the time spent compiling methods. We do this by looping each V8 benchmark a few thousand times so that each benchmark takes about 30 seconds to execute. All test cases have manual type annotations as well. All tests were run on a mid-2010 MacBook Pro, 2.66 GHz Core i7, running 64-bit Windows 7 with LLVM 2.8. The results are:

We see a rather neutral result. Across the six benchmarks, LLVM wins or loses by a very small percentage, some of which is system noise. LLVM wins big on Richards but loses about the same amount on Crypto. What's really going on?

After some runs in VTune, we find that both Splay and RegExp spend most of their time in the VM and less than 2% of total execution time in jitted code. Raytrace spends 10% of its time in jitted code, and we see that LLVM is a little bit faster than NanoJIT. This means that LLVM can indeed produce better-performing jit-compiled code than NanoJIT. DeltaBlue, Richards, and Crypto spend around 30-50% of total execution time in jit-compiled code and are the only test cases that are interesting to study.

DeltaBlue shows barely any difference because the benchmark contains many small method calls. Each method performs only a small calculation, such as a field comparison or a single branch. The execution time is dominated by the overhead of the method calls, so the kinds of optimizations LLVM performs don't help much because there isn't much to optimize.

Richards is an interesting test because this is where LLVM wins by a decent margin. Here LLVM's optimization passes do quite a bit, with the biggest win coming from the fact that LLVM IR is in SSA form while NanoJIT's LIR isn't. NanoJIT has to spill values out of registers in loops, while LLVM is smart enough to keep them in registers, reducing the number of loads and stores in the hottest method by 25%. The whole +10% can't be attributed to one loop, though; LLVM makes small gains across all the methods in Richards.

Crypto is dominated by a single method that performs numerous arithmetic operations in a tight loop and accounts for 90% of the execution time in jit-compiled code. LLVM loses here because of an unfortunate instruction scheduling decision that causes 65% more CPU pipeline stalls. However, after inlining the number-to-integer conversion and applying Werner Sharp's patch to inline the vector get/set methods, LLVM's code no longer has the pipeline stall. NanoJIT and LLVM hit parity in this case.

Overall, the results are great for NanoJIT. It also means that we have to work on other parts of the VM before Tamarin sees big improvements on the V8 suite. Be forewarned: this isn't a conclusive experiment in any way, shape, or form, and evaluating a whole backend on four test cases isn't overwhelming evidence. As we build up TESSA to support more benchmarks, we may find that LLVM pulls ahead by a wide margin. Until then, it looks like a rather neutral result.

  1. LightSpark by Alessandro Pignotti is an open source Flash Player implementation that uses LLVM.
  2. Below is a comparison of the compilation time of LLVM versus NanoJIT. This chart is measured in multiples, not percentages (e.g., Crypto compiles 6x slower with LLVM than with NanoJIT). Note that the compilation time cannot be completely attributed to LLVM. The TESSA IR heavily uses the visitor pattern to traverse the object graph, which contributes significantly to the compilation time. I don't have any hard numbers on how much time is spent in LLVM versus the TESSA IR.

    I interpret this finding not as LLVM being slow, but as NanoJIT compiling extremely fast. For example, the Richards benchmark takes 0.035 seconds to execute one iteration with NanoJIT; LLVM takes 0.110 seconds.
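The visitor-style traversal blamed above for part of the compilation time can be sketched minimally as follows. The real TESSA class hierarchy is far larger; the class names here are invented for illustration. The key cost is that every node in the object graph pays for a dynamic dispatch.

```python
class Node:
    def accept(self, visitor):
        # Double dispatch: each node routes to a visitor method named
        # after its own class, e.g. visit_ConstNode.
        return getattr(visitor, "visit_" + type(self).__name__)(self)

class ConstNode(Node):
    def __init__(self, value):
        self.value = value

class AddNode(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

class CountingVisitor:
    """Walks the object graph; every node visited costs a dispatch."""
    def __init__(self):
        self.visits = 0
    def visit_ConstNode(self, n):
        self.visits += 1
    def visit_AddNode(self, n):
        self.visits += 1
        n.left.accept(self)
        n.right.accept(self)

v = CountingVisitor()
AddNode(ConstNode(1), AddNode(ConstNode(2), ConstNode(3))).accept(v)
print(v.visits)  # 5
```

Multiply that per-node overhead by every instruction in every method compiled and it is plausible that a nontrivial share of the measured "LLVM" compile time is really spent walking TESSA.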

Updated to include compilation time.

AS/JS Virtual Machine Performance Race

This summer, I'm lucky to be interning at Adobe working on Tamarin. Since I haven't touched Tamarin in a while, I thought it would be worthwhile to see how Tamarin compares to all the other JavaScript virtual machines. Although Tamarin isn't a JavaScript VM, there aren't any other ActionScript VMs to compare it to. Without further ado, the SunSpider benchmark contenders:

  • Tamarin-Central: Stable Flash VM. Runs ActionScript
  • Tamarin-Redux: The tip for Tamarin-Central. Every once in a while, Tamarin-Redux changes are pushed into the Tamarin-Central branch
  • TraceMonkey: Tracing JIT slated for shipment in Firefox 3.1
  • V8: Used by Google Chrome
  • Nitro: Previously named SquirrelFish Extreme, the VM in Safari.
A few notes:
  • Tamarin-Central/Redux AS3 - SunSpider tests written with static typing. Only available for Flash VMs.
  • All VMs are from their respective repositories, checked out 3/22/09 ~6 PM PST, and built as release builds with optimizations
  • All tests were run on a MacBook Pro 2.4 GHz with 4 GB of RAM
  • These tests run really, really fast (some in less than 3 ms)
  • The baseline is the Tamarin-Central interpreter (JIT disabled)
  • Some SunSpider tests are omitted as Tamarin can't run them because of known issues.
  • Take these numbers with a grain of salt :)

TraceMonkey vs SquirrelFish

With the announcement of TraceMonkey, we saw that SpiderMonkey got a huge improvement. However, in previous benchmarks, SpiderMonkey was still slower than SquirrelFish. So, how does TraceMonkey compare to SquirrelFish?

Overall, if you aggregate the time for each test, TraceMonkey is about 15% faster. If you instead take the aggregate speedup, which is the sum of each test's individual speedup divided by the number of tests (i.e., the average per-test speedup), TraceMonkey is about 2.4x faster. Sometimes TraceMonkey is slower, but give them more than 60 days to hash it out and I'm sure it'll get faster.
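The two summary numbers above come from different aggregations, which is why they disagree so much. A short sketch makes the difference clear; the timings here are placeholders, not the actual SunSpider results.

```python
# Hypothetical per-test times in ms (not the real measurements).
tm_times = [100.0, 40.0, 10.0]   # TraceMonkey
sf_times = [120.0, 50.0, 30.0]   # SquirrelFish

# Aggregate-time speedup: ratio of the two total run times. Long
# tests dominate this number.
total_speedup = sum(sf_times) / sum(tm_times)

# Average per-test speedup: mean of each test's individual speedup.
# Every test counts equally, so a big win on one tiny test (here the
# 30 ms vs 10 ms case) pulls the average way up.
per_test = [s / t for s, t in zip(sf_times, tm_times)]
avg_speedup = sum(per_test) / len(per_test)

print(round(total_speedup, 2))  # 1.33
print(round(avg_speedup, 2))    # 1.82
```

With the made-up numbers above, the same data reads as either "33% faster overall" or "1.8x faster on average", which mirrors the 15% vs 2.4x gap in the real results.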


Notes: SquirrelFish and TraceMonkey were release builds. Tests were run on a MacBook Pro 2.4 GHz, 4 GB of RAM, OS X Leopard 10.5.4. TraceMonkey was pulled about an hour ago; SquirrelFish was last night's SVN.

Good job, guys. More info on TraceMonkey at Andreas Gal's blog, Mike Shaver's, and David Anderson's, and for a real visual demo, check out Mike Schroepfer's blog.

SquirrelFish and SpiderMonkey play with SunSpider

How many awesome names can we have? LOL, I love these hybrid animals. A comment on Slashdot says it's like the new Web 2.0 naming scheme. Reminds me of SpiderPig! Anyway, the WebKit guys released their new JavaScript interpreter, SquirrelFish. The benchmarks show that it is pretty fast. Cameron has a post detailing the overall picture, but what about the micro picture? The following is a graph of each SunSpider test run on SquirrelFish (WebKit r34342) and SpiderMonkey 1.8 (built on May 22, 2008 at 13:31:57). The shorter the bar, the better.


If you wish to rerun these tests, here's how to build WebKit. If you specifically want to build just SquirrelFish, use the following commands:

1) WebKitTools/Scripts/build-testkjs
2) The binary is /WebKitBuild/Release/testkjs

Props to the WebKit guys.

* Edit, as Mark mentioned: all tests were run on OS X 10.5.3 on a MacBook Pro 2.4 GHz with 2 GB of RAM. Both interpreters were built on my machine as release builds with optimizations.

A Better View of Tamarin and SpiderMonkey's Interpreter

There was some discussion this morning about whether or not it is possible to get Tamarin's interpreter up to speed with SpiderMonkey's, and if so, how far off Tamarin is. Instead of using JavaGrande, we decided to use SunSpider, since it has some tests that use Objects. If anyone's interested, you can only get SunSpider from their source repository. All benchmarks were done on a MacBook Pro 2.4 GHz with 2 GB of RAM. Both were built as release builds with optimizations enabled.

Note that with Tamarin, the -interp flag has to be set to test the interpreter only. I couldn't find a way to run it through the SunSpider test harness, so I wrote my own bash script that invokes Tamarin 10 times and uses the Unix 'time' results. Also, some of these tests failed.
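For anyone who wants to reproduce this kind of harness without bash, here is a rough Python equivalent: run a command several times and average the wall-clock time. The `avmplus -interp` command line at the bottom is a placeholder, not the exact invocation used for these numbers.

```python
import subprocess
import time

def time_command(cmd, runs=10):
    """Run `cmd` `runs` times; return mean wall time in seconds."""
    elapsed = []
    for _ in range(runs):
        start = time.time()
        subprocess.run(cmd, check=True)  # raises if the test fails
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)

# Usage (placeholder binary and test file):
# print(time_command(["avmplus", "-interp", "crypto-aes.abc"]))
```

Averaging over 10 runs smooths out scheduler noise, though unlike the SunSpider harness it includes VM startup time in every sample.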

The picture below shows how much slower Tamarin is compared to SpiderMonkey. The smaller the bar, the faster Tamarin executed the test. The graphic shows only the tests that passed; there was no test in which Tamarin was faster than SpiderMonkey. Below the picture are the actual numbers.


SpiderMonkey 1.8
built on May 22 2008 at 13:31:57

3d: 347.4ms +/- 0.4%
cube: 126.6ms +/- 0.5%
morph: 118.0ms +/- 0.4%
raytrace: 102.8ms +/- 0.4%

access: 415.5ms +/- 1.5%
binary-trees: 43.3ms +/- 4.0%
fannkuch: 193.3ms +/- 1.4%
nbody: 125.1ms +/- 1.4%
nsieve: 53.8ms +/- 0.6%

bitops: 397.8ms +/- 0.4%
3bit-bits-in-byte: 49.6ms +/- 1.0%
bits-in-byte: 79.1ms +/- 0.3%
bitwise-and: 178.3ms +/- 0.4%
nsieve-bits: 90.8ms +/- 0.7%

controlflow: 35.1ms +/- 0.6%
recursive: 35.1ms +/- 0.6%

crypto: 169.9ms +/- 0.3%
aes: 66.1ms +/- 0.3%
md5: 51.9ms +/- 0.8%
sha1: 51.9ms +/- 0.4%

date: 269.3ms +/- 0.4%
format-tofte: 160.3ms +/- 0.3%
format-xparb: 109.0ms +/- 0.7%

math: 288.5ms +/- 0.3%
cordic: 128.9ms +/- 0.4%
partial-sums: 102.0ms +/- 0.3%
spectral-norm: 57.6ms +/- 0.6%

regexp: 251.9ms +/- 0.2%
dna: 251.9ms +/- 0.2%

string: 594.0ms +/- 0.3%
base64: 68.7ms +/- 0.5%
fasta: 148.7ms +/- 0.2%
tagcloud: 134.2ms +/- 0.8%
unpack-code: 169.4ms +/- 0.2%
validate-input: 73.0ms +/- 0.7%

Tamarin:
3d: 2035 ms
cube: 457 ms
morph: 608 ms
raytrace: 970 ms

access: 3548 ms
binary-trees: 363 ms
fannkuch: 1819 ms
nbody: 1047 ms
nsieve: 319 ms

bitops: 3929 ms
3bit-bits-in-byte: 275 ms
bits-in-byte: 371 ms
bitwise-and: 2454 ms
nsieve-bits: 829 ms

controlflow: 282 ms
recursive: 282 ms

crypto: 1440 ms
aes: 785 ms
md5: 322 ms
sha1: 333 ms

date: Fail
format-tofte: ReferenceError: Error #1065
format-xparb: ReferenceError: Error #1065

math: 2637 ms
cordic: 616 ms
partial-sums: 1584 ms
spectral-norm: 437 ms

regexp: Fail
dna: ArgumentError: Error #1063

string: 56445 ms
base64: ReferenceError: Error #1069
fasta: 837 ms
tagcloud: ReferenceError: Error #1065
unpack-code: 50234 ms
validate-input: 5374 ms

Tamarin is slower. The worst part has to be playing with strings. Ouch.