The SquirrelFish team has been pretty good at documenting what makes SquirrelFish really fast. Specifically in their blog post announcing SquirrelFish Extreme, they link to the papers that introduced some of the techniques they used. David Mandelin also has a really good explanation on his blog about what these techniques really are, and some sample source code. One of the major reasons SquirrelFish is so fast is because of the design of their bytecodes and dispatch technique. What are bytecodes and the different dispatch techniques? Why does dispatch even matter?
Bytecode is another representation of a language's source code, but in a much more compact fashion. An interpreter executes bytecode in sequence, dispatching or jumping to the next bytecode to execute. The performance overhead of dispatching between opcodes can be very expensive. By changing dispatch techniques, you can have significant performance gains. Python received an average of 15-20% speedup by using threaded code.
You can also reduce dispatch overhead by having fewer opcodes. Thus you get the question of do you want "fat" opcodes which conceptually do many things, or do you want really "skinny" opcodes that are really fine grained. If each opcode does more work, you need fewer opcodes to accomplish something, thus reducing dispatch overhead. SquirrelFish Extreme has really fat opcodes, which allows it to minimize the performance overheard of interpreter dispatch. You can see the definition for every bytecode that SquirrelFish uses on the WebKit website. If you want to go in depth about bytecode dispatch, Anton Ertl breaks down all the different dispatch techniques. If you really want to go the academic route, he also has a paper entitled The Structure and Performance of Efficient Interpreters.