Anatomy of a Low Level Performance Improvement

The past two weeks have been enlightening: we discovered a massive Tamarin performance bottleneck in one of the V8 benchmarks. Consider the following ActionScript code:

public function am(i:int,x:int,w:BigInteger,j:int,c:int,n:int,_this:BigInteger):int {
        var this_array:Vector.<int> = _this.array;
        var w_array:Vector.<int>    = w.array;
 
        var xl:int = x&0x3fff, xh:int = x>>14;
        while(--n >= 0) {
            var l:int = this_array[i]&0x3fff;
            var h:int = this_array[i++]>>14;
            var m:int = xh*l+h*xl;
            l = xl*l+((m&0x3fff)<<14)+w_array[j] + c;
            c = (l>>28)+(m>>14)+xh*h;
            w_array[j++] = l&0xfffffff;
        }
        return c;
}

This method is the most critical path in the Crypto benchmark. Due to the nature of the verifier (which should get fixed), many of the additions are computed as floating point additions. This forces the JIT-compiled code to contain many integer to floating point conversions.

According to the verifier, the result of the addition is a floating point value, even though the actual result is always an integer. Once the result is converted back to an integer, it is used as an index into an integer vector. The net effect is that the JIT-compiled code converts an integer to a float, adds two floating point values, and then converts the sum back to an integer, all inside a very tight loop.
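The round trip can be illustrated outside the JIT. The following C++ sketch is not Tamarin code: `addViaDouble` and `addDirect` are hypothetical names, and it assumes the sum stays within the 32-bit integer range, as it does for the loop indices above. It only shows how many conversions end up wrapped around what should be a single integer add.

```cpp
#include <cstdint>

// Hypothetical illustration of what the JIT-compiled code effectively did.
// The addition is typed as floating point, so both operands are widened to
// double, added, and the sum is truncated back to an integer.
// Assumes the sum fits in int32_t.
inline int32_t addViaDouble(int32_t a, int32_t b) {
    double da = static_cast<double>(a);   // integer -> floating point
    double db = static_cast<double>(b);   // integer -> floating point
    double sum = da + db;                 // floating point addition
    return static_cast<int32_t>(sum);     // floating point -> integer
}

// What an integer-typed addition would amount to: a single add.
inline int32_t addDirect(int32_t a, int32_t b) {
    return a + b;
}
```

Both functions return the same value for in-range inputs; the difference is purely in the work done per iteration.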

To convert a floating point number (the ActionScript Number type) to an integer, Tamarin called a method, integer_d_sse2, which for the most part relied on the x86 instruction cvttsd2si. This single instruction truncates a double-precision floating point value to an integer. However, because of corner cases in the ActionScript spec, integer_d_sse2 had to perform an extra check to ensure that the converted integer was correct. In the Crypto benchmark those corner cases never occurred, so Tamarin was paying the call overhead on every conversion in a very tight loop.

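int AvmCore::integer_d_sse2(double doubleValue) {
    int intval;
    // Truncate the double toward zero into a 32-bit integer.
    _asm {
        cvttsd2si eax, doubleValue
        mov intval,eax
    }
 
    // cvttsd2si yields 0x80000000 when the input is NaN or out of int range,
    // so only that value needs the slower, spec-compliant conversion.
    if (intval != 0x80000000)
        return intval;
    else
        return doubleToInt32(doubleValue);
}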
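The same fast-path/slow-path split can be sketched with compiler intrinsics instead of inline assembly. This is a minimal sketch, not the Tamarin implementation: the names `truncateFast`, `slowDoubleToInt32`, and `doubleToInt32Inline` are hypothetical, the slow path is assumed to follow ECMAScript-style ToInt32 wraparound, and a portable emulation of the 0x80000000 sentinel is included for builds without SSE2.

```cpp
#include <cmath>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Hypothetical slow path, assuming ECMAScript-style ToInt32 semantics:
// NaN and infinities map to 0, everything else wraps modulo 2^32.
inline int32_t slowDoubleToInt32(double d) {
    if (!std::isfinite(d)) return 0;
    double m = std::fmod(std::trunc(d), 4294967296.0);  // in (-2^32, 2^32)
    if (m < 0) m += 4294967296.0;                       // now in [0, 2^32)
    int64_t v = static_cast<int64_t>(m);
    if (v >= 2147483648LL) v -= 4294967296LL;           // to signed range
    return static_cast<int32_t>(v);
}

// Fast path: one truncating conversion. Returns INT32_MIN (0x80000000)
// when the input is NaN or does not fit in a 32-bit integer.
inline int32_t truncateFast(double d) {
#ifdef SKETCH_HAS_SSE2
    return _mm_cvttsd_si32(_mm_set_sd(d));              // emits cvttsd2si
#else
    // Portable emulation of the sentinel behaviour (assumption, non-x86).
    if (d > -2147483649.0 && d < 2147483648.0) return static_cast<int32_t>(d);
    return INT32_MIN;
#endif
}

// Small enough to inline at every call site: the common case costs one
// conversion and one compare; only the sentinel takes the out-of-line path.
inline int32_t doubleToInt32Inline(double d) {
    int32_t v = truncateFast(d);
    if (v != INT32_MIN) return v;
    return slowDoubleToInt32(d);
}
```

Note that a genuine input of -2147483648 also hits the sentinel and takes the slow path, which still returns the right answer; the check trades one rare extra call for a much cheaper common case.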

Once we inlined the x86 cvttsd2si instruction, along with a small check for the corner cases, the performance of the Crypto benchmark increased by 40%. That is 40% from inlining one small method on a hot path. After measuring the improvement from inlining the floating point to integer conversion, Werner Smith decided to inline the integer vector get/set methods as well.

The original C++ integer vector get/set methods are:

int VectorClass::_getNativeIntProperty(int index) const {
    if( m_length <= uint32(index) ) {
        throw new RangeError(...);
    }
    return m_array[index];
}

void VectorClass::_setNativeUintProperty(uint32 index, int value) {
    if (m_length <= index) {
        grow(index+1);
        m_length = index+1;
    }
 
    m_array[index] = value;
}

The inlined version keeps only the check that the index being accessed is within the length of the vector; otherwise, it calls out to an error routine. This inlining optimization improved the performance of the Crypto benchmark by another 20%. In total, by inlining two very small methods, the performance of the V8 Crypto benchmark shot up an impressive 52%.
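As a rough illustration of that shape, here is a minimal C++ sketch. `IntVector`, `throwRangeError`, and `growAndSet` are hypothetical stand-ins rather than Tamarin's VectorClass, and the slow paths are assumed to mirror the original methods shown above: the getter raises a range error, and the setter grows the vector.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the VM's integer vector.
struct IntVector {
    std::vector<int32_t> data;

    // Out-of-line slow paths, kept away from the hot loop.
    [[noreturn]] static void throwRangeError(int32_t index) {
        throw std::out_of_range("index out of range: " + std::to_string(index));
    }
    void growAndSet(uint32_t index, int32_t value) {
        data.resize(static_cast<size_t>(index) + 1);  // new slots are zeroed
        data[index] = value;
    }

    // Inlined fast path: one unsigned compare, then the load. Casting to
    // unsigned folds "negative index" and "index too large" into one test.
    inline int32_t get(int32_t index) const {
        if (static_cast<uint32_t>(index) >= data.size()) throwRangeError(index);
        return data[static_cast<uint32_t>(index)];
    }

    // Inlined fast path: one compare, then the store.
    inline void set(uint32_t index, int32_t value) {
        if (index >= data.size()) { growAndSet(index, value); return; }
        data[index] = value;
    }
};
```

In a tight loop such as the one in am(), every access stays on the fast path, so the per-element cost drops to a compare and a memory access instead of a full method call.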

You can follow the performance improvement in other benchmarks on Bugzilla. Check out this link for inlining the floating point to integer instruction and here for inlining the integer vector get/set property methods.