Project Silk

For the past few months, I've been working on Project Silk, which improves smoothness across the browser. Much like Project Butter for Android, part of it is now live on Firefox OS. Silk does three things:

  1. Align painting with hardware vsync
  2. Resample touch input events based on hardware vsync
  3. Align composites with hardware vsync

What is vsync, why use vsync, and why does it matter at all?

Vertical synchronization (vsync) occurs when the hardware display shows a new frame on the screen. The rate is set by the specific hardware, but most displays refresh 60 times a second, or every 16.6 ms. This is where "60 frames per second" comes from: one frame every time the hardware display refreshes. What this means in practice is that no matter how many frames are produced in software, the hardware display will still only show at most 60 unique frames per second.

Currently, Firefox mimics 60 frames per second, and therefore vsync, with a software timer that schedules rendering every 16.6 ms. However, the software scheduler has two problems: (a) it's noisy, and (b) it can be scheduled at bad times relative to vsync.

In regards to noise, software timers are much noisier than hardware timers. This creates micro-jank for a number of reasons. First, many animations are keyed off timestamps from the software scheduler to update the position of the animation. If you've ever used requestAnimationFrame, you get a timestamp from a software timer. If we want smooth animations, the timestamps provided to requestAnimationFrame should be uniform; non-uniform timestamps will create non-uniform, janky animations. Here is a graph showing software versus hardware vsync timer uniformity:

Wow! That's a lot better with a hardware timer. With hardware timers, we get a much more uniform, and therefore smoother timestamp to key animations off of. So that's problem (a), noisy timers in software versus hardware.

With part (b), software timers can be scheduled at bad times relative to vsync. Regardless of what software does, the hardware display will refresh on its own clock. If our rendering pipeline finishes producing a frame before the next vsync, the display is updated with new content. If we fail to finish producing a frame before the next vsync, the previous frame is displayed again, causing jankiness. Some rendering functions can also start close to vsync and overflow into the next interval, which actually introduces more latency, since the frame won't be displayed on the screen until the vsync after that anyway. Let's look at this in graphic form:

At time 0, we start producing frames. Let's say all frames for the sake of example take a constant time of 10 ms. Our frame budget is 16.6 ms because we only have to finish producing a frame before the hardware vsync occurs. Since frame 1 is finished 6 ms before the next vsync (time t=16 ms), everything is successful and life is good. The frame is produced in time and the hardware display will be refreshed with the updated content.

Now let's look at Frame 2. Since software timers are noisy, we start producing the frame 9 ms before the next vsync (time t=32). Since our frame takes 10 ms to produce, we'll actually finish producing it 1 ms AFTER that vsync. That means at vsync number 2 (t=32), there is no new frame to display, so the display still shows the previous frame. In addition, the frame we just produced won't be shown until vsync 3 (t=48), because that's when the hardware updates itself. This creates jank, since the display has now skipped one frame and will try to catch up in the upcoming frames. It also produces one extra frame of latency, which is terrible for games.
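To make the deadline arithmetic concrete, here's a tiny Python sketch (with a hypothetical `display_vsync` helper, rounding 16.6 ms down to 16 ms the way the examples below do) that computes which vsync a frame actually lands on:

```python
VSYNC_MS = 16  # the examples round 16.6 ms down to 16 ms

def display_vsync(start_ms, produce_ms=10):
    """Return the vsync time at which a frame that starts rendering at
    `start_ms` and takes `produce_ms` to produce actually hits the screen:
    the first vsync at or after the frame is finished."""
    finish = start_ms + produce_ms
    return -(-finish // VSYNC_MS) * VSYNC_MS  # ceiling division

print(display_vsync(0))   # Frame 1: finishes at t=10, shown at t=16
print(display_vsync(23))  # Frame 2: finishes at t=33, misses t=32, shown at t=48
```

Starting 9 ms before a vsync with a 10 ms frame means missing that vsync by 1 ms, and waiting a whole extra interval for the next one.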

Vsync improves both of these problems: we get a much more uniform timer and the maximum amount of frame budget time to produce a new frame. Now that we know what vsync is, we can finally move on to what Project Silk is and why it helps create smooth experiences in Firefox.

The Rendering Pipeline

Gecko's rendering pipeline in super simplified terms does three things:

  1. Paint / draw the new frame on the main thread.
  2. Send the updated content to the Compositor via a LayerTransaction.
  3. Composite the new content.

In an ideal world, we'd be able to do all three steps within 16.6 ms, but that's not the case most of the time. Both steps (1) and (3) were on independent software timers, so there was no real synchronizing clock between the three steps; they were all ad hoc. They also had no relation to vsync, so the timing of the pipeline wasn't related to when the display would actually update the screen with the content. With Silk, we replace both independent software timers with the hardware vsync timer. For our purposes, (2) isn't really affected by Silk; it's only listed here for completeness.

Align Painting with Hardware Vsync

Aligning the timer used to tick the refresh driver with vsync creates smoothness in a couple of ways. First, many animations are still done on the main thread, which means any animation using timestamps to set its position should be smoother. This includes requestAnimationFrame animations! The other nice thing is that we now have a very strict ordering of when rendering is kicked off. Instead of (1) and (3) firing on unrelated timers, they now occur at a synced offset: we start rendering at a specific time relative to vsync.

Resample Touch Input Events Based on Vsync

With Silk, we can enable touch resampling, which improves smoothness while tracking your finger. Since I've already gone over what touch resampling is quite a bit, I'll leave this short. With Silk, we can finally enable it!

Align Composites with Hardware Vsync

Finally, the last part of Silk is aligning composites with hardware vsync. Compositing takes all the painted content and merges it together to create a single image you see on the display. With Silk, all composites start right after a hardware vsync occurs. This has actually produced a rather nice side benefit: reduced composite times. See:

Within the device driver on a Flame device, there's a global lock that's grabbed close to vsync intervals. This lock can take 5-6 ms to acquire, greatly increasing composite times. However, when we start a composite right after a vsync, there is little contention for the lock. Thus we can shave off the wait, reducing composite times quite a bit. Not only do we get smoother animations, but also reduced composite times and therefore better battery life. What a nice win!

With all three pieces, we now have a nice strict ordering of the rendering pipeline. We paint and send the updated content to the Compositor within 16.6 ms. At the next vsync, we composite the updated content. At the vsync after that, the frame will have gone through the rendering pipeline and be displayed on the screen. Keeping this order reduces jank because we reduce the chances that the timers schedule each step at a bad time. For example, in the current implementation without Silk, the best case is that a frame is painted and composited within a single 16.6 ms frame, which is great. However, if the next frame takes two frames instead, we just created extra jank, even though no stage in the pipeline was really slow. Aligning the whole pipeline to create a strict ordering reduces the chances that we mis-schedule a frame.

Here's a picture of the rendering pipeline without Silk. We have Composites (3) on the bottom of this profile. We have painting (1) in the middle, where you see Styles, Reflow, Displaylist and Rasterize. We have Vsync, which are those small little orange boxes at the top. Finally, we have Layer Transactions (2) at the bottom. First, compositing and painting are not aligned, so animations will be at different positions depending on whether they are on the main thread or the compositor thread. Second, we see long composites because the compositor is waiting on a global lock in the device driver. Lastly, it's rather difficult to read any ordering or see if there is a problem without deep knowledge of why and when things should be happening.

Here is a picture of the same pipeline with Silk. Composites are a little shorter, and the whole pipeline only starts at vsync intervals. Composite times are reduced because we start composites exactly at vsync intervals. There is a clear ordering of when things should happen and both composites and painting are keyed off the same timestamp, ensuring smoother animations. Finally, there is a clear indicator that as long as everything finishes before the next Vsync, things will be smooth.

Overall, Silk hopes to create a smoother experience across Firefox and the web. Numerous people contributed to the project. Thanks to Jerry Shih, Boris Chou, Jeff Hwang, Mike Lee, Kartikaya Gupta, Benoit Girard, Michael Wu, Ben Turner, and Milan Sreckovic for their help in making Silk happen.

Android's Touch Resampling Algorithm

After digging through the internet and with some advice from Jerry Shih, I looked into Google Android's touch resampling algorithm. For those interested, it is located here. Android's touch resampling algorithm is amazing, and hats off to the Googler who figured it out. Android uses a combination of touch extrapolation and touch interpolation. Touch interpolation means we take two touch events and create a single touch event somewhere in the middle of them. Touch extrapolation means we take two touch events and create a single touch event somewhere ahead of the last one, predicting where a touch event will be. Let's take a look at our standard input from both the LCD display refresh rate of 60 hz and the touch screen scan rate of 100 hz. (For those who need catching up, please read the previous post.)

We have touch input events coming in every 10 milliseconds, each moving by 10 pixels, and a display refresh vsync event every 16.6 milliseconds. Android's algorithm creates a new timing event called the sample time, which is always 5 milliseconds behind the vsync event. Thus, the first sample time is at time st=11 ms (16 ms vsync - 5 ms), the next at st=27 ms (32 - 5), and the next at st=43 ms (48 - 5). Let's take a look at the same graph with the sample time:

The sample time is used to determine whether touch events are extrapolated or interpolated. We only ever look at the last two real touch events, never at a sampled event. If the last touch event occurs AFTER the sample time but BEFORE the vsync time, we interpolate. This happens, for example, at vsync t=32 ms: the sample time is st=27 ms, and we have two touch events, one at t1=20 ms and one at t2=30 ms. Since the last touch event, t2, is at 30 ms, after the 27 ms sample time, we interpolate these two touch events. If the last touch event occurs BEFORE the sample time, we extrapolate. For example, at vsync event t=48 ms, we have a sample time of st=43 ms. The last touch event is at t=40 ms, before the sample time, so we use the last two touch events at t1=30 ms and t2=40 ms to extrapolate the touch event for vsync t=48. Note that the comparison is against the sample time, not the vsync time.
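The decision rule can be sketched in a few lines of Python (`resample_mode` is a made-up helper name; Android's real implementation lives in C++ in its input pipeline):

```python
SAMPLE_OFFSET_MS = 5  # Android samples 5 ms behind each vsync

def resample_mode(vsync_ms, last_two_touch_times):
    """Decide between interpolation and extrapolation for one vsync,
    given the timestamps of the last two real touch events."""
    sample_time = vsync_ms - SAMPLE_OFFSET_MS
    # A touch newer than the sample time means the sample time sits
    # between the two touches -> interpolate. Otherwise both touches
    # are older than the sample time -> extrapolate past them.
    if max(last_two_touch_times) > sample_time:
        return "interpolate"
    return "extrapolate"

print(resample_mode(32, [20, 30]))  # sample time 27, touch at 30 -> interpolate
print(resample_mode(48, [30, 40]))  # sample time 43, both older -> extrapolate
```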


Let's take a look at the interpolation case. It is not a simple midpoint interpolation; it takes into account when the touch event occurs relative to the sample time. First, we calculate the amount of time that passed between the last two touch events; let's call this the touch time difference. This should be relatively stable per device, and in our case is always 10 milliseconds. Next, we calculate the time difference between the sample time and the touch event that is BEFORE the sample time (the touch sample difference). For our example at vsync t=32, the sample time is 27 ms and the first touch event before the sample time is at t=20 ms, so 27 ms sample time - 20 ms touch time = 7 ms. Next, we create a variable called alpha, which is the touch sample difference divided by the touch time difference: 7 ms / 10 ms = 0.7. Finally, we do linear interpolation with this alpha value between the touch event prior to the sample time and the touch event after it. Our two touch events were at time t=20 with displacement d=20, and t=30 with displacement d=30. We start from the first touch event before the sample and add an interpolated displacement to it: d = 20 + (30 - 20) * alpha = 20 + (10 * 0.7) = 27. Thus at vsync time t=32, we send a touch event with a displacement of d=27. A larger alpha means we lean more towards the last touch event; a smaller alpha means we lean more towards the first touch event. Here it is in equation form:

Here is the general equation. The LastTouch refers to the touch ahead of the SampleTime. The FirstTouch refers to the touch before the SampleTime.

Let's take a look at another example, at vsync time t=80. We have two touch events, one at time t=70 and another at time t=80. Our sample time here is 80 ms - 5 ms = 75 ms. Since we have one touch event, t=80, that occurs after the sample time, we interpolate. Here is the work:

We see that we have a final result of displacement d=75. That's it for interpolation.
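Here's the interpolation step as a small Python sketch (a hypothetical `interpolate` helper mirroring the formula above), checked against both worked examples:

```python
def interpolate(sample_time, first_touch, last_touch):
    """Interpolate a displacement at `sample_time`, which lies between
    the two real touches. Each touch is a (time_ms, displacement_px) pair."""
    t_first, d_first = first_touch
    t_last, d_last = last_touch
    touch_time_diff = t_last - t_first         # ~10 ms on a 100 Hz panel
    touch_sample_diff = sample_time - t_first  # how far past the first touch
    alpha = touch_sample_diff / touch_time_diff
    return d_first + (d_last - d_first) * alpha

# vsync t=32: sample time 27, alpha = (27 - 20) / 10 = 0.7 -> d = 27
print(interpolate(27, (20, 20), (30, 30)))
# vsync t=80: sample time 75, alpha = (75 - 70) / 10 = 0.5 -> d = 75
print(interpolate(75, (70, 70), (80, 80)))
```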


Touch extrapolation occurs when the last touch event is BEFORE the sample time. This happens at vsync time t=48. Our sample time here is 48 - 5 = 43 ms. We have two touch events, one at time t=30 and another at time t=40 ms. Since both are before 43 ms, we extrapolate. The logic works similarly to touch interpolation, with a few differences. We still have the same touch time difference between the two touch events, which is always 10 ms. Next, we calculate the touch sample difference, which is now the last touch event minus the sample time, so we expect a negative number: the last touch event t=40 minus the sample time st=43 gives -3. Next we calculate alpha the same way, touch sample difference / touch time difference, which in this case is (-3 / 10) = -0.3. Finally, we again use the same linear interpolation equation, but because alpha is negative, we extrapolate. In addition, we swap the operands, subtracting the first touch from the last touch, and set the starting displacement to be the last touch. Thus, unlike the interpolation algorithm, we start with the last touch event and add a displacement to that. Our final result is displacement d = 40 + (30 - 40) * -0.3 = 43. Thus we extrapolated 3 pixels in this case. Here's all the math:

Here is the general extrapolation equation. The Last Touch refers to the most recent touch. First touch refers to the earlier touch event. We still use the last two touch events to extrapolate.
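And the matching extrapolation step as a Python sketch (again a hypothetical helper, not Android's actual code):

```python
def extrapolate(sample_time, first_touch, last_touch):
    """Extrapolate a displacement when both touches are older than
    `sample_time`. Each touch is a (time_ms, displacement_px) pair."""
    t_first, d_first = first_touch
    t_last, d_last = last_touch
    touch_time_diff = t_last - t_first        # ~10 ms between touch scans
    touch_sample_diff = t_last - sample_time  # negative: sample is ahead
    alpha = touch_sample_diff / touch_time_diff
    # Operands are swapped relative to interpolation: start from the last
    # touch and push past it by a fraction of the last movement.
    return d_last + (d_first - d_last) * alpha

# vsync t=48: sample time 43, alpha = (40 - 43) / 10 = -0.3 -> d = 43
print(extrapolate(43, (30, 30), (40, 40)))
# vsync t=128: sample time 123, alpha = (120 - 123) / 10 = -0.3 -> d = 123
print(extrapolate(123, (110, 110), (120, 120)))
```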


Let's do one more extrapolation case, at vsync time t=128 ms. Here is the work for a final displacement of d=123 pixels.

Here are the final displacement figures if we do interpolation and extrapolation across our original input:

Wow. Just wow. Outside of the first frame, where we have no extra touch events, it hits the ideal case of moving a displacement of 16 pixels every 16 milliseconds for a perfectly smooth scroll. It has a Frame Uniformity standard deviation of 0. What's even more amazing is that these formulas for touch interpolation and extrapolation work across different touch screen refresh rates. For example, Mozilla's Flame reference device has a touch refresh rate of every 13 ms instead of every 10 ms like the Nexus 4, and the algorithm still produces a perfectly smooth scroll. Even better, it accounts for touch driver noise: real devices don't send touch events at perfectly uniform 10 ms intervals, they jitter, so you get touch events at 9.9 ms or 10.5 ms.

How does it feel in real life? Since we actually extrapolate touch events, scrolling feels more responsive, since it seems to keep up with your finger better. In addition, flings feel faster, making the whole system feel not necessarily smoother but more responsive. While tracking your finger, or doing a slow scroll, we get the additional benefit that scrolling feels smoother as well. In all aspects, compared to the other touch algorithms, this is superior. On fast scrolls it keeps up with your finger better than the previous sampling algorithms. On slow scrolls, it smooths out the touch events, creating a smooth experience. The only downside is that if you change your finger's direction, it will be a bit more jarring if we extrapolate and then change directions. However, in real life I only rarely noticed it, and only because I was looking for it. Overall, a very nice win.


Smooth Scrolling, Frame Uniformity, Touch Interpolation and Touch Responsiveness

I hope you have coffee, because this is going to be a long one. When scrolling a webpage on a device, how smooth is it? Smoothness is how little jerk you see while scrolling the webpage. The big problem with this measurement is that it is somewhat subjective; a page just "feels" smoother. However, we can actually measure it with a metric called Frame Uniformity. Frame Uniformity is a measure of how smoothly a page scrolls, and to have high Frame Uniformity, lots of things have to go right. The essential measure is:

If we have a constant drag of [A] px per unit of time, the screen should also have a constant drag of [A] px per unit of time.

For example, if my magic hand could scroll a web page at a constant and consistent rate of 1000 pixels per second (I'm human and this doesn't work, but we have tools), the webpage on the screen should scroll at a constant rate of 1000 pixels per second. Since a screen refreshes at 60 hz, or 60 frames per second, that means at every screen refresh the screen should scroll by 16.6 pixels per frame. This would be a perfectly smooth scroll and get the highest mark on Frame Uniformity.

The vertical bars are vsync events, which is when the display refreshes and shows a new frame. The number above each vertical bar represents the absolute position shown to the user at that vsync event. The horizontal bar represents the displacement, or the amount a user perceives to have scrolled, between the two frames. Thus, ideally we'd have a displacement in increments of 16 pixels every 16 milliseconds, with a difference of 16 pixels per frame. (Rounded to ints for simplicity.) Visually, if each Firefox logo represents one frame, a perfectly smooth scroll looks like this:

Unfortunately, in the real world it is very difficult to achieve a perfect 16.6 pixel scroll with a 16.6 constant drag rate. There is skew all along the system: in the hardware, the touch driver, the system processing, etc. My understanding is that no device on the market achieves perfect Frame Uniformity. However, low Frame Uniformity equals janky or jerky scrolling, which is not something we want when trying to deliver a high quality product. So what are the problems that create bad Frame Uniformity, and how do we solve them?

The biggest problem with Frame Uniformity and smooth scrolling is that the display and the touch screen refresh at different rates. The display refreshes at 60 hz, or every 16.6 ms, whereas most commodity touch screens refresh at 100 hz, or every 10 ms, meaning the touch screen scans for touch input every 10 ms. Note, these are estimates and have skew: the touch screen could scan at 9.9 ms, or 10.1 ms, or 9.5 ms, depending on the quality of the hardware. The better the hardware, the lower the skew. Thus, we have an uneven distribution of touch events versus display refreshes, which is a big source of jank.

Consider the average case. The display refreshes every 16.6 ms, and the touchscreen perfectly dispatches a touch move scroll of 10 pixels every 10 ms. I will round the 16.6 ms to 16 ms just to make things easier. Our input from the hardware gives us a graph like this:

We get a new touch event, in displacement increments of d=10 pixels, every 10 ms. The vsync event is when the hardware display refreshes the screen. What we see is that in some cases we only have to process one touch event, e.g. at vsync time t=16 ms, with a displacement of 10 pixels. At the next vsync event, at time t=32, we have two touch events: one with a displacement of d=20 px and another with a displacement of d=30 px. At this one vsync event we have to process two touch events but can only display one!

Since we can only process one touch event per frame, we could just take the last touch event, with displacement d=30, at time t=32. But what does this mean? It means at the first vsync event of t=16, we scrolled by 10 pixels. At the next vsync event of t=32, we scrolled by 20 pixels. (At the first vsync at t=16 we were at 10 px, then at 30 px at t=32; 30 - 10 = 20 pixels of scroll difference.) If we carry this forward across the next few frames, we start to see a pattern. At the vsync event of t=48, we have one touch event of displacement 40, so we move to pixel 40. This means a difference of (40 - 30 = 10), or 10 pixels in one frame. So in the first frame we moved 10 pixels, the second frame 20 pixels, and the third frame 10 pixels. Remember that the ideal was a constant 16.6 pixel drag per frame. What we have instead is an alternating pattern of 10 pixels, to 20 pixels, to 10 pixels. Here is the whole extrapolation:

This touch sequence visually looks like this to the user:

This isn't smooth at all! It's pretty jerky! In fact, it measures out to a Frame Uniformity standard deviation of 5, meaning the standard deviation between frame displacements across this interval is 5. The ideal case would be 0: no deviation and perfectly smooth. Not too bad, but not great. What do we do?!
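The whole "take the last touch event" experiment can be simulated in a few lines of Python (touch positions and vsync times as in the graphs above; the variable names are mine):

```python
from statistics import stdev

# Touches arrive every 10 ms moving 10 px each; vsyncs land every 16 ms.
touches = [(t, t) for t in range(10, 170, 10)]  # (time_ms, displacement_px)
vsyncs = list(range(16, 161, 16))

shown = []
for prev_v, v in zip([0] + vsyncs, vsyncs):
    frame = [d for (t, d) in touches if prev_v < t <= v]
    shown.append(frame[-1])  # naive: just take the frame's last touch

diffs = [b - a for a, b in zip(shown, shown[1:])]
print(diffs)                   # [20, 10, 20, 20, 10, 20, 10, 20, 20]
print(round(stdev(diffs), 2))  # 5.0 -- the Frame Uniformity figure above
```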

Touch Interpolation

This is where touch interpolation comes into play. The idea of touch interpolation is to smooth out the input's refresh rate and match it to the display's refresh rate, averaging the touch events and coalescing them into one touch sample. This has two benefits. First, the system won't have to respond to touch events that won't be shown on the screen, which reduces system load. Second, the touch data can be manipulated to smooth out the choppiness on the screen, increasing Frame Uniformity. The first question, though, is why touch interpolation and not touch extrapolation? Touch extrapolation tries to predict where the touch event will be, creating one touch event with predicted touch behavior. The problem with extrapolation is that when a user swipes back and forth quickly, we can overscroll past what the user actually touched. The good thing about touch interpolation is that we never create a touch event the user's actual finger wasn't on: every interpolated touch lies on a path the user's finger actually traced.

Great, so we want touch interpolation, what do we do? Do we just take two touch inputs and make them into one? Seems simple enough right? Let's see how that works out.

Midpoint Touch Interpolation

The first algorithm we introduce is a basic touch interpolation algorithm. If we have two touch events in one vsync interval, we take the midpoint and coalesce the two touches into one. If we have one touch event in a vsync interval, we just dispatch that one touch event. So, for example, at vsync time t=16 we dispatch a touch event with displacement d=10. At vsync t=32, we take the midpoint of the touch events [d=20, d=30] to create one touch event with displacement d=25. What we see is actually quite an improvement! We have a series of touch events dispatched with a difference of 15 pixels, with an occasional jump of 20 pixels. This has a Frame Uniformity standard deviation of 2.2, which is much better than 5! Overall, this should translate into a smoother scroll. Visually, it looks something like this:

Overall, a nice improvement in smoothness relative to the original problem. However, can we do better?
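A quick Python simulation of this midpoint algorithm over the same input reproduces the numbers above:

```python
from statistics import stdev

touches = [(t, t) for t in range(10, 170, 10)]  # 10 px every 10 ms
vsyncs = list(range(16, 161, 16))

samples = []
for prev_v, v in zip([0] + vsyncs, vsyncs):
    frame = [d for (t, d) in touches if prev_v < t <= v]
    # Two touches this frame -> take their midpoint; one touch -> as-is.
    samples.append(sum(frame) // 2 if len(frame) == 2 else frame[-1])

print(samples)  # [10, 25, 40, 55, 75, 90, 105, 120, 135, 155]
diffs = [b - a for a, b in zip(samples, samples[1:])]
print(round(stdev(diffs), 2))  # 2.2
```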

Midpoint of Last 2 Touches

Can we use previous touch events to smooth out the difference in frame displacement? Can we use the past to help us tweak the future to stabilize big changes? One example algorithm would be that if we do not have two touch events, we can use the previous frame's last touch event to create a sample for the current frame. Thus, we always use the last two touch events and use the midpoint to create an interpolated touch event for the current frame. We never interpolate with an interpolated touch, we just use the previous two real touches. If a vsync event only has one touch event, we use the current touch event and the last touch event from the previous frame. If the current vsync event has two touch events, we create a sample from the two current touch events. The intuition behind this is that if we have a big change, we can use the previous touch event to smooth out the big change and create a less noticeable jank. How does this look:

Visually, this looks like:

Interestingly, this has a standard deviation of 4.86, which is almost as bad as the current situation! It actually doesn't smooth anything out in this case and instead continues the alternating pattern of 10 pixels and 20 pixel displacements we had before. Bummer.
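Simulating the last-two-real-touches variant in Python shows the same alternating pattern:

```python
from statistics import stdev

touches = [(t, t) for t in range(10, 170, 10)]  # 10 px every 10 ms
vsyncs = list(range(16, 161, 16))

samples = []
for v in vsyncs:
    seen = [d for (t, d) in touches if t <= v]
    # Midpoint of the last two *real* touches, whichever frames they fell in.
    samples.append(sum(seen[-2:]) // 2 if len(seen) >= 2 else seen[-1])

print(samples)  # [10, 25, 35, 55, 75, 85, 105, 115, 135, 155]
diffs = [b - a for a, b in zip(samples, samples[1:])]
print(round(stdev(diffs), 2))  # 4.86
```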

Midpoint of Last Two Touches With Sampling

What about trying to use a previous interpolated touch to smooth out the touch events? This version always uses the last two touch events and creates a single touch using the midpoint if they exist in the current vsync. If the current vsync only has one touch event, we interpolate the current touch event plus the previous frame's sampled touch event. This is a slight variation of the Midpoint of the Last Two Touches algorithm in that we now incorporate previous samples, not just previous touch events. How does this look:

Wow, we made it worse! We see large jumps of up to 23 pixels and really small jumps of only 7 pixels. The standard deviation between frames jumped to 7.27. Ok, we probably shouldn't do extra damage here.
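The same simulation style, for this variant, reproduces the 7 and 23 pixel jumps:

```python
from statistics import stdev

touches = [(t, t) for t in range(10, 170, 10)]  # 10 px every 10 ms
vsyncs = list(range(16, 161, 16))

samples = []
for prev_v, v in zip([0] + vsyncs, vsyncs):
    frame = [d for (t, d) in touches if prev_v < t <= v]
    if len(frame) == 2:
        samples.append(sum(frame) // 2)  # two touches: their midpoint
    elif samples:
        # one touch: blend it with the previous frame's *sample*
        samples.append((samples[-1] + frame[-1]) // 2)
    else:
        samples.append(frame[-1])  # very first frame

print(samples)  # [10, 25, 32, 55, 75, 82, 105, 112, 135, 155]
diffs = [b - a for a, b in zip(samples, samples[1:])]
print(round(stdev(diffs), 2))  # 7.27
```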

Last Sample and Last Touch

In this last algorithm, we touch interpolate the previous frame's sample and the current frame's latest touch event. If we have two touch events in a single vsync, we use the latest touch event plus the previous frame's sample, ignoring the middle touch event. If the vsync has one touch event, we touch interpolate the current frame's touch event with the previous frame's resampled touch. How does this look:

Wow, that looks a bit smoother, huh? Interestingly, it lags a frame behind in the beginning, since we resample the previous frame, and we see a nice oscillating frame displacement of ~13-17 pixels. It improves the Frame Uniformity standard deviation to 3.05.
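Here's the LSLT variant in the same Python simulation style (with integer midpoints, which matches the figures above):

```python
from statistics import stdev

touches = [(t, t) for t in range(10, 170, 10)]  # 10 px every 10 ms
vsyncs = list(range(16, 161, 16))

samples = []
for prev_v, v in zip([0] + vsyncs, vsyncs):
    latest = max(d for (t, d) in touches if prev_v < t <= v)
    if samples:
        # Blend the previous frame's *sample* with this frame's latest
        # touch, ignoring any middle touch event.
        samples.append((samples[-1] + latest) // 2)
    else:
        samples.append(latest)

print(samples)  # [10, 20, 30, 45, 62, 76, 93, 106, 123, 141]
diffs = [b - a for a, b in zip(samples, samples[1:])]
print(round(stdev(diffs), 2))  # 3.05
```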

If we evaluate all the algorithms using only the Frame Uniformity standard deviation metric, it looks like the basic Midpoint algorithm wins, right? It has the lowest standard deviation at 2.2. The next best, Last Sample and Last Touch (LSLT), has a standard deviation of 3.05, so it should be an easy call? Almost, if only it were that easy.

Evaluating Midpoint versus Last Sample and Last Touch and Touch Responsiveness

If we look closely at the Midpoint algorithm, what we see is a consistent drag of 15 pixels, one pixel behind the ideal 16 pixels per frame, with a jump up to 20 pixels every 3-4 frames to catch up. When we look at the Last Touch and Last Sample algorithm, we see it shuffle between 15, 17, 14, and 17 pixels. The LSLT algorithm is better at keeping up with the ideal 16 pixels and doesn't fall behind as quickly.

Visually, with the Midpoint algorithm, we'll see perfect smoothness then a single jank, then back to smoothness. However, since this occurs every 3-4 frames, or every 48 - 60 ms, this visually still looks like jank quite often. With the Last Touch and Last Sample algorithm, we have a constant jank of 1-3 pixels per frame offset from the ideal. However, visually, LSLT is smoother because the difference between frames is less. For example, with the Midpoint algorithm, we have one large jank of 5 pixels every 3-4 frames. With the Last Touch and Last Sample algorithm, we're off of the ideal by 1 pixel one frame, then 1 pixel again, then 2, then 1. The jank is amortized over a couple of frames.

Imagine watching a car driving in a perfectly straight line. If the car swerved just a little bit for one second and then went back to a perfectly straight line, you would easily notice the car drifting out of the lane. However, if the car was always alternating just a little bit, but was mostly straight, each individual change mostly indiscernible, it would seem smoother. This is the difference between the two algorithms.

Numerically, this is an interesting difference between the Last Sample and Last Touch (LSLT) algorithm and the Midpoint algorithm. Remember that the ideal displacement is in increments of 16 pixels per frame. Let's take a look at the absolute displacement of the ideal case, the Midpoint algorithm, and the Last Sample and Last Touch algorithm:

Ideal:    [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
Midpoint: [10, 25, 40, 55, 75, 90, 105, 120, 135, 155]
LSLT:     [10, 20, 30, 45, 62, 76, 93, 106, 123, 141]

Hmm, it looks like the Midpoint algorithm is much better at tracking the ideal case. However, the numbers from Last Sample Last Touch look pretty interesting. They look close to increments of 16, just one frame behind. Let's take a look again by trailing one frame behind:

Ideal: [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
LSLT:  [20, 30, 45, 62, 76, 93, 106, 123, 141]

Wow, that is really close to the ideal, isn't it! Let's see how much closer. Let's take the difference from the ideal for each frame. For the Midpoint algorithm, we'll match it with the current frame. For the LSLT algorithm, we'll match it one frame behind. For example, for the second frame, we take 32 - 25 = 7 pixels of difference for the Midpoint algorithm, and (32 - 30) = 2 pixels of difference for LSLT.

Midpoint: [6, 7, 8, 9, 5, 6, 7, 8, 9, 5]. Average = 7 px.
LSLT:     [4, 2, 3, 2, 4, 3, 6, 5, 3]. Average = 3.6 px.

With the LSLT algorithm, we're very close to an ideal scroll, just one frame behind, averaging 3.6 pixels away from the ideal. In addition, if we measure Frame Uniformity disregarding the first two frames, we see an improvement there as well: Frame Uniformity for the Midpoint algorithm rises from 2.2 to 2.44, whereas LSLT drops from 3.05 to 1.86, outperforming the Midpoint algorithm. Thus the LSLT algorithm has a few interesting characteristics. The first couple of frames will be worse, but the middle and end will be much better. Since an average scroll takes a couple hundred milliseconds, a user would only see a couple of early frames as slow, and the rest as much better. While in the middle of a scroll, we will track your finger pretty well, and pretty smoothly. The trade-off is that we add 16.6 ms of latency and are somewhat behind in terms of displacement. Overall, scrolling will be smooth, but it will feel a little less responsive when tracking your finger.
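Working out the error lists in Python (note the sixth LSLT entry comes to 3, since 96 - 93 = 3):

```python
ideal    = [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
midpoint = [10, 25, 40, 55, 75, 90, 105, 120, 135, 155]
lslt     = [10, 20, 30, 45, 62, 76, 93, 106, 123, 141]

# Midpoint is compared frame-for-frame; LSLT is compared one frame behind.
mid_err  = [i - m for i, m in zip(ideal, midpoint)]
lslt_err = [abs(i - s) for i, s in zip(ideal, lslt[1:])]

print(mid_err)                                  # [6, 7, 8, 9, 5, 6, 7, 8, 9, 5]
print(sum(mid_err) / len(mid_err))              # 7.0
print(lslt_err)                                 # [4, 2, 3, 2, 4, 3, 6, 5, 3]
print(round(sum(lslt_err) / len(lslt_err), 1))  # 3.6
```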

Here is a visual comparison of the last few frames comparing the ideal, the Midpoint algorithm, and the Last Sample Last Touch algorithm in terms of smoothness.


Does any of this matter? Does a Frame Uniformity decrease from 5 to 2 actually mean anything? What about adding one frame of latency? Below is a video of the Last Sample Last Touch interpolation algorithm on a Flame device, with and without touch interpolation. Make sure you push the HD button. *Hint: Try to see which one you think is smoother; the answer for which device has touch interpolation is at the bottom. It's difficult to see the "smoothness" in a video, but in person it is very noticeable. Since I'm swiping my finger back and forth, you can see the touch latency as well. Is it worth it to trade smoothness for latency? I'm not sure, but at least it's smooth.

The last question: is it noticeable in day-to-day use? Smoothness while tracking your finger is dramatically improved. Flinging and in-progress scrolls are animations, so they're only somewhat affected; scrolls will appear slower because we calculate a slower velocity from the resampled input. Smoothness while a scroll slows down is unaffected, as that is also an animation. In terms of smoothness while tracking your finger, though, it's pretty awesome.

Make sure to push the HD button for moar HDs!


All the algorithms on one giant graph:

* The device on the left has touch interpolation.

A New Wall

Last time, we destroyed the wall. This time, we rebuild it. I'm so glad I hired someone to rebuild the wall. I would have died or wrecked the wall had I done it myself; there were so many little things I would never have paid attention to.

  1. If you destroy a wall, it's easy to move electrical outlets wherever you want. You can also remove electrical outlets.
  2. Each drywall piece can be textured to be flat or have some bumps. I just matched my current wall which has some texture, but I didn't think about it.
  3. Drywall has to be painted. Sounds stupid, but I didn't think about painting before. I thought you just throw up some wall.
  4. Since some portions of the wall are shorter than they were originally, some of the subfloor is exposed. Now I have to hire someone to rebuild and repair parts of the floor.
  5. Drywall also has to be finished so that paint can actually go on top of it. You can't just paint drywall.

Onto some pictures.

Roxul Safe n Sound added in between the studs.


One Whisperclip Installed.


The Hat Channel Installed on the clips.


QuietPutty behind the electrical outlets. Interestingly, the neighbor had no electrical outlets on his side.


One layer of drywall up.


Two Layers of drywall up with green glue. The red lines are where the hat channels were. We missed a couple of times.


Drywall textured and finished. Now ready to paint.


Wow. Such Checkerboard

Since Async Pan and Zoom (APZ) has landed on Firefox OS, scrolling is now decoupled from the painting of the app we're scrolling. Because of the decoupling, when the CPU is under heavy load, we sometimes checkerboard. What do the past two sentences mean and how do we fix it? Stay with me here as we go down the rabbit hole.

What is Checkerboarding?

Checkerboarding occurs when Gecko is unable to paint the viewable portion of a webpage while we're scrolling. Instead, we paint a solid color in place of the content. Visually, we get something like this:

The blank white space at the bottom underneath 'David Ruffin' is checkerboarding.


Why does this happen?

Async Pan and Zoom

Asynchronous Pan and Zoom (APZ) is a major new feature in Firefox OS that improves scrolling, panning, and zooming. Before APZ, scrolling was a synchronous affair. Every time a user touched the screen to scroll, an event would be sent to the browser essentially saying 'scroll by X amount'. Nothing else in the system could occur while the CPU painted in the newly scrolled-in region. If the CPU was busy doing something else, you couldn't scroll until it caught up, and scrolling would be janky.

APZ changes that by using a separate Compositor thread. Every time a user touches the screen, an event is fired off to the Compositor thread that says 'scroll by X amount'. The Compositor thread is then free to scroll by X amount without being blocked by other work. For this to work smoothly, the graphics system has to overpaint what the user is currently seeing. The idea is that the user can scroll around a displayport that is larger than what the user is currently seeing (the viewport). When the user scrolls near the edge of the displayport, we repaint the displayport and the user keeps scrolling as they wish. The painting of the displayport occurs on the main thread, but the scrolling occurs on the Compositor thread. Thus, we should be able to have smoother scrolling. It's all kind of difficult to explain in text, so let's check out this video:
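To make the displayport/viewport relationship concrete, here's a minimal sketch of the condition under which we checkerboard. The rect tuples and function names are hypothetical, plain-Python stand-ins, not Gecko's real types:

```python
def contains(outer, inner):
    """True if the inner rect sits entirely inside the outer rect.
    Rects are (x, y, width, height) tuples in pixels."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return (ix >= ox and iy >= oy and
            ix + iw <= ox + ow and iy + ih <= oy + oh)

def checkerboards(displayport, viewport):
    """We checkerboard whenever the visible viewport escapes the
    painted displayport before the next repaint lands."""
    return not contains(displayport, viewport)

# Displayport painted from y=0 to y=1435; the viewport is 320x480.
painted = (0, 0, 320, 1435)
print(checkerboards(painted, (0, 900, 320, 480)))   # False: still covered
print(checkerboards(painted, (0, 1000, 320, 480)))  # True: bottom 45px blank
```

As long as the main thread repaints a new displayport before the user scrolls the viewport past the old one's edge, the condition stays false and the user never sees blank content.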

What we see is how the graphics subsystem works when we're scrolling. The initial red box is the current viewport, or what the user is seeing. The darker brown box is the whole webpage. As we scroll, we see a yellow box, which is the requested displayport. This is a request to the graphics system to render and paint that portion of the webpage. Finally, the green box shows what we've actually painted. As we scroll, we're constantly requesting new displayports (yellow boxes), and we're flashing green as we paint in new displayports. These displayports should always be following the viewport, and in theory the viewport always fits inside the green box. When it doesn't, we get sadface, or checkerboarding. This occurs around frame ~440 (top left number), where the top of the red box is just slightly outside the green box. Not quite the best visual appearance ever, is it? Finally, we know what checkerboarding is and why it happens. How do we fix it?


The Graphics Pipeline

Figuring out how to fix something requires an understanding of what's actually happening. Your HTML page goes through the following steps:

  1. The HTML is parsed into a Content Tree, mostly a 1:1 representation of your document.
  2. The Content Tree is translated into rectangles, creating the Frame Tree.
  3. Each node in the Frame Tree is assigned a Layer, creating a Layer Tree.
  4. The Layer Tree is painted, and the content is sent to the Compositor in another thread.
  5. The Compositor renders each layer to create the single image you see on the screen.

Steps (1) and (2) occur the first time the page is loaded. Steps 3-5 occur at every single frame. For smooth scrolling, we paint 60 frames per second, which means we might have to do steps 3-5 every 16.6 ms consistently. When we checkerboard, it means we're not doing steps 3-5 within our 16.6 ms time frame. Since steps 1 and 2 occur only once, if we want to optimize smooth scrolling, we have to optimize steps 3-5. As with compilers, optimizing early in the pipeline usually yields a much bigger win than optimizing at the end. For example, the Compositor's job is to go through each layer and draw pixels; if we can optimize away a layer, the Compositor doesn't have to do any work for it. The main way to optimize checkerboarding and smooth scrolling is to optimize the Layer Tree. If we optimize a layer in the Layer Tree, we don't have to paint anything on the main thread, we send less data over IPC to the Compositor thread, and the Compositor has to do less work. Now, what is a Layer? What's a Layer Tree? And what is the '????' before we Profit?
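The per-frame budget can be sketched as a simple check. The step costs below are hypothetical numbers for illustration; Gecko doesn't literally sum them like this, but the constraint is real:

```python
VSYNC_MS = 1000 / 60  # ~16.6 ms per frame at 60 Hz

def frame_misses(step_costs_ms):
    """Steps 3-5 (layerize, paint, composite) must all finish
    inside one vsync interval. If they don't, the frame is late,
    the previous frame stays on screen, and the user sees jank
    or checkerboarding."""
    return sum(step_costs_ms) > VSYNC_MS

print(frame_misses([3.0, 7.0, 4.0]))   # False: 14 ms fits the budget
print(frame_misses([3.0, 12.0, 4.0]))  # True: 19 ms blows the budget
```

This is why eliminating a layer pays off every frame: it shrinks the paint and composite terms of that sum 60 times a second.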

Layerize All the Things!

Every node in the Frame Tree is assigned to a Layer. There are different layers for different types of content, such as an image, a video, a scrollable region, etc. All the Layers make up the Layer Tree. The idea of a layer comes from painting by hand, where oil painters used layers for different elements in their paintings. The decision of which layer a piece of content is assigned to happens in nsFrame::BuildDisplayList. At each frame, Gecko decides to invalidate certain parts of the Layer Tree and assigns those elements to a new layer. What does a Layer Tree look like?


We see that there are a ton of different types of Layers, but there are a few that stand out.

  1. The RefLayer - This is the layer that contains the content app or child process. Everything below it belongs to the content. Everything above it refers to the parent.
  2. ContainerLayers - These layers just contain other layers. Sometimes these are Scrollable layers, as identified by a ScrollID.
  3. ThebesLayers - Layers that have an allocated buffer that we can draw to. We mostly want to optimize these!
  4. ColorLayers - Blank layers made of a single color (usually specified with the CSS property background-color).
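The layer kinds above can be modeled as a toy tree, which also shows why we care about ThebesLayers in particular. These are illustrative Python classes, not Gecko's real C++ layer classes:

```python
class Layer:
    """Toy layer node: a kind tag plus child layers."""
    def __init__(self, kind, children=()):
        self.kind = kind          # "Ref", "Container", "Thebes", "Color"
        self.children = list(children)

def count_kind(layer, kind):
    """Walk the tree counting layers of one kind. ThebesLayers own
    pixel buffers, so fewer of them means less painting on the main
    thread and less IPC traffic to the Compositor per frame."""
    n = 1 if layer.kind == kind else 0
    return n + sum(count_kind(c, kind) for c in layer.children)

# A parent-process container holding a cheap ColorLayer and a
# RefLayer whose content app has two paintable ThebesLayers.
tree = Layer("Container", [
    Layer("Color"),
    Layer("Ref", [
        Layer("Container", [Layer("Thebes"), Layer("Thebes")]),
    ]),
])
print(count_kind(tree, "Thebes"))  # 2
```

Collapsing those two ThebesLayers into one (or replacing one with a ColorLayer) is exactly the kind of optimization the rest of this section is about.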

To see a text dump of a Layer Tree, enable the layers.dump preference, or use the developer menu in Gaia. Our job in reducing checkerboarding is to make sure we (1) layerize correctly and (2) make as few changes to the Layer Tree as possible at each frame. How do we know when we're doing (1) or (2)?

Layerizing Correctly

There are a few things to check to see if an app is layerizing correctly. These are generally just quick checks that we're doing OK. The first is to enable the FPS counter in the developer tools. The rightmost number on the display tells us how many times we're overdrawing each pixel, as a percentage. In an ideal world, we would draw each pixel only once, so the number would be 100. If the number is 200, it means we're drawing each pixel twice. While you're scrolling, if the number is ~300 - ~400 for an app, we're probably doing all right. Anything over ~500 should be an alarm that we're not layerizing correctly.
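The overdraw number can be sketched as total pixels drawn per frame as a percentage of screen pixels. This is a hypothetical back-of-the-envelope model of what the counter reports, not its actual implementation:

```python
def overdraw_percent(layer_areas_px, screen_px):
    """Pixels drawn per frame as a percentage of the screen.
    100 means each pixel is drawn exactly once; ~500+ suggests
    redundant layers painting on top of each other."""
    return 100 * sum(layer_areas_px) / screen_px

screen = 320 * 480  # Flame-class device resolution

# Three full-screen opaque layers stacked on top of each other:
print(overdraw_percent([screen, screen, screen], screen))  # 300.0
```

This is also why opaque layers matter: if the topmost full-screen layer is opaque, the ones behind it don't need to be drawn at all, and the number drops back toward 100.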

The other option is to enable Draw Layers Borders. If we have a lot of random layers all over the place while we're scrolling, or if the layers don't make any sense, we're building bad layers. Here's an old example from the Settings app where we were layerizing pretty badly. See the random green boxes around Sound? Bad.

Over Invalidation

The next thing to check is whether the app is over-invalidating the Layer Tree. In those cases, we'll be painting something every frame when we don't have to. For example, if a piece of content is a static block with some text that never changes, it's useless work to keep repainting the same text. You can check whether your app is overpainting by enabling the 'Flash Repainted Area' option. Generally, unless content is changing, it shouldn't be flashing. In addition, with APZ, while scrolling, only the newly scrolled-in content should be flashing. If everything is flashing all the time, check your app. Some good examples of over-invalidation are over here.

You should check that we're both layerizing correctly and not over invalidating the app. If both look good, you might have to read a layer tree.

Reading a Layer Tree

Reading the Layer Tree is a dense and time-consuming process, especially at the beginning. Skip this if you're busy and just want to fix your app quickly. Otherwise, grab some coffee and join me down the rabbit hole. Here is an example Layer Tree from the Settings app that helped with bug 976299.

Essentially what we're looking for is a Layer that is either (a) unused (b) not visible or (c) not the right dimensions. The key things to look for are the visible sections. These dimensions are in pixel units. On this device, we're looking at a width=320px, height=480px screen. Thus anything that isn't actually visible to your eyes shouldn't be in a layer if we're optimizing it correctly. Let's take this layer tree line by line.

  1. Layer Manager - Every Layer tree is managed by a Layer manager. Can't do anything here.
  2. Because we haven't seen a RefLayer yet (line 14), we're now in the parent process. We have a ContainerLayer that is dimensions width=320, height=480, starting at coordinate x=0, y=0 (the top left corner). X is the horizontal axis. Y is the vertical axis. We see this in the [visible=< (x=0, y=0, w=320, h=480); >] portion.
  3. We have a ThebesLayer, with x=0, y=0, w=0, h=0, and it is not visible. Ideally, we wouldn't have a Thebes Layer here, but since it has no dimensions and is not visible, we're good to go.
  4. Empty Space
  5. We have a ColorLayer. ColorLayers are a single background color layer. Here in this case it is a black background rgba=(0,0,0,1), the 1 here means opacity so it's opaque. It's also not visible, so we're good to go. Opaque is good because it means anything behind this layer isn't going to be visible, so we don't have to paint the items behind it.
  6. Another ContainerLayer that has the same dimensions. Since the layer in line (2) is just a non-visible rectangle, it's ok to have this one.
  7.  We have a ThebesLayer that is width=320, height=20px located at (0, 0). Since this is in the parent process, this is the status bar that tells you how much battery you have, etc.
  8. Every Thebes layer has some buffer attached to it to actually draw the data. The ContentHost tells us it has a buffer of size width=320, height=32, located at (0, 0). In the ideal world, the buffer size would be height=20, but 32 works because of the GPU.
  9. The ContentHost has a gralloced buffer that we're painting with OpenGL. Usually, for every ThebesLayer, you should see both a ContentHost (8) and a Gralloced Buffer (9).
  10. Another Container Layer! This time located at (0, 20) (vertically down 20 pixels, i.e. just below the status bar), with size width=320, height=460. The height is 460 because the 480px total screen height minus 20px for the status bar leaves 460.
  11. Another Thebes layer, not visible so we're ok.
  12. Empty
  13. A Color Layer that is the whole container layer (10). Color Layers are cheap, so it's ok. In the ideal world, we'd get rid of this too. But the important part of the Container Layer is line 14!
  14. RefLayer - This marks the beginning of the Content App, or the Settings App here. We see it starts at coordinate (0, 20), is 320x460.
  15. The Container Layer here starts at coordinate (0, 0), width=640, height=460. A few things to note here, since the ContainerLayer is inside the RefLayer, the (0, 0) is from the top left of the ContainerLayer in (10). That means, from the point of view of the whole screen, it's actually at (0, 20), width=640, height=460. In addition, it has a DisplayPort of size 640x460, which is the displayport associated with Async Pan and Zoom. Since our screen size is only width=320, we're allocating twice as much size as we need to. ALERT SHOULD BE GOING OFF HERE!
  16. We have a ThebesLayer that starts at x=320, y=0 (again, from the point of view of the ContainerLayer defined in (line 10)), so it's actually at coordinate (320, 20). The width=320, height=460. Essentially, we're having a layer that is horizontally shifted by 320 pixels. But our screen is only 320x480. We have a layer for something off screen! ALERT! See bug 976299.
  17. A Content Host for the Thebes Layer in 16, again at position (320, 0).
  18. The buffer for the Content Host and Thebes Layer.
  19. Another Container Layer! Yay! This time at (0, 0), width=320, height=460. This Layer is for the word 'Settings' and might be too big.
  20. Another ThebesLayer, (0, 0) 320x460. Again, in reality it's at (0, 20): because it's inside the ContainerLayer from line 10, its coordinates are relative to a space shifted down 20 pixels.
  21. Another Content Host.
  22. Another buffer for the Thebes Layer in line 20.
  23. Another Container Layer. However, this is the layer we're actually scrolling. First, we see that it has position (0, 50), width=320, height=1435. Why does it start at y=50? What's at (0, 0) in the Settings app? Oh, the word 'Settings'! Thus the layer at line 19 contains the word 'Settings', and this layer at line 23 is the scrollable area of the Settings app. The height=1435 is because of APZ. Remember the displayport being bigger than the viewport? Here's how big the displayport is: 1435 pixels. This height=1435 matches the displayport size [displayport=(x=0.000000, y=0.000000, w=320.000000, h=1435.000000)].
  24. A ThebesLayer for the whole displayport, hence 320x1435 in size. We also see that the starting coordinate is (0, 0), which is again relative to the ContainerLayer in line 23. So (0, 0) means the top left of the ContainerLayer in line 23, or (0, 70) in overall coordinate space: 20 from the status bar plus 50 from the word 'Settings'.
  25. A ContentHost for the ThebesLayer in line 24. The same size.
  26. A buffer for the ContentHost for the ThebesLayer in 24.

Whew. When we're optimizing the Layer Tree, we're looking for items that are allocated a layer when they shouldn't be, or layers of the wrong size. One thing to note is that every layer here is Opaque; if a layer were transparent, we'd see [component-alpha] instead of [OpaqueContent]. Opaque content is great because we don't have to paint anything behind it. If you were wondering why we added a bunch of background-colors into Gaia, this is it. So that's how to read a Layer Tree, line by line. Essentially, if you see a layer but don't see it on the screen, we can optimize a bit more. Every layer we eliminate helps the Compositor at every single frame, i.e. every 16.6 ms.
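The relative-coordinate rule used throughout this walkthrough can be sketched as a simple sum of offsets down the ancestor chain. This is a toy model for following the dump by hand, not Gecko's actual transform code:

```python
def screen_position(offsets):
    """Each layer's (x, y) is relative to its parent, so its screen
    position is the sum of the offsets from the root down to it."""
    x = sum(o[0] for o in offsets)
    y = sum(o[1] for o in offsets)
    return (x, y)

# RefLayer at (0, 20) -> ContainerLayer at (0, 0) -> ThebesLayer at (320, 0):
print(screen_position([(0, 20), (0, 0), (320, 0)]))  # (320, 20), line 16's off-screen layer
# RefLayer (0, 20) -> scrollable ContainerLayer (0, 50) -> ThebesLayer (0, 0):
print(screen_position([(0, 20), (0, 50), (0, 0)]))   # (0, 70), line 24's layer
```

Accumulating offsets like this is how you spot the line 16 problem: a screen x of 320 on a 320px-wide display means the entire layer is off-screen.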

Are We There Yet?

In theory, if we fix everything and we're not doing any other work while scrolling, we should not checkerboard. We've made a lot of progress in many of the apps. You can find the mother bug in bug 942750. Until then:


Special thanks to the Graphics, Firefox OS Performance, and Gaia teams for their hard work at fixing the problem. Special shout out to BenWa, Ben Kelly, and Vivien for their help.