A Quick Overview of Chrome's Rendering Path

I just wanted to see what Chrome was doing for its rendering compared to Firefox, especially after Slimming Paint. Chrome has three main processes that interact to render a web page.

  1.  The Renderer process - This is what you think of when you hear about multi-process Chrome. Each tab has its own renderer process.

  2. The GPU process - This is a sandboxed GPU environment where Chrome issues all hardware accelerated GPU calls to composite.

  3. The UI process - One process that renders the UI such as the tabs at the top of the browser.

From a high-level overview, content is first rendered in the renderer process (1). The rendered content is shared with the GPU process to do the actual compositing (2). The GPU process then issues OpenGL calls to composite the content, along with the browser UI, on the screen. The most interesting stuff happens in the renderer process. The overall flow in the renderer process as of today to render a webpage (lots are changing in Slimming Paint) is:

  1. On the main thread of the renderer process, rendering starts during parsing of the HTML, which creates a DOM tree.

  2. These DOM nodes mostly get turned into LayoutObjects, which know how to paint themselves. These get layerized into multiple different kinds of layers, which you can read up on here.

  3. GraphicsLayers paint the LayoutObjects into blink::DisplayItems. When content gets painted, Chrome issues Skia draw calls to draw the content. But the content isn't actually drawn; it's recorded into an SkPicture. Most display items in Chrome are wrappers around SkPictures or map to the Skia API.

  4. The generated blink::DisplayList is then translated to Chrome Compositor (cc) Display Lists, which are roughly the same, but no longer contain references to the LayoutObjects. Yes, there are two display lists.

  5. This cc::DisplayList is shipped to the compositor thread of the renderer process. Using a work queue, multiple compositor threads replay the recorded Skia drawing commands to draw to 256x256 tiles.

  6. The resulting tiles are then composited with hardware acceleration in the GPU process. The GPU process does the tile GPU upload.

Let's take a look at this with a real example. Consider a simple page with just some text and a 256x256 image:

Hello World
<img src="firefox.png" />

This generates a blink DisplayList like so:

[{index: 0, client: "0x1cd204404018 LayoutView #document", type: "DrawingDocumentBackground", rect: [0.000000,0.000000 1050.000000x755.000000], cacheIsValid: true},
{index: 1, client: "0x1cd204438010 InlineTextBox 'Hello World'", type: "DrawingPaintPhaseForeground", rect: [8.000000,8.000000 80.000000x18.000000], cacheIsValid: true},
{index: 2, client: "0x1cd204428018 LayoutImage IMG", type: "DrawingPaintPhaseForeground", rect: [8.000000,42.000000 256.000000x256.000000], cacheIsValid: false}]

The index is where the item exists in the display list. The client is what the display item is there to paint, and is a reference to a LayoutObject. The first, LayoutView #document, is just the empty white background we have in a simple page. The "type" is the only unique thing about the blink display item: which phase these items should be painted in. Painting in this context means going through the layout tree and generating blink::DisplayItems, which are wrappers around an SkPicture. An SkPicture is just a recording of draw commands such as "draw text" or "draw image" which map to the Skia API. Thus, unlike Gecko, blink doesn't draw from the display items. Instead, display items are the result of recording the draw commands issued by painting items recursively down the layout tree. Once we have the blink display items, these are translated into cc display items, which are also just wrappers around the recorded SkPicture. Once the translation is done, the main thread notifies the compositor thread that there is some work to do.
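To make the recording idea concrete, here's a minimal Python sketch (all names hypothetical; blink's real code is C++ and records into SkPictures) of painting as recording: walking layout objects and emitting display items that wrap recorded draw commands without executing them:

```python
class DisplayItem:
    """Stand-in for a blink::DisplayItem wrapping a recorded SkPicture."""
    def __init__(self, client, item_type, rect):
        self.client = client      # the LayoutObject this item paints
        self.type = item_type     # the paint phase
        self.rect = rect
        self.commands = []        # recorded draw commands, never executed here

    def record(self, command):
        self.commands.append(command)

def paint(layout_objects):
    """Walk the layout tree and record draw commands; nothing is drawn."""
    display_list = []
    for obj in layout_objects:
        item = DisplayItem(obj["client"], obj["type"], obj["rect"])
        item.record(("draw", obj["client"]))  # e.g. "draw text", "draw image"
        display_list.append(item)
    return display_list

# The three items from the dump above:
items = paint([
    {"client": "LayoutView #document", "type": "DrawingDocumentBackground",
     "rect": (0, 0, 1050, 755)},
    {"client": "InlineTextBox 'Hello World'", "type": "DrawingPaintPhaseForeground",
     "rect": (8, 8, 80, 18)},
    {"client": "LayoutImage IMG", "type": "DrawingPaintPhaseForeground",
     "rect": (8, 42, 256, 256)},
])
print(len(items))  # 3
```

Replaying, rather than recording, happens later on the compositor tile workers.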

The compositor thread actually rasterizes items on the screen based on 256x256 tiles. The compositor thread itself also doesn't actually replay the SkPicture draw commands; instead it sets up a work queue with items to draw. Then multiple compositor tile worker threads take a work item and replay the SkPicture draw calls onto an SkCanvas. In our example, we have 3 display items, but all items are painted onto one SkCanvas. Once the tile worker threads have finished rasterizing the cc::DisplayItems, Chrome can start compositing everything together. By default on Windows, Chrome doesn't use GPU acceleration to rasterize content, but it does for compositing. Remember the GPU process at the beginning though? All GPU commands must be executed in the GPU process so that when the GPU driver does something bad (and it happens too often), it doesn't take down the browser. To do this, Chrome uses shared memory with the GPU process. The GPU process reads the final rasterized image and uploads it to the GPU. Then the GPU process issues GL draw calls to draw quads, each of which is just the rectangle containing the final bitmap image for the tile. Finally, Chrome's ubercompositor composites all the images together with the browser's UI.
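The tile work queue can be modeled with a few lines of Python (a toy model, not Chrome's C++; only the 256x256 tile size and the 1050x755 page size come from the text above):

```python
import queue
import threading

TILE_SIZE = 256

def make_tiles(width, height):
    """Cover a width x height surface with 256x256 tiles."""
    return [(x, y) for y in range(0, height, TILE_SIZE)
                   for x in range(0, width, TILE_SIZE)]

def rasterize(tile, results):
    # Chrome would replay the recorded SkPicture draw calls onto an SkCanvas here.
    results.append(tile)

work = queue.Queue()
for tile in make_tiles(1050, 755):
    work.put(tile)

results = []

def worker():
    while True:
        try:
            tile = work.get_nowait()
        except queue.Empty:
            return
        rasterize(tile, results)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 1050x755 needs ceil(1050/256) x ceil(755/256) = 5 x 3 = 15 tiles
print(len(results))  # 15
```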

A Short Walkthrough of WebRender 2

I've been learning quite a bit about WebRender, a new research rendering engine for web content. At its core, WebRender's philosophy is to use the GPU to draw web content instead of the CPU, which has traditionally been used. Currently, Firefox uses an immediate-mode API to draw content (note that immediate / retained mode are very broad and overloaded terms). An immediate-mode API usually requires a context object; the programmer sets different states on the context object and then draws things onto the screen one item at a time. A retained-mode API, which WebRender is using, is slightly different. Instead, rendering is done in two passes. First, you set up and upload data to the GPU and issue draw calls to the GPU to render, but nothing is actually drawn yet. Next, you tell the GPU "draw everything I told you about". This is a fundamentally different way to think about programming, which has actually been wrecking my brain the past couple of days. WebRender itself takes display items as input and outputs pixels to the screen. It takes a complete description of the layout up front and analyzes how best to optimize the commands to send to the GPU.

Let's take a high level overview of what's going on with WebRender.

We got a couple of steps here that we'll delve into a bit further, but the overview is:

  1. Servo performs layout, which generates Display Items. Display items are just descriptions of what to render like "Text" and "Images".
  2. WebRender takes these display items and converts them to a Primitive type, which is a 1:1 representation of layout's Display Items. These could technically be the same as the Display Items in (1), but sometimes it's easier to deal with a different data structure.
  3. WebRender assigns these Primitives to Layers, which are not the same concept as layers in Gecko. Instead Layers here are Stacking Contexts.
  4. We then generate screen tiles, which at the moment are uniform 64x64 device pixel tiles.
  5. Assign the Primitives in each Layer to their appropriate screen tiles.
  6. Break down the Primitives per tile into Packed Primitives. Packed Primitives are the same Primitive information broken down into GPU uploadable data.
  7. Update the list of GPU resources that need to be updated.
  8. Notify the Compositor thread (which is the main thread) that we're ready to render.
  9. The main thread / Compositor issues OpenGL draw commands and updates GPU resources based on the information from (7). All actual GPU interactions occur on the main thread.
  10. ???
  11. Profit!

Let's walk through what is actually happening for a small test case.

div {
  width: 100px;
  height: 50px;
  background-image: url("50px.png");
}
In this test case, all we're trying to do is render an image onto the screen. Currently, WebRender only works inside Servo, so we have to load the test case through Servo to do the actual layout and generate the Display Items for us. We can tell Servo to dump the display list, and we get this:

┌ Display List
│  ├─ Items
│  │  ├─ SolidColor rgba(0, 0, 0, 0) @ Rect(1024px×740px at (0px,0px))
│  │  ├─ SolidColor rgba(0, 0, 0, 0) @ Rect(1008px×50px at (8px,8px))
│  │  ├─ SolidColor rgba(0, 0, 0, 0) @ Rect(200px×50px at (8px,8px))
│  │  └─ Image @ Rect(200px×50px at (8px,8px))
│  ├─ Stacking Contexts
│  │  ├─ Layered StackingContext at Rect(1024px×740px at (0px,0px))

We have three SolidColor rects along with one Image item within one Stacking Context. Let's focus only on the Image display item for now. We start by iterating over each Display Item given to us by layout and converting it to a Primitive type; adding an image converts the Display Item information to an ImagePrimitive. Next, we assign these Primitives to Layers, which are 1:1 with Stacking Contexts. From the display list dump, we only have one Stacking Context, so we'll only have one Layer in this case. However, the wrinkle here is that we do this based on tiles. WebRender splits the screen into uniform 64x64 tiles. For each tile, we check if the Layer intersects the tile. If so, for every Primitive in the Layer, we check whether the Primitive intersects the tile, and store which Primitives exist on the tile. This makes it easier for every tile to know what it needs to actually render.
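That intersection test can be sketched in Python (hypothetical structure; WebRender itself is Rust), using the 100x50 image primitive from the test case:

```python
TILE = 64  # WebRender's 64x64 device-pixel screen tiles

def intersects(a, b):
    """Axis-aligned rect overlap; rects are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def assign_to_tiles(primitives, screen_w, screen_h):
    tiles = {}
    for ty in range(0, screen_h, TILE):
        for tx in range(0, screen_w, TILE):
            hits = [p for p in primitives
                    if intersects(p["rect"], (tx, ty, TILE, TILE))]
            if hits:
                tiles[(tx, ty)] = hits
    return tiles

# The 100x50 CSS box at (8, 8), assuming 1 CSS px == 1 device px:
image = {"kind": "Image", "rect": (8, 8, 100, 50)}
tiles = assign_to_tiles([image], 1024, 740)
print(sorted(tiles))  # [(0, 0), (64, 0)]: two tiles across, one down
```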

Once we have all the Primitives for each tile, we go through each tile and create the PackedPrimitives that are local to each tile. Let's assume for simplicity that device pixels are equal to CSS pixels, e.g. 1px in the CSS specification means 1 physical pixel on the monitor (which doesn't always happen, but it's easier for this). For example, since this test case is 100x50, it will span two 64x64 tiles horizontally and 1 tile vertically, for 2 total. The PackedPrimitive has information about which tile it's assigned to, which will be used later in the shader. What's critical here is that WebRender will upload the same PackedImagePrimitive N times, once per tile, with really the only difference being the tile id. In addition, it uploads some data called st0 and st1. These are references to the top-left and bottom-right coordinates of the image in a Texture Atlas. A Texture Atlas is one giant image composed of other smaller images. Instead of creating a separate GPU buffer for just the image, WebRender uses one giant texture, with parts of the Texture Atlas used for individual images. The vertex shader will reference these coordinates later as sampling coordinates to render the actual image from the Texture Atlas. The lists of Packed Primitives that are of the same type to render are called batches.
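The st0/st1 pair is just a normalized rectangle inside the atlas. A small sketch (the 1024x1024 atlas size and the packing position are made up; 50x33 is the source image's actual pixel size):

```python
ATLAS_SIZE = 1024  # hypothetical atlas dimensions

def atlas_coords(x, y, w, h, atlas=ATLAS_SIZE):
    """Top-left (st0) and bottom-right (st1) of an image in [0..1] atlas space."""
    st0 = (x / atlas, y / atlas)
    st1 = ((x + w) / atlas, (y + h) / atlas)
    return st0, st1

# Suppose the 50x33 source image was packed at the atlas origin:
st0, st1 = atlas_coords(0, 0, 50, 33)
print(st0, st1)
```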

Next, WebRender does some notetaking on what needs to be rendered and notifies the Compositor thread, which is really the main thread of the process, that we need to update the screen. ProTip: Doing GPU operations not on the main thread means you're going to have a bad time. Once the compositor is notified, WebRender finally starts doing actual graphics things with OpenGL! The biggest brain wreck for me was realizing that with GPUs, you issue commands to the GPUs but you don't know when the GPU will actually execute the commands. In reality, it means you're queuing up work to do, and it will execute sometime in the future. Second, with shaders, you map vertices to different points on a [0..1] axis and transform them onto the screen. This makes GPUs very hard to debug as you can't actually single step through your shader to see what is going on.

If you already know OpenGL, you can skim some of this section. Otherwise, WebRender first compiles all the shaders, which are unique to each PackedPrimitiveType. In our example, we have a dedicated ps_image vertex shader that gets compiled and run whenever we need to draw an image. We also set up our vertices to be two triangles that split a quad from [0,0] to [1,1]. This gets bound to the 'aPosition' vertex attribute, which we'll use in the vertex shader. Then we go through every batch and start issuing GL calls. For images, we'll start here. The chunk data here is the PackedPrimitiveData for each tile. Thus for each tile we'll issue a draw call, but the draw call is an instanced draw call, which is a key rendering concept for WebRender. An instanced draw call increments gl_InstanceID, which is accessible in the vertex shader. This lets us draw the same indices multiple times without issuing new draw commands to the GPU, speeding up performance. Finally, we can take a look at the image vertex shader:

void main(void) {
    // gl_InstanceID is auto incremented due to instanced call.
    Image image = images[gl_InstanceID];
    Layer layer = layers[image.info.layer_tile_part.x];
    Tile tile = tiles[image.info.layer_tile_part.y];

    // This vertex's position within the CSS box.
    // aPosition iterates through the 4 corners from [0,0] to [1,1]
    vec2 local_pos = mix(image.local_rect.xy,
                         image.local_rect.xy + image.local_rect.zw,
                         aPosition.xy);

    vec4 world_pos = layer.transform * vec4(local_pos, 0, 1);

    vec2 device_pos = world_pos.xy * uDevicePixelRatio;

    // Clamp the position to the tile boundaries
    vec2 clamped_pos = clamp(device_pos,
                             tile.actual_rect.xy,
                             tile.actual_rect.xy + tile.actual_rect.zw);

    // Get the tile's position in relation to the CSS box.
    vec4 local_clamped_pos = layer.inv_transform * vec4(clamped_pos / uDevicePixelRatio, 0, 1);

    // vUv will contain how many times this image has wrapped around the image size.
    // These 3 variables are passed onto the fragment shader.
    vUv = (local_clamped_pos.xy - image.local_rect.xy) / image.stretch_size.xy;
    vTextureSize = image.st_rect.zw - image.st_rect.xy;
    vTextureOffset = image.st_rect.xy;

    // Transform the position onto the screen.
    vec2 final_pos = clamped_pos + tile.target_rect.xy - tile.actual_rect.xy;

    gl_Position = uTransform * vec4(final_pos, 0, 1);
}
The vertex shader reads an Image structure, which then lets the shader read the tile it's currently rendering. The positions of the 4 corners of the CSS image here (the corners of the 100x50 test case) get clamped to the tile boundaries, for every tile. Finally, it maps the tile boundaries to a position within the Texture Atlas and passes three variables (vUv, vTextureSize, and vTextureOffset) to the fragment shader. Note that vTextureSize here is the size as reported in the Texture Atlas, not the actual size of the image, and is bound between [0..1]. With OpenGL, you only map points to the texture. The vertex shader is mapping the 4 corners of each tile to a specific location of the CSS box. These locations of the CSS box map to a specific point within the background 50x33 pixel image. We're just telling the GPU where each of these points is, respectively. Consider the illustration:

Consider the tile boundary on the top right of this tile. This tile corner is some position, aPoint, within the CSS image we wanted to render, the original 100 px x 50 px image. Based on the location of aPoint, we map it to the location of the actual image in the Texture Atlas.

void main(void) {
    vec2 st = vTextureOffset + vTextureSize * fract(vUv);
    oFragColor = texture(sDiffuse, st);
}

Lastly, we're at the fragment shader! Look back at how we calculated the three variables vUv, vTextureOffset, and vTextureSize. vTextureSize is the size of the texture within the Texture Atlas. vTextureOffset is where the start of the image is within the Texture Atlas. vUv is the only interesting part, which calculates how many times an image should have repeated at the current tile corner. In the sample case, at tile 1 with an X offset of 64, we've rendered the image (64 px tile offset / 50 px image size) or 1.28 times. This means the image needs to wrap around and repeat again. Thus in the fragment shader, we take the fractional part of this calculation (0.28) and say this point maps to 28% from the left of the source image. We just mapped the tile corners to the appropriate points within the source image. The GPU does the rest and interpolates everything else in between. On a side note, the fragment shader gets called for every pixel whereas the vertex shader gets called for every vertex. Since we only supplied 6 vertices for two triangles to draw one rectangle, the vertex shader will be called 6 times. We have to do the fract() calculation in the fragment shader, since the actual value of vUv will be interpolated between the values we supplied in the vertex shader at the 6 vertices. (E.g. if at vertex [0,0] we supplied a vUv value of 0.3 and at vertex [1,0] we supplied a vUv value of 0.6, every call to the fragment shader for pixels between [0,0] and [1,0] will get an interpolated value between 0.3 and 0.6.)
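The wrap arithmetic can be checked in a few lines of Python mirroring the shader math (simplified to one axis; the image is assumed to start at x=0, and the atlas placement is hypothetical):

```python
import math

def uv(device_x, image_origin_x, stretch_w):
    """vUv for one axis: how many image widths fit before this position."""
    return (device_x - image_origin_x) / stretch_w

def fragment_sample(v_uv, tex_offset, tex_size):
    """Fragment shader: st = vTextureOffset + vTextureSize * fract(vUv)."""
    return tex_offset + tex_size * (v_uv - math.floor(v_uv))

# Tile 1 starts at device x=64; the image is 50 px wide:
v_uv = uv(64, 0, 50)
print(round(v_uv, 2))                     # 1.28: one full repeat plus 28%
print(round(v_uv - math.floor(v_uv), 2))  # 0.28 into the source image

# Sampling position in the atlas, with a made-up offset of 0 and width 50/1024:
print(fragment_sample(v_uv, 0.0, 50 / 1024))
```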

Whew that was a lot and a crazy brain dump because GPUs are confusing. There's still a bunch more to WebRender such as compositing the Stacking Contexts together when they need blending, but I haven't gotten there yet. You can also learn a bit more on Air Mozilla.

Thanks very much to Glenn Watson for proofreading and explaining so many things to me.

Project Silk

For the past few months, I've been working on Project Silk which improves smoothness across the browser. Very much like Project Butter for Android, part of it is finally live on Firefox OS. Silk does three things:

  1. Align Painting with hardware vsync
  2. Resample touch input events based on hardware vsync
  3. Align composites with hardware vsync

What is vsync, why vsync and why does it matter at all?

Vertical synchronization occurs when the hardware display shows a new frame on the screen. This rate is set by the specific hardware, but major displays in the US refresh 60 times a second, or every 16.6 ms. This is where you hear about 60 frames per second: one frame every time the hardware display refreshes. What this means in reality is that no matter how many frames are produced in software, the hardware display will still only show at most 60 unique frames per second.

Currently in Firefox, we mimic 60 frames per second, and therefore vsync, with a software timer that schedules rendering every 16.6 ms. However, the software scheduler has two problems: (a) it's noisy, and (b) it can be scheduled at bad times relative to vsync.

In regards to noise, software timers are much noisier than hardware timers. This creates micro-jank for a number of reasons. First, many animations are keyed off timestamps from the software scheduler to update the position of the animation. If you've ever used requestAnimationFrame, you get a timestamp from a software timer. If we want smooth animations, the timestamps provided to requestAnimationFrame should be uniform. Non-uniform timestamps will create non-uniform and janky animations. Here is a graph showing software versus hardware vsync timer uniformity:

Wow! That's a lot better with a hardware timer. With hardware timers, we get a much more uniform, and therefore smoother timestamp to key animations off of. So that's problem (a), noisy timers in software versus hardware.

With part (b), software timers can be scheduled at bad times relative to vsync. Regardless of what software does, the hardware display will refresh on its own clock. If our rendering pipeline finishes producing a frame before the next vsync, the display is updated with the new content. If we fail to finish producing a frame before the next vsync, the previous frame is displayed again, causing jankiness. Some rendering functions can also start close to a vsync and overflow into the next interval, which introduces extra latency, since the frame won't be displayed on the screen until the following vsync anyway. Let's look at this in graphic form:

At time 0, we start producing frames. Let's say all frames for the sake of example take a constant time of 10 ms. Our frame budget is 16.6 ms because we only have to finish producing a frame before the hardware vsync occurs. Since frame 1 is finished 6 ms before the next vsync (time t=16 ms), everything is successful and life is good. The frame is produced in time and the hardware display will be refreshed with the updated content.

Now let's look at Frame 2. Since software timers are noisy, we start producing a frame 9 ms before the next vsync (which occurs at t=32), i.e. at t=23. Since our frame takes 10 ms to produce, we'll actually finish producing this frame 1 ms AFTER the next vsync. That means at vsync number 2 (t=32), there was no new frame to display, so the display still shows the previous frame. In addition, the frame just produced won't be shown until vsync 3 (t=48), because that's when the hardware updates itself. This creates jank, since the display will have skipped 1 frame and has to catch up in the upcoming frames. This also produces one extra frame of latency, which is terrible for games.
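The arithmetic in these two frames can be sketched in Python (vsync rounded to 16 ms for clarity; the 10 ms frame cost is from the example):

```python
VSYNC = 16       # ms, simplified from 16.6
FRAME_COST = 10  # ms to produce one frame

def display_time(start_ms):
    """The vsync at which a frame started at start_ms reaches the screen."""
    finish = start_ms + FRAME_COST
    intervals = -(-finish // VSYNC)  # ceiling division: next vsync >= finish
    return intervals * VSYNC

print(display_time(0))   # 16: frame 1 finishes at t=10, shown at vsync t=16
print(display_time(23))  # 48: frame 2 finishes at t=33, misses vsync t=32,
                         # and slips a whole interval to vsync t=48
```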

Vsync improves both of these problems since we get both a much more uniform timer and the maximum amount of frame budget time to produce a new frame. Now that we know what vsync is, we can finally go onto what Project Silk is and why it helps create smooth experiences in Firefox.

The Rendering Pipeline

Gecko's rendering pipeline in super simplified terms does three things:

  1. Paint / draw the new frame on the main thread.
  2. Send the updated content to the Compositor via a LayerTransaction.
  3. Composite the new content.

In the ideal world, we'd be able to do all three steps within 16.6 ms, but that's not the case most of the time. Both steps (1) and (3) are on independent software timers. Thus, there is no real synchronizing clock between the three steps; they are all ad hoc. They also have no relation to vsync, so the timing of the pipeline isn't related to when the display actually updates the screen with the content. With Silk, we replace both independent software timers with the hardware vsync timer. For our purposes, (2) isn't really affected by Silk and is only here for completeness.

Align Painting with Hardware Vsync

Aligning the timer used to tick the refresh driver with vsync creates smoothness in a couple of ways. First, many animations are still done on the main thread, which means any animation using timestamps to set its position should be smoother. This includes requestAnimationFrame animations! The other nice thing is that we now have a very strict ordering of when rendering is kicked off. Instead of (1) and (3) occurring at unrelated times, we start rendering at a specific time.

Resample Touch Input Events Based on Vsync

With Silk, we can enable touch resampling which improves smoothness while tracking your finger. Since I've gone over what touch resampling is quite a bit, I'll leave this short. With Silk, we can finally enable it!

Align Composites with Hardware Vsync

Finally, the last part of Silk is aligning composites with hardware vsync. Compositing takes all the painted content and merges it together to create a single image you see on the display. With Silk, all composites start right after a hardware vsync occurs. This has actually produced a rather nice side benefit: reduced composite times. See:

Within the device driver on a Flame device, there's a global lock that's grabbed when close to vsync intervals. This lock can take 5-6 ms to get, greatly increasing the composite times. However, when we start a composite right after a vsync, there is little contention to grab the lock. Thus we can shave off the wait, therefore reducing composite times quite a bit. Not only do we get smoother animations, but also reduced composite times and therefore better battery life. What a nice win!

With all three pieces, we now have a nice strict ordering of the rendering pipeline. We paint and send the updated content to the Compositor within 16.6 ms. At the next vsync, we composite the updated content. At the vsync after that, the frame will have gone through the rendering pipeline and be displayed on the screen. Keeping this order reduces jank, because we reduce the chances that the timers schedule each step at a bad time. For example, in the current implementation without Silk, the best case is that a frame is painted and composited within a single 16.6 ms frame, which is great. However, if the next frame takes 2 frames instead, we just created extra jank, even though no stage in the pipeline was really slow. Aligning the whole pipeline to create a strict ordering reduces the chances that we mis-schedule a frame.

Here's a picture of the rendering pipeline without Silk. We have Composites (3) on the bottom of this profile. We have painting (1) in the middle, where you see Styles, Reflow, Displaylist, and Rasterize. We have Vsync, which are those small little orange boxes at the top. Finally, we have Layer Transactions (2) at the bottom. First, compositing and painting are not aligned, so animations will be at different positions depending on whether they are on the main thread or the compositor thread. Second, we see long composites because the compositor is waiting on a global lock in the device driver. Lastly, it's rather difficult to read any ordering or see if there is a problem without deep knowledge of why / when things should be happening.

Here is a picture of the same pipeline with Silk. Composites are a little shorter, and the whole pipeline only starts at vsync intervals. Composite times are reduced because we start composites exactly at vsync intervals. There is a clear ordering of when things should happen and both composites and painting are keyed off the same timestamp, ensuring smoother animations. Finally, there is a clear indicator that as long as everything finishes before the next Vsync, things will be smooth.

Overall, Silk hopes to create a smoother experience across Firefox and the web. Numerous people contributed to the project. Thanks to Jerry Shih, Boris Chou, Jeff Hwang, Mike Lee, Kartikaya Gupta, Benoit Girard, Michael Wu, Ben Turner, Milan Sreckovic for their help in making Silk happen.

Android's Touch Resampling Algorithm

After digging through the internet and some advice from Jerry Shih, I looked into Google Android's touch resampling algorithm. For those interested, it is located here. Android's touch resampling algorithm is amazing, and hats off to the Googler who figured it out. Android uses a combination of touch extrapolation and touch interpolation. Touch interpolation means we take two touch events and create a single touch event somewhere in the middle of the two touch events. Touch extrapolation means we take two touch events and create a single touch event somewhere ahead of the last touch event, predicting where a touch event will be. Let's take a look at our standard input from both the LCD refresh display rate of 60 Hz and the touchscreen scan rate of 100 Hz. (For those who need catching up, please read the previous post.)

We have touch input events coming in every 10 milliseconds that each move by 10 pixels, and a display vsync event every 16.6 milliseconds. Android's algorithm creates a new timing event called the sample time, which is always 5 milliseconds behind the vsync event. Thus, the first sample time is at st=11 ms (16 ms vsync time - 5 ms), the next sample time is at st=27 ms (32 ms vsync - 5), and the next at st=43 ms (48 - 5). Let's take a look at the same graph with the sample time:

The sample time is used to determine whether touch events are extrapolated or interpolated. We only ever look at the last two real touch events, never at a sampled event. If the last touch event occurs AFTER the sample time but BEFORE the vsync time, we interpolate. This occurs, for example, at vsync t=32 ms. The sample time is st=27 ms, and we have two touch events: one at t1=20 ms and another at t2=30 ms. Since the last touch event, t2, is at 30 ms, after 27 ms, we interpolate these two touch events. If the last touch event occurs BEFORE the sample time, we extrapolate. For example, at vsync event t=48 ms, we have a sample time of st=43 ms. The last touch event is at t=40 ms, which is before the sample time, so we use the last two touch events at t1=30 ms and t2=40 ms to extrapolate the touch event for vsync t=48. To summarize: if there exists a touch event AFTER the sample time (not the vsync time), we interpolate the last two touches; if both touch events occur BEFORE the sample time, we extrapolate them.
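The decision rule boils down to one comparison. A sketch, using the 5 ms sample offset from above:

```python
def resample_mode(last_touch_ms, vsync_ms, sample_offset_ms=5):
    """Interpolate if the newest real touch lands after the sample time."""
    sample_time = vsync_ms - sample_offset_ms
    return "interpolate" if last_touch_ms > sample_time else "extrapolate"

print(resample_mode(30, 32))  # touch at t=30 > sample time 27 -> interpolate
print(resample_mode(40, 48))  # touch at t=40 < sample time 43 -> extrapolate
```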


Let's take a look at the interpolation case. It is not a simple midpoint interpolation, as it takes into account when the touch event occurs relative to the sample time. First, we calculate the amount of time that has passed between the last two touch events; let's call this the touch time difference. This should be relatively stable per device; in our case it is always 10 milliseconds. Next, we calculate the time difference between the sample time and the touch event that is BEFORE the sample time (the touch sample difference). For our example at vsync t=32, the sample time is 27 ms. The touch event before the sample time is at t=20 ms. Thus, 27 ms sample time - 20 ms touch time = 7 ms. Next, we create a variable called alpha that is the touch sample difference / touch time difference. In this case, 7 ms / 10 ms = 0.7. Finally, we do linear interpolation with this alpha value as the midpoint modifier between the touch event after the sample time and the touch event before the sample time. Our two touch events were at t=20 with displacement d=20 and t=30 with displacement d=30. We start with the first touch event before the sample time and add an interpolated displacement to it. Thus, we have an interpolated displacement d = 20 + (30 - 20) * alpha, which is 20 + (10 * 0.7) = 27. Thus at vsync time t=32, we send a touch event with a displacement of d=27. A larger alpha means we lean more towards the last touch event, and a lower alpha means we lean more towards the first touch event. Here it is in equation form:

Here is the general equation, where LastTouch refers to the touch ahead of the SampleTime and FirstTouch refers to the touch before the SampleTime:

SampledTouch = FirstTouch + (LastTouch - FirstTouch) * alpha, where alpha = (SampleTime - FirstTouchTime) / (LastTouchTime - FirstTouchTime)

Let's take a look at another example, at vsync time t=80. We have two touch events, one at t=70 and another at t=80. Our sample time here is 80 ms - 5 = 75 ms. Since we have one touch event at t=80 that occurs after the sample time, we interpolate. Here is the work: the touch sample difference is 75 - 70 = 5 ms, so alpha = 5 / 10 = 0.5, and d = 70 + (80 - 70) * 0.5 = 75.

We see that we have a final result of displacement d=75. That's it for interpolation.
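The interpolation steps can be written as a small Python function (names are mine, not Android's) and checked against both worked examples:

```python
def interpolate(t_first, d_first, t_last, d_last, sample_time):
    """Sample a touch displacement between the last two real touch events."""
    touch_time_diff = t_last - t_first            # 10 ms in these examples
    alpha = (sample_time - t_first) / touch_time_diff
    return d_first + (d_last - d_first) * alpha

# vsync t=32: touches (t=20, d=20) and (t=30, d=30), sample time 27 ms
print(round(interpolate(20, 20, 30, 30, 27)))  # 27

# vsync t=80: touches (t=70, d=70) and (t=80, d=80), sample time 75 ms
print(round(interpolate(70, 70, 80, 80, 75)))  # 75
```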


Touch extrapolation occurs when the last touch event is BEFORE the sample time. This occurs at vsync time t=48. Our sample time here is 48 - 5 = 43 ms. We have two touch events, one at t=30 and another at t=40 ms. Since both are before 43 ms, we extrapolate the touch events. The logic works similarly to the touch interpolation, with a few differences. We still have the same touch time difference between the two touch events, which is always 10 ms. Next, we calculate the touch sample difference, which is now the last touch event minus the sample time, so we expect a negative number: the last touch event t=40 - the sample time st=43 = -3. Next we calculate alpha the same way, touch sample difference / touch time difference; in this case (-3 / 10) = -0.3. Finally we again use the same linear interpolation equation, but because alpha is negative we will extrapolate. In addition, we swap the operands, subtracting the first touch from the last touch, and set the starting displacement to be the last touch. Thus, unlike the interpolation algorithm, we start with the last touch event and add a displacement to that. Our final result is displacement d = 40 + (30 - 40) * -0.3 = 43. Thus we extrapolated 3 pixels in this case. Here's all the math:

Here is the general extrapolation equation. LastTouch refers to the most recent touch event; FirstTouch refers to the earlier touch event. We still use the last two touch events to extrapolate.


Let's do one more extrapolation case, at vsync time t=128 ms. Here is the work for a final displacement of d=123:
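The extrapolation steps can be sketched the same way (again, a Python illustration of the math in this post, not browser code):

```python
def extrapolate_touch(first_t, first_d, last_t, last_d, sample_t):
    """Extrapolate a touch displacement at sample_t, which lies after
    both of the two most recent touch events (sample_t > last_t)."""
    touch_time_diff = last_t - first_t      # still ~10 ms between touches
    touch_sample_diff = last_t - sample_t   # negative: the sample is past the last touch
    alpha = touch_sample_diff / touch_time_diff
    # Operands swapped relative to interpolation: start from the last
    # touch and apply the negative alpha toward the first touch.
    return last_d + (first_d - last_d) * alpha

# vsync t=48: sample time 43 ms, touches (t=30, d=30) and (t=40, d=40).
print(extrapolate_touch(30, 30, 40, 40, 43))       # -> 43.0
# vsync t=128: sample time 123 ms, touches (t=110, d=110) and (t=120, d=120).
print(extrapolate_touch(110, 110, 120, 120, 123))  # -> 123.0
```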

Here are the final displacement figures if we do interpolation and extrapolation across our original input:

Wow. Just wow. Outside of the first frame, where we have no extra touch events, it hits the ideal case of moving a displacement of 16 pixels every 16 milliseconds for a perfect smooth scroll. It has a Frame Uniformity standard deviation of 0. What's even more amazing is that these formulas for touch interpolation and extrapolation work across different touch screen refresh rates. For example, Mozilla's Flame reference device has a touch refresh rate of every 13 ms instead of every 10 ms like the Nexus 4, and they still produce a perfect smooth scroll. Even more amazing, this algorithm accounts for touch driver noise: real devices don't send touch events at perfectly exact 10 ms intervals, they jitter, so you get touch events at 9.9 ms or 10.5 ms.
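To see where the standard deviation of 0 comes from, here is a sketch that resamples the whole touch stream, interpolating or extrapolating from the last two touches received by each vsync. The timings (touches every 10 ms with d=t, vsyncs every 16 ms, a 5 ms sample offset) are the idealized numbers used throughout this post:

```python
def resample(touches, vsyncs, delay=5):
    """touches: list of (time, displacement) pairs; returns one sample per vsync.
    The same linear formula interpolates (0 <= alpha <= 1) when the sample
    time falls between the last two touches, and extrapolates (alpha > 1)
    when both touches precede the sample time."""
    samples = []
    for v in vsyncs:
        sample_t = v - delay
        seen = [td for td in touches if td[0] <= v]  # touches received so far
        if len(seen) < 2:
            samples.append(seen[-1][1])              # first frame: nothing to blend
            continue
        (t0, d0), (t1, d1) = seen[-2], seen[-1]
        alpha = (sample_t - t0) / (t1 - t0)
        samples.append(d0 + (d1 - d0) * alpha)
    return samples

touches = [(t, t) for t in range(10, 171, 10)]  # d=10 at t=10, d=20 at t=20, ...
samples = resample(touches, range(16, 161, 16))
deltas = [b - a for a, b in zip([0] + samples, samples)]
print(deltas)  # after the first two frames settle, a constant 16 px per frame
```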

How does it feel in real life? Since we actually extrapolate touch events while scrolling, it feels more responsive since it seems to keep up with your finger better. In addition, flings feel faster, making the whole system feel not necessarily smoother but more responsive. While tracking your finger, or doing a slow scroll, we get the additional benefit that scrolling feels smoother as well. In all aspects, compared to the other touch algorithms, this is superior. On fast scrolls it keeps up with your finger better than the previous sampling algorithms. On slow scrolls, it smooths out the touch events, creating a smooth experience. The only downside is that if you change your finger's direction, it will be a bit more jarring if we extrapolate and then change directions. However, in real life I rarely noticed it, and only because I was looking for it. Overall, a very nice win.


Smooth Scrolling, Frame Uniformity, Touch Interpolation and Touch Responsiveness

I hope you have coffee, because this is going to be a long one. When scrolling a webpage on a device, how smooth is it? Smoothness is how little jerk you see while scrolling the webpage. The big problem with this measurement is that it is somewhat subjective; it just "feels" smoother. However, we can actually measure it with a metric called Frame Uniformity. Frame Uniformity is a measure of how smoothly a page scrolls, and to have high Frame Uniformity, lots of things have to be right. The essential measure is:

If we have a constant drag of [A] px per unit of time, the screen should also have a constant drag of [A] px per unit of time.

For example, if my magic hand could scroll a web page at a constant and consistent rate of 1000 pixels per second (I'm human and this doesn't work, but we have tools), the webpage on the screen should scroll at a constant rate of 1000 pixels per second. Since a screen refreshes at 60 Hz, or 60 frames per second, or every 16.6 ms, that means at every screen refresh, or every frame, the screen should scroll by 16.6 pixels. This would be a perfect smooth scroll and earn the highest mark on Frame Uniformity.

The vertical bars are vsync events, which is when the display refreshes and shows a new frame. The number above each vertical bar represents the absolute position shown to the user at that vsync event. The horizontal bar represents the displacement, or the amount a user has perceived to scroll, between two frames. Thus, ideally we'd have a displacement in increments of 16 pixels every 16 milliseconds, with a difference of 16 pixels per frame (rounded to ints for simplicity). Visually, if each Firefox logo represents one frame, it would be a perfectly smooth scroll:

Unfortunately, due to the real world, it is very difficult to achieve a perfect 16.6 pixel scroll with a 16.6 constant drag rate. There is skew all along the system in the hardware, touch driver, the system processing, etc. My understanding is that no device on the market achieves perfect Frame Uniformity. However, low Frame Uniformity equals janky or jerky scrolling, which is not something we want when trying to deliver a high quality product. So what are the problems that create bad Frame Uniformity and how do we solve them?

The biggest problem with Frame Uniformity and smooth scrolling is that the display screen and touch screen refresh at different rates. The display refreshes at 60 Hz, or every 16.6 ms, whereas most commodity touch screens refresh at 100 Hz, or every 10 ms. The touch screen refreshing at 100 Hz means that it scans for touch input every 10 ms. Note, these are "estimates" and have skew. For example, the touch screen could scan at 9.9 ms, 10.1 ms, or 9.5 ms depending on the quality of the hardware. The better the hardware, the lower the skew. Thus, we have an uneven distribution of touch events versus display refreshes, which is a big source of jank.

Consider the average case. The display refreshes every 16.6ms and we perfectly dispatch to the touchscreen a touch move scroll of 10 pixels every 10 ms. I will round the 16.6 ms to 16ms just to make things easier. Our input from the hardware means we will have a graph like this:

We get a new touch event at a displacement of d=10 pixel increments every 10 ms. The vsync event is when the hardware display refreshes the screen. What we see is that at some vsync events we only have to process one touch event, e.g. at vsync time t=16 ms, where there is a single touch event with a displacement of d=10 pixels. At the next vsync event at time t=32, we have two touch events: one with a displacement of d=20 px and another with a displacement of d=30 px. At this one vsync event at t=32, we have to process two touch events but can only display one!

Since we can actually only process one touch event per frame, we can just take the last touch event, with a displacement of d=30, at time t=32. But what does this mean? It means at the first vsync event of t=16, we scrolled by 10 pixels. At the next vsync event of t=32, we scrolled by 20 pixels. (At the first frame of t=16, we were at 10 px, then at 30 px at t=32; 30 - 10 = 20 pixels of scroll difference.) If we extend this a couple more times across the next few frames, we start to see a pattern. At the vsync event of t=48, we have one touch event of displacement 40, so we move to pixel 40. This means a difference of (40 - 30 = 10), or 10 pixels in one frame. So at the first frame, we moved 10 pixels. The second frame, 20 pixels, and the third frame 10 pixels. Remember that the ideal was a constant drag of 16.6 pixels per frame. What we have now is a mostly alternating pattern of 10 pixels and 20 pixels. Here is the whole sequence:

This touch sequence visually looks like this to the user:

This isn't smooth at all! It's pretty jerky! Actually, it's measurable, with a Frame Uniformity standard deviation of 5. What this means is that the standard deviation between frame displacements across this interval is 5. The ideal case would be 0, meaning no deviation and a perfectly smooth scroll. Not too bad, but not great. What do we do!?
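The naive behavior above is easy to simulate. This sketch assumes perfectly regular touches (10 px every 10 ms) and vsyncs rounded to every 16 ms, as in the example; each frame simply takes the last touch event it received:

```python
touches = [(t, t) for t in range(10, 171, 10)]  # displacement d=t every 10 ms
positions, prev = [], 0
for v in range(16, 161, 16):                    # vsync every 16 ms
    frame = [d for (t, d) in touches if prev < t <= v]
    positions.append(frame[-1])                 # naive: take the last touch
    prev = v

deltas = [b - a for a, b in zip([0] + positions, positions)]
print(deltas)  # [10, 20, 10, 20, 20, 10, 20, 10, 20, 20]
```

The per-frame displacement bounces between 10 and 20 px instead of a steady 16, which is the jank the standard deviation is measuring.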

Touch Interpolation

This is where touch interpolation comes into play. The idea of touch interpolation is to smooth out the input's refresh rate and match it to the display driver's refresh rate. It is averaging out the touch events, coalescing them into one touch sample. This has two benefits. First, the system won't have to respond to touch events that won't be shown on the screen, which reduces system load. Second, the touch data can be manipulated to smooth out the choppiness on the screen, increasing Frame Uniformity. The first question though is why touch interpolation and not touch extrapolation? Touch extrapolation tries to predict where the touch event will be, and create one touch event with predictive touch behavior. The problem with extrapolation is that when a user swipes back and forth quickly, we can overscroll past what the user actually touched. The good thing about touch interpolation is that we never create a touch event that the user's actual finger wasn't on. Every interpolated touch will be a touch event on a path that the user actually had their finger on.

Great, so we want touch interpolation, what do we do? Do we just take two touch inputs and make them into one? Seems simple enough right? Let's see how that works out.

Midpoint Touch Interpolation

The first algorithm we introduce is a basic touch interpolation algorithm. If we have two touch events in one vsync event, we take the midpoint and coalesce the two touches into one. If we have one touch event in a vsync event, we just dispatch that one touch event. So for example, at vsync time t=16, we dispatch a touch event with displacement d=10. At vsync t=32, we take the midpoint between touch events [d=20, d=30] to create one touch event with displacement d=25. What we see is actually quite an improvement! We have a series of dispatched touch events with a difference of 15 pixels, with an occasional jump of 20 pixels. This has a Frame Uniformity standard deviation of 2.2, which is much better than 5! So overall, this should translate into a smoother scroll. Visually, it looks something like this:

Overall, a nice improvement in smoothness relative to the original problem. However, can we do better?
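As a sketch, the Midpoint algorithm only needs a couple of lines on top of the same idealized touch stream (10 px every 10 ms, vsync every 16 ms); the resulting positions match the Midpoint row in the comparison later in this post:

```python
touches = [(t, t) for t in range(10, 171, 10)]
positions, prev = [], 0
for v in range(16, 161, 16):
    frame = [d for (t, d) in touches if prev < t <= v]
    if len(frame) == 2:
        positions.append((frame[0] + frame[1]) // 2)  # two touches: midpoint
    else:
        positions.append(frame[-1])                   # one touch: dispatch as-is
    prev = v
print(positions)  # [10, 25, 40, 55, 75, 90, 105, 120, 135, 155]
```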

Midpoint of Last 2 Touches

Can we use previous touch events to smooth out the difference in frame displacement? Can we use the past to help us tweak the future to stabilize big changes? One example algorithm would be that if we do not have two touch events, we can use the previous frame's last touch event to create a sample for the current frame. Thus, we always use the last two touch events and use the midpoint to create an interpolated touch event for the current frame. We never interpolate with an interpolated touch, we just use the previous two real touches. If a vsync event only has one touch event, we use the current touch event and the last touch event from the previous frame. If the current vsync event has two touch events, we create a sample from the two current touch events. The intuition behind this is that if we have a big change, we can use the previous touch event to smooth out the big change and create a less noticeable jank. How does this look:

Visually, this looks like:

Interestingly, this has a standard deviation of 4.86, which is almost as bad as the current situation! It actually doesn't smooth anything out in this case and instead continues the alternating pattern of 10 pixels and 20 pixel displacements we had before. Bummer.
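Here is a sketch of this variant on the same idealized stream: every frame samples the midpoint of the last two real touch events, regardless of which frame they arrived in:

```python
touches = [(t, t) for t in range(10, 171, 10)]
positions = []
for v in range(16, 161, 16):
    seen = [d for (t, d) in touches if t <= v]        # all touches received so far
    if len(seen) >= 2:
        positions.append((seen[-2] + seen[-1]) // 2)  # midpoint of last two touches
    else:
        positions.append(seen[-1])
deltas = [b - a for a, b in zip([0] + positions, positions)]
print(deltas)  # [10, 15, 10, 20, 20, 10, 20, 10, 20, 20]
```

The alternating 10 px and 20 px displacements survive almost untouched, which is why the standard deviation barely improves.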

Midpoint of Last Two Touches With Sampling

What about trying to use a previous interpolated touch to smooth out the touch events? This version always uses the last two touch events, creating a single touch from their midpoint if both exist in the current vsync. If the current vsync only has one touch event, we interpolate the current touch event with the previous frame's sampled touch event. This is a slight variation of the Midpoint of the Last Two Touches algorithm in that we now incorporate previous samples, not just previous touch events. How does this look:

Wow, we made it worse! We see large jumps of up to 23 pixels and really small jumps of only 7 pixels. The standard deviation between frames jumped to 7.27. Ok, we probably shouldn't do extra damage here.
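For completeness, here is the same kind of sketch with the previous frame's sample mixed in, which is where the ~7 px and ~23 px swings come from:

```python
touches = [(t, t) for t in range(10, 171, 10)]
positions, prev_v, prev_sample = [], 0, None
for v in range(16, 161, 16):
    frame = [d for (t, d) in touches if prev_v < t <= v]
    if len(frame) == 2:
        sample = (frame[0] + frame[1]) / 2       # two touches: midpoint of both
    elif prev_sample is not None:
        sample = (prev_sample + frame[-1]) / 2   # one touch: blend with last sample
    else:
        sample = frame[-1]                       # very first frame
    positions.append(sample)
    prev_sample, prev_v = sample, v

deltas = [b - a for a, b in zip([0] + positions, positions)]
print(deltas)  # [10, 15.0, 7.5, 22.5, 20.0, 7.5, 22.5, 7.5, 22.5, 20.0]
```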

Last Sample and Last Touch

In this last algorithm, we touch interpolate the previous frame's sample and the current frame's latest touch event. If we have two touch events in a single vsync, we use the latest touch event plus the previous frame's sample, ignoring the middle touch event. If the vsync has one touch event, we touch interpolate the current frame's touch event with the previous frame's resampled touch. How does this look:

Wow, that looks a bit smoother huh. Interestingly, it lags behind a frame in the beginning since we resample the previous frame, and we see a nice oscillating frame displacement of ~13-17. We see that it improves the Frame Uniformity standard deviation to 3.05.
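Here is a sketch of Last Sample and Last Touch on the same idealized stream; truncated to ints, the positions match the LSLT row in the comparison below:

```python
touches = [(t, t) for t in range(10, 171, 10)]
positions, prev_v, prev_sample = [], 0, None
for v in range(16, 161, 16):
    latest = [d for (t, d) in touches if prev_v < t <= v][-1]  # newest touch this frame
    if prev_sample is None:
        sample = latest                        # first frame: nothing to blend with
    else:
        sample = (prev_sample + latest) / 2    # previous frame's sample + latest touch
    positions.append(sample)
    prev_sample, prev_v = sample, v

print([int(p) for p in positions])  # [10, 20, 30, 45, 62, 76, 93, 106, 123, 141]
```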

If we evaluate all the algorithms using only the Frame Uniformity standard deviation metric, it looks like the basic Midpoint algorithm wins, right? It has the lowest standard deviation at 2.2. The next best one, Last Sample and Last Touch (LSLT), has a standard deviation of 3.05, so it should be an easy win? Almost, if only it were that easy.

Evaluating Midpoint versus Last Sample and Last Touch and Touch Responsiveness

If we look closely at the Midpoint algorithm, what we see is a consistent drag of 15 pixels, one pixel behind the ideal 16 pixels per frame, with a jump up to 20 pixels every 3-4 frames to catch up. When we look at the Last Sample and Last Touch algorithm, we see the displacements shuffle from 15, to 17, to 14, to 17 pixels. The LSLT algorithm is better at keeping up with the ideal 16 pixels and doesn't fall behind as quickly.

Visually, with the Midpoint algorithm, we'll see perfect smoothness, then a single jank, then back to smoothness. However, since this occurs every 3-4 frames, or every 48-60 ms, it still visually looks like jank quite often. With the Last Sample and Last Touch algorithm, we have a constant jank of 1-3 pixels per frame offset from the ideal. However, visually, LSLT is smoother because the difference between frames is smaller. For example, with the Midpoint algorithm, we have one large jank of 5 pixels every 3-4 frames. With the Last Sample and Last Touch algorithm, we're off the ideal by 1 pixel one frame, then 1 pixel again, then 2, then 1. The jank is amortized over a couple of frames.

Imagine watching a car driving in a perfectly straight line. If the car swerved just a little bit for one second and then went back to a perfectly straight line, you could easily notice the car driving out of its lane. However, if the car was always alternating just a little bit, but was mostly straight, with each individual change mostly indiscernible, it would seem smoother. This is the difference between the two algorithms.

Numerically, this is an interesting difference between the Last Sample and Last Touch (LSLT) algorithm and the Midpoint algorithm. Remember that the ideal displacement is in increments of 16 pixels per frame. Let's take a look at the absolute displacement of the ideal case, the Midpoint algorithm, and the Last Sample and Last Touch algorithm:

Ideal:    [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
Midpoint: [10, 25, 40, 55, 75, 90, 105, 120, 135, 155]
LSLT:     [10, 20, 30, 45, 62, 76,  93, 106, 123, 141]

Hmm, it looks like the Midpoint algorithm is much better at tracking the ideal case. However, the numbers from Last Sample Last Touch look pretty interesting. They look close to increments of 16, just one frame behind. Let's take a look again by trailing one frame behind:

Ideal: [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
LSLT:  [20, 30, 45, 62, 76, 93, 106, 123, 141]

Wow, that is really close to ideal, isn't it! Let's see how much closer. Let's take the difference from the ideal for each frame. For the Midpoint algorithm, we'll match it with the current frame. For the LSLT algorithm, we'll match it with one frame behind. For example, for the second frame, we'll take 32 - 25 = 7 pixels of difference for the Midpoint algorithm. For LSLT, we'll take 32 - 30 = 2 pixels of difference.

Midpoint: [6, 7, 8, 9, 5, 6, 7, 8, 9, 5]. Average = 7 px.
LSLT:     [4, 2, 3, 2, 4, 3, 6, 5, 3]. Average = 3.6 px.

With the LSLT algorithm, we're very close to an ideal scroll, just one frame behind, with an average of 3.6 pixels away from the ideal. In addition, if we measure Frame Uniformity while disregarding the first two frames, we see an improvement there as well: the Midpoint algorithm increases from 2.2 to 2.44, whereas LSLT decreases from 3.05 to 1.86, outperforming the Midpoint algorithm. Thus the LSLT algorithm has a few interesting characteristics. The first couple of frames will be worse, but the middle and end will be much better. Since an average scroll takes a few hundred milliseconds, a user would only see a couple of early frames as slow, then everything is much better overall. While in the middle of a scroll, we track your finger pretty well, and pretty smoothly. The trade-off is that we add 16.6 ms of latency and fall somewhat behind in terms of displacement. Overall, scrolling will be smooth, but it will feel a little less responsive when tracking your finger.

Here is a visual comparison of the last few frames comparing the ideal, the Midpoint algorithm, and the Last Sample Last Touch algorithm in terms of smoothness.


Does any of this matter? Does a Frame Uniformity decrease from 5 to 2 actually mean anything? What about adding one frame of latency? Below is a video of the Last Sample Last Touch interpolation algorithm on a Flame device, with and without touch interpolation. Make sure you push the HD button. *Hint: Try to see which one you think is smoother; the answer for which device has touch interpolation is at the bottom. It's difficult to see the "smoothness" in a video, but in person it is very noticeable. Since I'm swiping my finger back and forth, you can see the touch latency as well. Is it worth it to trade smoothness for latency? I'm not sure, but at least it's smooth.

The last question is: is it noticeable in day to day use? Smoothness while tracking your finger is dramatically improved. Smoothness while flinging or already in a scroll is an animation and is only somewhat affected by this. Flings will appear slower because we think we're scrolling slower and thus calculate a slower velocity. Smoothness while slowing down a scroll is unaffected, as that is also an animation. However, in terms of smoothness while tracking your finger, it's pretty awesome.

Make sure to push the HD button for moar HDs!


All the algorithms on one giant graph:

* The device on the left has touch interpolation.