Smooth Scrolling, Frame Uniformity, Touch Interpolation and Touch Responsiveness

I hope you have coffee, because this one is going to be a long one. When scrolling a webpage on a device, how smooth is it? Smoothness is how little jerk do you see while scrolling the webpage. The big problem with this measurement is that it is somewhat subjective, it just "feels" smoother. However, we can actually measure it with a metric called Frame Uniformity. Frame Uniformity is a measure of how smooth a page is scrolling and to have a high Frame Uniformity, lots of things have to be right. The essential measure is:

If we have a constant drag of [A] px per unit of time, the screen should also have a constant drag of [A] px per unit of time.

For example, if my magic hand could scroll a web page at a constant and consistent rate of 1000 pixels per second (I'm human and this doesn't work, but we have tools), the webpage on the screen should scroll at a constant rate of 1000 pixels per screen. Since a screen refreshes at 60 hz for 60 frames per second, or every 16.6 ms, that means at every screen refresh, or every frame, the screen should scroll by 16.6 pixels every frame. This would be a perfect smooth scroll and have the highest mark on Frame Uniformity.

The vertical bars are vsync events, which is when the display refreshes and shows a new frame. The number above the vertical bar represents the absolute position shown to the user at that vsync event. The horizon bar represents the displacement, or the amount a user has perceived to scroll, between the two frames. Thus, ideally we'd have a displacement in increments of 16 pixels every 16 milliseconds, with a difference of 16 pixels per frame. (Rounded to ints for simplicity). Visually, imagine that if each Firefox logo represents one frame, it would be a perfectly smooth scroll:

Unfortunately, due to the real world, it is very difficult to achieve a perfect 16.6 pixel scroll with a 16.6 constant drag rate. There is skew all along the system in the hardware, touch driver, the system processing, etc. My understanding is that no device on the market achieves perfect Frame Uniformity. However, low Frame Uniformity equals janky or jerky scrolling, which is not something we want when trying to deliver a high quality product. So what are the problems that create bad Frame Uniformity and how do we solve them?

The biggest problem with Frame Uniformity and smooth scrolling is that the display screen and touch screen refresh at different rates. The display refreshes at 60hz, or every 16.6 ms whereas most commodity touch screens refresh at 100hz, or every 10ms. The touch screen refreshing at 100hz means that it scans the touch screen for touch input every 10ms. Note, these are "estimates" and have skew. For example, the touch screen could scan at 9.9 ms, or 10.1ms, or 9.5ms depending on the quality of the hardware. The better hardware, the lower skew. Thus, we have an uneven distribution of touch events versus display refreshes, which is a big source of jank.

Consider the average case. The display refreshes every 16.6ms and we perfectly dispatch to the touchscreen a touch move scroll of 10 pixels every 10 ms. I will round the 16.6 ms to 16ms just to make things easier. Our input from the hardware means we will have a graph like this:

We get a new touch event at a displacement of d=10 pixel increments every 10 ms. The vysnc event is when the hardware display refreshes the screen. What we see is that in some cases, we only have to process 1 touch event, e.g. at vsync event time t=16ms, and a displacement of 10 pixels. At the next vsync event at time t=32, we see we have two touch events. One with a displacement of d=20px and another with a displacement of d=30 pixels. At this one vsync event at t=32, we have to process two touch events but can only display one! At the first vsync event of t=16, we only have to process one touch event with a displacement of d=10.

Since we can actually only process one touch event per frame, we can just take the last touch event of a displacement=30 at time t=32. But what does this mean? It means at the first vsync event of t=16, we scrolled by 10 pixels. At the next vsync event of t=32, we scrolled by 20 pixels. (First frame of t=10, we were at 10px, then at 30px at t=32. 30 - 10 = 20 pixel scroll difference). If we extrapolate this a couple more times across the next few scrolls, we start to see a pattern. At the vsync event of t=48, we have one touch event of displacement 40, so we move to pixel 40. This means a difference of (40 - 30 = 10), or 10 pixels difference in one frame. So at the first frame, we moved 10 pixels. The second frame, 20 pixels, and the third frame 10 pixels. Remember that the ideal was 16.6 constant drag per frame. What we have now is an alternating pattern of 10 pixels, to 20 pixels, to 10 pixels. Here is the whole extrapolation:

This touch sequence visually looks like this to the user:

This isn't smooth at all! It's pretty jerky! Actually it's measurable to have a Frame Uniformity standard deviation of 5. What this means it the standard deviation between frame displacements across this interval is 5. The ideal case would be 0, meaning no differentiation and perfectly smooth. Not too bad but not great, What do we do!?

Touch Interpolation

This is where touch interpolation comes into play. The idea of touch interpolation is to smooth out the input's refresh rate and match it to the display driver's refresh rate. It is averaging out the touch events, coalescing them into one touch sample. This has two benefits. First, the system won't have to respond to touch events that won't be shown on the screen, which reduces system load. Second, the touch data can be manipulated to smooth out the choppiness on the screen, increasing Frame Uniformity. The first question though is why touch interpolation and not touch extrapolation? Touch extrapolation tries to predict where the touch event will be, and create one touch event with predictive touch behavior. The problem with extrapolation is that when a user swipes back and forth quickly, we can overscroll past what the user actually touched. The good thing about touch interpolation is that we never create a touch event that the user's actual finger wasn't on. Every interpolated touch will be a touch event on a path that the user actually had their finger on.

Great, so we want touch interpolation, what do we do? Do we just take two touch inputs and make them into one? Seems simple enough right? Let's see how that works out.

Midpoint Touch Interpolation

The first algorithm we introduce is a basic touch interpolation algorithm. If we have two touch events in one vsync event, we take the midpoint and coalesce the two touches into a touch. If we have one touch event in a vsync event, we just dispatch the one touch event. So for example, at vysnc time t=16, we dispatch a touch event with displacement=10. At vsync t=32, we take the midpoint between touch events [d=20, d=30] to create one touch event with displacement d=25. What we see is actually quite an improvement! We have a series of dispatching touch events with a difference of 15 pixels, with sometimes a jump to moving by 20 pixels. This has a Frame Uniformity of 2.2, which is much better than the standard deviation of 5! So overall, this should translate into a smoother scroll. Visually, it looks something like this:

Overall, a nice improvement in smoothness relative to the current original problem. However, can we do better?

Midpoint of Last 2 Touches

Can we use previous touch events to smooth out the difference in frame displacement? Can we use the past to help us tweak the future to stabilize big changes? One example algorithm would be that if we do not have two touch events, we can use the previous frame's last touch event to create a sample for the current frame. Thus, we always use the last two touch events and use the midpoint to create an interpolated touch event for the current frame. We never interpolate with an interpolated touch, we just use the previous two real touches. If a vsync event only has one touch event, we use the current touch event and the last touch event from the previous frame. If the current vsync event has two touch events, we create a sample from the two current touch events. The intuition behind this is that if we have a big change, we can use the previous touch event to smooth out the big change and create a less noticeable jank. How does this look:

Visually, this looks like:

Interestingly, this has a standard deviation of 4.86, which is almost as bad as the current situation! It actually doesn't smooth anything out in this case and instead continues the alternating pattern of 10 pixels and 20 pixel displacements we had before. Bummer.

Midpoint of Last Two Touches With Sampling

What about trying to use a previous interpolated touch to smooth out the touch events? This version always uses the last two touch events and creates a single touch using the midpoint if they exist in the current vsync. If the current vsync only has one touch event, we interpolate the current touch event plus the previous frame's sampled touch event. This is a slight variation of the Midpoint of the Last Two Touches algorithm in that we now incorporate previous samples, not just previous touch events. How does this look:

Wow we made it worse! We see large jumps of up to 23 pixels and really small jumps of only 7 pixels. The standard deviation between frames jumped to 7.27. We made it worse! Ok, we probably shouldn't do extra damage here.

Last Sample and Last Touch

In this last algorithm, we touch interpolate the previous frame's sample and the current frame's latest touch event. If we have two touch events in a single vsync, we use the latest touch event plus the previous frame's sample, ignoring the middle touch event. If the vsync has one touch event, we touch interpolate the current frame's touch event with the previous frame's resampled touch. How does this look:

Wow, that looks a bit smoother huh. Interestingly, it lags behind a frame in the beginning since we resample the previous frame, and we see a nice oscillating frame displacement of ~13-17. We see that it improves the Frame Uniformity standard deviation to 3.05.

If we evaluate all the algorithms, using only the Frame Uniformity standard deviation metric, it looks like just using the midpoint of the last two touches wins right? It has the lowest standard deviation at 2.2. The next best one, Last Sample and Last Touch(LSLT) has a standard deviation of 3.05, so should be an easy win? Almost, if only it was that easy.

Evaluating Midpoint versus Last Sample and Last Touch and Touch Responsiveness

If we look closely at using the Midpoint algorithm, what we see is a consistent drag of 15 pixels, one pixel behind the ideal 16 pixels per frame, with a jump up to 20 pixels every 3-4 frames to catch up. When we look at the Last Touch and Last Sample algorithm, we see a huddle of 15, to 17, to 14, to 17 pixels. The LSLT algorithm is better at keeping up with the ideal 16 pixels and doesn't fall behind as quickly.

Visually, with the Midpoint algorithm, we'll see perfect smoothness then a single jank, then back to smoothness. However, since this occurs every 3-4 frames, or every 48 - 60 ms, this visually still looks like jank quite often. With the Last Touch and Last Sample algorithm, we have a constant jank of 1-3 pixels per frame offset from the ideal. However, visually, LSLT is smoother because the difference between frames is less. For example, with the Midpoint algorithm, we have one large jank of 5 pixels every 3-4 frames. With the Last Touch and Last Sample algorithm, we're off of the ideal by 1 pixel one frame, then 1 pixel again, then 2, then 1. The jank is amortized over a couple of frames.

Imagine watching a car driving in a perfectly straight line. If the car swerved just a little bit for one second then went back to a perfect straight line, you can easily notice the car driving out of the lane. However, if the car was always alternating just a little but, but was mostly straight, each individual change mostly indiscernible, it would seem smoother. This is kind of the difference between the two algorithms.

Numerically, this is an interesting difference between the Last Sample and Last Touch (LSLT) algorithm and the Midpoint algorithm. Remember that the ideal displacement is in increments of 16 pixels per frame. Let's take a look at the absolute displacement of the ideal case, the Midpoint algorithm, and the Last Sample and Last Touch algorithm:

Ideal:         [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
Midpoint:  [10, 25, 40, 55, 75, 90, 105, 120, 135, 155]
LSLT:         [10, 20, 30, 45, 62, 76,  93,  106, 123, 141]

Hmm, it looks like the Midpoint algorithm is much better at tracking the ideal case. However, the numbers from Last Sample Last Touch look pretty interesting. They look close to increments of 16, just one frame behind. Let's take a look again by trailing one frame behind:

Ideal:     [16, 32, 48, 64, 80, 96, 112, 128, 144, 160]
LSLT:     [20, 30, 45, 62, 76, 93, 106, 123, 141]

Wow that is really close to ideal isn't it! Let's see how much closer. Let's take the difference from the ideal for each frame. For the Midpoint algorithm, we'll match it with the current frame. For the LSLT algorithm, we'll match it with one frame behind. For example, for the second frame, we'll take 32 - 25 = 7 frame difference for the Midpoint algorithm. For LSLT, we'll take (32 - 30 = 2) frame difference.

Midpoint: [6, 7, 8, 9, 5, 6, 7, 8, 9, 5]. Average = 7 px.
LSLT:        [4, 2, 3, 2, 4, 4, 6, 5, 3]. Average = 3.8 px.

With the LSLT algorithm, we're very close to an ideal scroll, just one frame behind, with an average of 3.8 pixels away from the ideal. In addition, if we measure Frame Uniformity by disregarding the first two frames between the Midpoint algorithm and the Last Sample Last Touch algorithm, we see an improvement in Frame Uniformity as well. Frame Uniformity for the Midpoint algorithm increases from 2.2 to 2.44 whereas LSLT decreases from 3.05 to 1.86, outperforming the Midpoint algorithm. Thus the LSLT algorithm has a few interesting characteristics. The first couple of frames will be worse but the middle and end will be much better. Since an average scroll takes a couple of hundreds of milliseconds, a user would only see a couple of early frames as slow, then overall much better. While in the middle of a scroll, we will track your finger pretty well, and pretty smoothly. The trade off is that we'll add 16.6 ms of latency and be somewhat behind in terms of displacement. Overall, scrolling will be smooth, but it will feel a little less responsive when tracking your finger.

Here are is a visual comparison of the last few frames comparing the ideal, the Midpoint algorithm, and the Last Sample Last Touch algorithm in terms of smoothness.

Conclusion

Does any of this matter? Does a Frame Uniformity decrease of 5 to 2 actually mean anything? What about adding one frame of latency? Below is a video of the Last Sample Last Touch interpolation algorithm on a Flame device with and without touch interpolation. Make sure you push the HD button. *Hint: Try to see which one you think is smoother and the answer for which device has touch interpolation is at the bottom. It's difficult to see the "smoothness" on a video, but in person it is very noticeable. Since I'm swiping my finger back and forth, you can see the touch latency as well. Is it worth it to trade smoothness for latency? I'm not sure, but at least it's smooth.

The last question is, is it noticeable in day to day use? Smoothness while tracking your finger is dramatically improved. Smoothness while flinging or already in a scroll is an animation and is somewhat affected by this. Scrolls will be appear slower because we think we're scrolling slower and thus calculate a slower velocity. Smoothness while slowing down a scroll is unaffected as that is also an animation. However, in terms of smoothness while tracking your finger, it's pretty awesome.

Make sure to push the HD button for moar HDs!
 

Appendix:

All the algorithms on one giant graph:

* The device on the left has touch interpolation.