A Short Walkthrough of WebRender 2

I've been learning quite a bit about WebRender, a new research rendering engine for web content. At its core, WebRender uses the GPU to draw web content, where traditionally the CPU has done that work. Currently Firefox uses an immediate mode API to draw content (note that immediate and retained mode are very broad, overloaded terms). With an immediate mode API, you usually have a context object, set different states on that context, and then draw things onto the screen one item at a time. A retained mode API, which WebRender uses, is slightly different: rendering happens in two passes. First, you set up and upload data to the GPU and issue draw calls, but nothing is actually drawn yet. Then you tell the GPU "draw everything I told you about". This is a fundamentally different way to think about programming, and it has been wrecking my brain the past couple of days. WebRender itself takes display items as input and outputs pixels to the screen. Because it gets a complete description of the layout up front, it can analyze how best to optimize the commands it sends to the GPU.
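To make the distinction concrete, here's a caricature of the retained mode idea in Rust. Every name here is invented for illustration; this is not Servo or WebRender code:

```rust
// Immediate mode (caricature): each call changes pixels right away, e.g.
//   ctx.set_color(red);
//   ctx.fill_rect(0, 0, 100, 50); // drawn immediately
//
// Retained mode (caricature): first describe everything, then submit.
struct DisplayItem {
    kind: &'static str,
    rect: (f32, f32, f32, f32), // x, y, width, height
}

struct DisplayList {
    items: Vec<DisplayItem>,
}

impl DisplayList {
    fn new() -> Self {
        DisplayList { items: Vec::new() }
    }

    // Nothing is drawn here; we only record a description of the scene.
    fn push_rect(&mut self, rect: (f32, f32, f32, f32)) {
        self.items.push(DisplayItem { kind: "SolidColor", rect });
    }

    // Only at submit time would a renderer analyze the whole list
    // and issue optimized GPU commands.
    fn submit(&self) -> usize {
        self.items.len() // stand-in for "commands issued"
    }
}

fn main() {
    let mut list = DisplayList::new();
    list.push_rect((0.0, 0.0, 100.0, 50.0));
    list.push_rect((8.0, 8.0, 200.0, 50.0));
    assert_eq!(list.submit(), 2);
    println!("submitted {} items", list.submit());
}
```

The key property is that `push_rect` has no visible effect; the renderer sees the entire scene before deciding how to draw it.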

Let's take a high level overview of what's going on with WebRender.

We've got a number of steps here that we'll delve into a bit further, but the overview is:

  1. Servo performs layout, which generates Display Items. Display items are just descriptions of what to render like "Text" and "Images".
  2. WebRender takes these display items and converts them to a Primitive type, a 1:1 representation of layout's Display Items. These could technically be the same as the Display Items in (1), but sometimes it's easier to deal with a different data structure.
  3. WebRender assigns these Primitives to Layers, which are not the same concept as layers in Gecko. Instead Layers here are Stacking Contexts.
  4. We then generate screen tiles, which at the moment are uniform 64x64 device pixel tiles.
  5. Assign the Primitives in each Layer to their appropriate screen tiles.
  6. Break down the Primitives per tile into Packed Primitives. Packed Primitives are the same Primitive information broken down into GPU uploadable data.
  7. Gather the list of GPU resources that need to be updated.
  8. Notify the Compositor thread (which is the main thread) that we're ready to render.
  9. The main thread / Compositor issues OpenGL draw commands and updates GPU resources based on the information from (7). All actual GPU interactions occur on the main thread.
  10. ???
  11. Profit!
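Steps 1 and 2 can be sketched roughly like this. The types and field names are invented for illustration; WebRender's actual Rust types differ:

```rust
// Layout's output: a plain description of what to render.
enum DisplayItem {
    SolidColor { rect: [f32; 4], rgba: [f32; 4] },
    Image { rect: [f32; 4], url: String },
}

// WebRender's internal mirror of those items, 1:1 with layout's.
enum Primitive {
    Rectangle { rect: [f32; 4], color: [f32; 4] },
    Image { rect: [f32; 4], key: String },
}

// Step 2: a straightforward conversion from one representation
// to the other, item by item.
fn to_primitive(item: &DisplayItem) -> Primitive {
    match item {
        DisplayItem::SolidColor { rect, rgba } => Primitive::Rectangle {
            rect: *rect,
            color: *rgba,
        },
        DisplayItem::Image { rect, url } => Primitive::Image {
            rect: *rect,
            key: url.clone(),
        },
    }
}

fn main() {
    let item = DisplayItem::Image {
        rect: [8.0, 8.0, 200.0, 50.0],
        url: "50px.png".to_string(),
    };
    match to_primitive(&item) {
        Primitive::Image { rect, key } => {
            assert_eq!(rect, [8.0, 8.0, 200.0, 50.0]);
            assert_eq!(key, "50px.png");
        }
        _ => panic!("expected an image primitive"),
    }
    println!("converted display item to primitive");
}
```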

Let's walk through what is actually happening for a small test case.

div {
  width: 100px;
  height: 50px;
  background-image: url("50px.png");
}
In this test case, all we're trying to do is render an image onto the screen. Currently, WebRender only works inside Servo, so we have to load the test case through Servo to do actual layout and generate the Display Items for us. We can tell Servo to dump the display list for us and we get this:

┌ Display List
│  ├─ Items
│  │  ├─ SolidColor rgba(0, 0, 0, 0) @ Rect(1024px×740px at (0px,0px))
│  │  ├─ SolidColor rgba(0, 0, 0, 0) @ Rect(1008px×50px at (8px,8px))
│  │  ├─ SolidColor rgba(0, 0, 0, 0) @ Rect(200px×50px at (8px,8px))
│  │  └─ Image @ Rect(200px×50px at (8px,8px))
│  ├─ Stacking Contexts
│  │  ├─ Layered StackingContext at Rect(1024px×740px at (0px,0px))

We have three SolidColor rects along with one Image item within one Stacking Context. Let's focus only on the Image display item for now. We start by iterating over each Display Item given to us by layout and converting it to a Primitive type; the image Display Item's information becomes an ImagePrimitive. Next, we assign these Primitives to Layers, which are 1:1 with Stacking Contexts. From the display list dump, we only have one Stacking Context, so we'll only have one Layer in this case. However, the wrinkle here is that we do this based on tiles. WebRender splits the screen into uniform 64x64 tiles. For each tile, we check whether the Layer intersects it. If so, for every Primitive in the Layer, we check whether the Primitive intersects the tile, and store which Primitives exist on the tile. This way every tile knows what it needs to actually render.
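A sketch of that tile assignment in Rust (the rectangle type, tile loop, and names are all illustrative, and the per-layer intersection check is skipped for brevity):

```rust
#[derive(Clone, Copy)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

impl Rect {
    fn intersects(&self, other: &Rect) -> bool {
        self.x < other.x + other.w && other.x < self.x + self.w &&
        self.y < other.y + other.h && other.y < self.y + self.h
    }
}

const TILE_SIZE: f32 = 64.0;

// For each 64x64 screen tile, record the indices of the primitives
// that overlap it.
fn assign_to_tiles(screen: (f32, f32), prims: &[Rect]) -> Vec<(usize, usize, Vec<usize>)> {
    let cols = (screen.0 / TILE_SIZE).ceil() as usize;
    let rows = (screen.1 / TILE_SIZE).ceil() as usize;
    let mut tiles = Vec::new();
    for ty in 0..rows {
        for tx in 0..cols {
            let tile = Rect {
                x: tx as f32 * TILE_SIZE,
                y: ty as f32 * TILE_SIZE,
                w: TILE_SIZE,
                h: TILE_SIZE,
            };
            let hits: Vec<usize> = prims.iter().enumerate()
                .filter(|(_, p)| p.intersects(&tile))
                .map(|(i, _)| i)
                .collect();
            if !hits.is_empty() {
                tiles.push((tx, ty, hits));
            }
        }
    }
    tiles
}

fn main() {
    // The 100x50 test case at (8, 8) spans two tiles horizontally, one vertically.
    let image = Rect { x: 8.0, y: 8.0, w: 100.0, h: 50.0 };
    let tiles = assign_to_tiles((128.0, 64.0), &[image]);
    assert_eq!(tiles.len(), 2);
    println!("image primitive lands on {} tiles", tiles.len());
}
```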

Once we have all the Primitives for each tile, we go through each tile and create the PackedPrimitives that are local to it. Let's assume for simplicity that device pixels are equal to CSS pixels, i.e. 1px in the CSS specification means 1 physical pixel on the monitor (which isn't always true, but it keeps things simple). Since this test case is 100x50, it spans two 64x64 tiles horizontally and one tile vertically, for 2 total. The PackedPrimitive has information about which tile it's assigned to, which will be used later in the shader. What's critical here is that WebRender will upload the same PackedImagePrimitive N times, once per tile, with really the only difference being the tile id. In addition, it uploads some data called st0 and st1. These are references to the top left and bottom right coordinates of the image in a Texture Atlas. A Texture Atlas is one giant image composed of other smaller images. Instead of creating a separate GPU buffer for just our image, WebRender uses one giant texture, with parts of the Texture Atlas used for individual images. The vertex shader will reference these coordinates later as sampling coordinates to render the actual image from the Texture Atlas. A list of Packed Primitives of the same type is called a batch.
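The st0/st1 idea is just normalizing a sub-image's rectangle by the atlas dimensions. A hypothetical helper, assuming a 1024x1024 atlas and an arbitrary placement of the 50x33 source image (the placement is made up for illustration):

```rust
// Compute the normalized [0..1] atlas coordinates of a sub-image:
// st0 = top left, st1 = bottom right. Names are illustrative,
// not WebRender's actual API.
fn atlas_coords(atlas_size: (f32, f32),
                image_origin: (f32, f32),
                image_size: (f32, f32)) -> ([f32; 2], [f32; 2]) {
    let st0 = [image_origin.0 / atlas_size.0,
               image_origin.1 / atlas_size.1];
    let st1 = [(image_origin.0 + image_size.0) / atlas_size.0,
               (image_origin.1 + image_size.1) / atlas_size.1];
    (st0, st1)
}

fn main() {
    // A 50x33 image stored at (512, 0) inside a 1024x1024 atlas.
    let (st0, st1) = atlas_coords((1024.0, 1024.0), (512.0, 0.0), (50.0, 33.0));
    assert_eq!(st0, [0.5, 0.0]);
    assert_eq!(st1, [562.0 / 1024.0, 33.0 / 1024.0]);
    println!("st0 = {:?}, st1 = {:?}", st0, st1);
}
```

The shader never sees the image's pixel dimensions directly, only these normalized coordinates into the one big texture.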

Next, WebRender does some note-taking on what needs to be rendered and notifies the Compositor thread, which is really the main thread of the process, that we need to update the screen. ProTip: doing GPU operations anywhere other than the main thread means you're going to have a bad time. Once the compositor is notified, WebRender finally starts doing actual graphics things with OpenGL! The biggest brain wreck for me was realizing that with GPUs, you issue commands to the GPU but you don't know when the GPU will actually execute them. In reality, you're queuing up work that will execute sometime in the future. Second, with shaders, you map vertices to different points on a [0..1] axis and transform them onto the screen. This makes GPUs very hard to debug, as you can't actually single step through your shader to see what is going on.
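One concept the upcoming GL calls lean on heavily is instanced drawing. A CPU-side mental model of it, in illustrative Rust (not real OpenGL; the actual call would be something like glDrawElementsInstanced):

```rust
struct PackedImagePrimitive {
    tile_id: u32,
    // ... st0, st1, local rect, etc. would live here too
}

// Mental model of an instanced draw: the same 6 indices (two
// triangles) are drawn once per instance, and the vertex shader
// sees an auto-incrementing gl_InstanceID each time.
fn draw_instanced(instances: &[PackedImagePrimitive]) -> Vec<u32> {
    let mut tiles_drawn = Vec::new();
    for gl_instance_id in 0..instances.len() {
        // In the shader this is: Image image = images[gl_InstanceID];
        tiles_drawn.push(instances[gl_instance_id].tile_id);
    }
    tiles_drawn
}

fn main() {
    // Two tiles' worth of packed data, covered by one draw call.
    let batch = [PackedImagePrimitive { tile_id: 0 },
                 PackedImagePrimitive { tile_id: 1 }];
    assert_eq!(draw_instanced(&batch), vec![0, 1]);
    println!("drew {} instances in one call", batch.len());
}
```

The win is that one draw call covers every tile in the batch; only the per-instance data (here, the tile id) differs.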

If you already know OpenGL, you can skim some of this section. Otherwise, WebRender first compiles all the shaders, which are unique to each PackedPrimitive type. In our example, we have a dedicated ps_image vertex shader that gets compiled and run whenever we need to draw an image. We also set up our vertices to be two triangles that split a quad from [0,0] to [1,1]. This gets bound to the 'aPosition' vertex attribute, which we'll use in the vertex shader. Then we go through every batch and start issuing GL calls. For images, the chunk data is the PackedPrimitiveData for each tile. Thus for each tile we'll issue a draw call, but the draw call is an instanced draw call, which is a key rendering concept for WebRender. An instanced draw call increments gl_InstanceID, which is accessible in the vertex shader. This lets us draw the same indices multiple times without issuing new draw commands to the GPU, speeding up performance. Finally, we can take a look at the image vertex shader:

void main(void) {
    // gl_InstanceID is auto incremented due to instanced call.
    Image image = images[gl_InstanceID];
    Layer layer = layers[image.info.layer_tile_part.x];
    Tile tile = tiles[image.info.layer_tile_part.y];

    // This vertex's position within the CSS box. 
    // aPosition iterates through the 4 corners from [0,0] to [1,1]
    vec2 local_pos = mix(image.local_rect.xy,
                         image.local_rect.xy + image.local_rect.zw,
                         aPosition.xy);

    vec4 world_pos = layer.transform * vec4(local_pos, 0, 1);

    vec2 device_pos = world_pos.xy * uDevicePixelRatio;

    // Clamp the position to the tile position
    vec2 clamped_pos = clamp(device_pos,
                             tile.actual_rect.xy,
                             tile.actual_rect.xy + tile.actual_rect.zw);

    // Get the tile's position in relation to the CSS box.
    vec4 local_clamped_pos = layer.inv_transform * vec4(clamped_pos / uDevicePixelRatio, 0, 1);

    // vUv will contain how many times this image has wrapped around the image size.
    // These 3 variables are passed onto the fragment shader.
    vUv = (local_clamped_pos.xy - image.local_rect.xy) / image.stretch_size.xy;
    vTextureSize = image.st_rect.zw - image.st_rect.xy;
    vTextureOffset = image.st_rect.xy;

    // Transform the position onto the screen.
    vec2 final_pos = clamped_pos + tile.target_rect.xy - tile.actual_rect.xy;

    gl_Position = uTransform * vec4(final_pos, 0, 1);
}

The vertex shader reads an Image structure, which then lets the shader read the tile it's currently rendering. The positions of the 4 corners of the CSS image here (the corners of the 100x50 test case) get clamped to the tile boundaries, for every tile. Finally, it maps the tile boundaries to a position within the Texture Atlas and passes three variables (vUv, vTextureSize, and vTextureOffset) to the fragment shader. Note that vTextureSize is the size as reported in the Texture Atlas, bound between [0..1], not the actual size of the image. With OpenGL, you only map points to the texture. The vertex shader is mapping the 4 corners of each tile to a specific location of the CSS box. These locations of the CSS box map to a specific point within the background 50x33 pixel image. We're just telling the GPU where each of these points is, respectively. Consider the illustration:

Consider the tile boundary on the top right of this tile. This tile corner is some position, aPoint, within the CSS image we wanted to render, the original 100 px x 50 px image. Based on the location of aPoint, we map it to the location of the actual image in the Texture Atlas.

void main(void) {
    vec2 st = vTextureOffset + vTextureSize * fract(vUv);
    oFragColor = texture(sDiffuse, st);
}

Lastly, we're at the fragment shader! Look back at how we calculated the three variables vUv, vTextureOffset, and vTextureSize. vTextureSize is the size of the texture within the Texture Atlas. vTextureOffset is where the start of the image is within the Texture Atlas. vUv is the only interesting part, which calculates how many times the image has wrapped around at the current tile corner. In the sample case, at tile 1 with an X offset of 64, we've rendered the image 64 (the tile offset) / 50 (the image size), or 1.28 times. This means the image needs to wrap around and repeat again. Thus in the fragment shader, we take the fractional part of this calculation (0.28) and say this point maps to 28% of the way from the left of the source image. We just mapped the tile corners to the appropriate points within the source image; the GPU does the rest and interpolates everything in between. On a side note, the fragment shader gets called for every pixel, whereas the vertex shader gets called for every vertex. Since we only supplied 6 vertices for the two triangles that draw one rectangle, the vertex shader is called 6 times. We have to do the fract() calculation in the fragment shader because the actual value of vUv is interpolated between the values we supplied in the vertex shader at those 6 vertices. (E.g. if at vertex [0,0] we supplied a vUv value of 0.3 and at vertex [1,0] we supplied a vUv value of 0.6, every call to the fragment shader for pixels between [0,0] and [1,0] will get an interpolated value between 0.3 and 0.6.)
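We can check the arithmetic from this paragraph directly; the 64 and 50 come from the example above (a standalone sketch, not shader code):

```rust
// vUv repeat count for a tile corner, as computed in the text:
// tile corner x position divided by the image (stretch) width.
fn repeat_count(tile_corner_x: f32, image_width: f32) -> f32 {
    tile_corner_x / image_width
}

fn main() {
    let v_uv = repeat_count(64.0, 50.0);
    // The image has been rendered 1.28 times by this tile corner...
    assert!((v_uv - 1.28).abs() < 1e-6);

    // ...so the fragment shader's fract() keeps 0.28: this corner
    // samples 28% of the way into the source image.
    let frac = v_uv.fract();
    assert!((frac - 0.28).abs() < 1e-6);

    println!("vUv = {v_uv}, fract(vUv) = {frac}");
}
```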

Whew, that was a lot and a crazy brain dump, because GPUs are confusing. There's still a bunch more to WebRender, such as compositing the Stacking Contexts together when they need blending, but I haven't gotten there yet. You can also learn a bit more on Air Mozilla.

Thanks very much to Glenn Watson for proofreading and explaining so many things to me.