Memory bandwidth is a precious commodity. As far as graphics cards are concerned, this is the rate at which the GPU can transfer data to or from memory and is measured in bytes (or more likely, gigabytes) per second. The typical bandwidth of modern graphics hardware can range anywhere from 20 GB/s for integrated GPUs to over 300 GB/s for enthusiast products. However, with add-in boards, data must cross the PCI-Express bus (the connector that the board plugs into), and its bandwidth is typically around 6 GB/s. Memory bandwidth affects fill rate and texture rate, and these are often quoted as performance figures when rating GPUs. One thing that is regularly overlooked, though, is the bandwidth consumed by vertex data.

## Vertex Rates

Most modern GPUs can process at least one vertex per clock cycle. Multiple vendors are shipping GPUs that can process two, three, four, or even five vertices on each and every clock cycle. The core clock frequencies of these high-end GPUs hover around the 1 GHz mark, which means that some of these beasts can process three or four *billion* vertices every second. Just to put that in perspective, if you had a single point represented by one vertex for every pixel on a 1080p display, you’d be able to fill it at almost 2000 frames per second. As vertex rates get this high, we need to consider the amount of memory bandwidth required to fill the inputs to the vertex shader.

Let’s assume a simple indexed draw (the kind produced by `glDrawElements`) with only a 3-element floating-point position vector per vertex. With 32-bit indices, that’s 4 bytes for each index and 12 bytes for each position vector (3 elements of 4 bytes each), making 16 bytes per vertex. Assuming an entry-level GPU with a vertex rate of one vertex per clock and a core clock of 800 MHz, the amount of memory bandwidth required works out to 12.8 GB/s (16 bytes × 800 × 10^6 vertices per second). That’s about twice the available PCI-Express bandwidth, almost half of a typical CPU’s system memory bandwidth, and likely a measurable percentage of our hypothetical entry-level GPU’s memory bandwidth. Strategies such as vertex reuse can reduce the burden somewhat, but it remains considerable.
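The arithmetic is easy to sanity-check in code. A minimal sketch, using the hypothetical figures from the text (the function name is illustrative):

```c
#include <stdint.h>

/* Bytes of vertex data fetched per second: vertices per second times
   bytes fetched per vertex. For the hypothetical entry-level GPU in
   the text: one vertex per clock at 800 MHz, fetching one 32-bit
   index plus three 32-bit floats of position (16 bytes). */
static double vertex_bandwidth(double vertices_per_second,
                               double bytes_per_vertex)
{
    return vertices_per_second * bytes_per_vertex;
}

/* vertex_bandwidth(800.0e6, sizeof(uint32_t) + 3 * sizeof(float))
   is 12.8e9 bytes per second, i.e. 12.8 GB/s. */
```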

## High Memory Bandwidth

Now, let’s scale this up to something a little more substantial. Our new hypothetical GPU runs at 1 GHz and processes four vertices per clock cycle. Again, we’re using a 32-bit index and three 32-bit components for our position vector. However, now we add a normal vector (another 3 floating-point values, or 12 bytes per vertex), a tangent vector (3 floats again), and a single texture coordinate (2 floats). In total, we have 48 bytes per vertex (4 + 12 + 12 + 12 + 8), and 4 billion vertices per second (4 vertices per clock at 1 GHz). That’s 192 GB/s of vertex data: more than 30 times the PCI-Express bandwidth, and several times a typical CPU’s system memory bandwidth (the kind of rate you’d get from `memcpy`). Such a GPU might have a memory bandwidth of around 320 GB/s, so that vertex rate alone would consume 60% of the GPU’s total memory bandwidth.
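Laying the fat vertex out as a struct makes the byte count concrete. A sketch with illustrative names (all members are 4-byte floats, so the compiler adds no padding):

```c
#include <stdint.h>

/* The fat vertex layout described above; the struct and field names
   are illustrative. Every member is a 4-byte float, so there is no
   padding and sizeof(vertex) is 44. */
typedef struct
{
    float position[3]; /* 12 bytes */
    float normal[3];   /* 12 bytes */
    float tangent[3];  /* 12 bytes */
    float texcoord[2]; /*  8 bytes */
} vertex;

/* Adding the 32-bit index gives 48 bytes fetched per vertex. At
   4 vertices per clock and 1 GHz, that is 4.0e9 * 48 bytes per
   second, or 192 GB/s. */
enum { BYTES_PER_VERTEX = sizeof(uint32_t) + sizeof(vertex) };
```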

Clearly, if we blast the graphics pipeline with enough raw vertex data, we’re going to be sacrificing a substantial proportion of our memory bandwidth to vertex data. So, what should we do about this?

## Optimization

Well, first, we should evaluate whether such a ridiculous amount of vertex data is really necessary for our application. Again, if each vertex produces a single pixel point, 4 vertices per clock at 1 GHz is enough to fill a 1080p screen at 2000 frames per second. Even at 4K (3840 * 2160), that’s still enough single points to produce 480 frames per second. You don’t need that. Really, you don’t. However, if you decide that yes, actually, you do, then there are a few tricks you can use to mitigate this.

- Use indexed triangles (either `GL_TRIANGLE_STRIP` or `GL_TRIANGLES`). If you do use strips, keep them as long as possible and try to avoid using restart indices, as these can hurt parallelization.
- If you’re using independent triangles (not strips), try to order your vertices such that the same vertex is referenced several times in short succession if it is shared by more than one triangle. There are tools that will re-order the vertices in your mesh to make better use of GPU caches.
- Pick vertex data formats that make sense for you. Full 32-bit floating point is often overkill. For example, 16-bit unsigned normalized texture coordinates will likely work great, assuming all of your texture coordinates are between 0.0 and 1.0. Packed data formats also work really well. You might choose `GL_INT_2_10_10_10_REV` for normal and tangent vectors, for example.
- If your renderer has a depth-only pass, disable any vertex attributes that don’t contribute to the vertices’ positions.
- Use features such as tessellation to amplify geometry, or techniques such as parallax occlusion mapping to give the appearance of higher geometric detail.
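To give a feel for what a packed format buys you, here is one way to build the 32-bit value for a signed-normalized `GL_INT_2_10_10_10_REV`-style normal on the CPU: three signed 10-bit components with x in the low bits, squeezing 12 bytes of floats into 4. A minimal sketch; a robust version would round to nearest and clamp its inputs:

```c
#include <stdint.h>

/* Pack a unit vector into the 32-bit layout used by
   GL_INT_2_10_10_10_REV with normalization enabled: x in bits 0-9,
   y in bits 10-19, z in bits 20-29, with 2 spare bits on top.
   A minimal sketch; it assumes each input is already in [-1, 1]. */
static uint32_t pack_snorm_2_10_10_10(float x, float y, float z)
{
    /* Map [-1, 1] to [-511, 511], then keep the low 10 bits of the
       two's-complement representation of each component. */
    uint32_t xi = (uint32_t)(int32_t)(x * 511.0f) & 0x3ffu;
    uint32_t yi = (uint32_t)(int32_t)(y * 511.0f) & 0x3ffu;
    uint32_t zi = (uint32_t)(int32_t)(z * 511.0f) & 0x3ffu;
    return xi | (yi << 10) | (zi << 20);
}
```

With data packed this way, the vertex attribute is declared with `glVertexAttribPointer(index, 4, GL_INT_2_10_10_10_REV, GL_TRUE, stride, offset)` and the GPU unpacks it for free on fetch.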

In the limit, you can forgo fixed-function vertex attributes, using only position, for example. Then, if you can determine that for a particular vertex you only need a subset of the vertex attributes, fetch them explicitly from shader storage buffers. In particular, if you are using tessellation or geometry shaders, you have access to an entire primitive in one shader stage. There, you can perform simple culling (project the bounding box of the primitive into screen space) using only the position of each vertex and then only fetch the rest of the vertex data for vertices that are part of a primitive that will contribute to the final scene.

## Summary

Don’t underestimate the amount of pressure that vertex data can put on the memory subsystem of your GPU. Fat vertex attributes coupled with high geometric density can quickly eat up a double-digit percentage of your graphics card’s memory bandwidth. Dynamic vertex data that is modified regularly by the CPU can burn through the available PCI-Express bandwidth. Optimizing for vertex caches can be a big win with highly detailed geometry. While we worry about shader complexity, fill rates, texture sizes and texture bandwidth, we often overlook the cost of vertex processing — bandwidth is not free.

Hi Graham!

Nice post.

I just wanted to add some resources on reducing vertex size.

As I mentioned in my Ogre 2.0 proposal (pages 115-116), both CryEngine 3 and Just Cause 2 devs have shared their knowledge about how to trim the vertex size:

* Both devs use 16-bit data for position

* Both devs use 16-bit data for UVs (Just Cause 2 uses two pairs of UVs)

* CryEngine encodes normal and tangent data using what they refer to as “QTangents”, which is basically sending a quaternion with a few tricks for corner cases.

* JC2 encodes normal and tangent data using spherical coordinates (page 23) in lower precision.

The end result is a reduction to 20-24 bytes per vertex while still maintaining a lot of information.

There are also other ways of encoding normals (for example Crytek’s best-fit-normal technique, which is used by them to store normals in the G-Buffer)

Cheers!

Matías