Vertex Array Performance

There have been some questions recently about the performance benefits of using vertex array objects (VAOs). Vertex array objects are part of the core OpenGL specification, and in a core profile you must have one bound in order to draw anything. This change to the specification prompted quite a bit of complaining from fans of legacy OpenGL, and some experts in the field have even claimed that using vertex array objects is slower than using traditional vertex attributes on all implementations. I was fairly certain that this wasn’t true, so I decided to construct a benchmark to test that claim.

Comparison

First, we need to figure out what we’re comparing. On the one hand, we have traditional vertex attributes. We’ll be using the glVertexAttribPointer and glEnableVertexAttribArray functions quite a bit here. Applications will make a sequence of calls to glVertexAttribPointer and glEnableVertexAttribArray, possibly interspersed with calls to glBindBuffer or glDisableVertexAttribArray. If vertex attributes are stored interleaved in a single buffer, there would likely be one call to glBindBuffer followed by a bunch of calls to glVertexAttribPointer and glEnableVertexAttribArray or glDisableVertexAttribArray. For interleaved attributes in a single buffer, the code might look like this:

glBindBuffer(GL_ARRAY_BUFFER, vertex_buffer);
for (i = 0; i < num_attributes; i++)
{
    if (attrib[i].enabled)
    {
        glVertexAttribPointer(i,
                              attrib[i].size,
                              attrib[i].type,
                              attrib[i].normalized,
                              attrib[i].stride,
                              (const void *)attrib[i].offset); /* byte offset into the bound buffer */
        glEnableVertexAttribArray(i);
    }
    else
    {
        glDisableVertexAttribArray(i);
    }
}

If each of the vertex attributes is in its own buffer, then the setup code will look similar, but with a buffer bind moved into the per-attribute loop. It might look something like this:

for (i = 0; i < num_attributes; i++)
{
    if (attrib[i].enabled)
    {
        glBindBuffer(GL_ARRAY_BUFFER, attrib[i].buffer);
        glVertexAttribPointer(i,
                              attrib[i].size,
                              attrib[i].type,
                              attrib[i].normalized,
                              attrib[i].stride,
                              (const void *)attrib[i].offset); /* byte offset into attrib[i].buffer */
        glEnableVertexAttribArray(i);
    }
    else
    {
        glDisableVertexAttribArray(i);
    }
}

This code would be executed before each draw (or at least whenever the vertex attributes or buffer bindings change). A vertex array object, on the other hand, retains all of this state. Code such as that shown above would be executed once to initialize a vertex array object, and then each time the attribute configuration needs to change, a single call to glBindVertexArray would restore that state.
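To make "retains all of this state" concrete, here is a toy model in plain C. It is not how any driver is implemented and it calls no OpenGL functions. Every name in it (attrib_state, vao_model, context_model, vao_set_attrib, bind_vao) is hypothetical and exists only for illustration. The point is that the per-attribute work happens once, and switching afterwards can be as cheap as changing one pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTRIBS 16  /* the minimum attribute count discussed in this post */

/* Hypothetical record of the state carried by one vertex attribute. */
typedef struct {
    int          enabled;
    int          size;        /* components per vertex, 1 to 4 */
    unsigned int type;        /* for example, the value of GL_FLOAT */
    int          normalized;
    int          stride;
    size_t       offset;      /* byte offset into the buffer */
    unsigned int buffer;      /* buffer that was bound when the pointer was set */
} attrib_state;

/* Hypothetical stand-in for a VAO: the complete set of attribute state. */
typedef struct {
    attrib_state attrib[MAX_ATTRIBS];
} vao_model;

/* Hypothetical stand-in for the context: it only refers to the current VAO. */
typedef struct {
    const vao_model *current;
} context_model;

/* Done once per vertex layout: record one attribute in the VAO. */
static void vao_set_attrib(vao_model *vao, int index, attrib_state state)
{
    assert(index >= 0 && index < MAX_ATTRIBS);
    vao->attrib[index] = state;
}

/* Done before each draw: a single pointer assignment. */
static void bind_vao(context_model *ctx, const vao_model *vao)
{
    ctx->current = vao;
}
```

In this model, the setup loops shown earlier correspond to repeated calls to vao_set_attrib, and glBindVertexArray corresponds to bind_vao.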

So now we have a few things to compare. We can compare vertex attributes interleaved into a single buffer or spread across multiple buffers, and we can compare configuration of this state with and without a VAO. Furthermore, the relative performance of using the VAO may be dependent on the number of vertex attributes configured. All OpenGL implementations support a minimum of 16 vertex attributes, and so we can determine the performance of each configuration using between 1 and 16 attributes.

The Benchmark

Now that we know what we want to measure, we can construct our test case. We will examine four cases. The first case uses traditional calls to glVertexAttribPointer and glEnableVertexAttribArray with a varying number of attributes packed into a single buffer. The second case also uses traditional API calls, but with each attribute stored in its own buffer, necessitating a call to glBindBuffer for each one. The third and fourth cases configure the vertex attributes in the same two ways, but store the state in a VAO. Before each draw, they execute a single call to glBindVertexArray.

On each iteration of the benchmark, we’ll configure the vertex attributes and then draw a single point. The goal here is to be bound by software performance and to reduce the influence of the GPU as much as possible. The draw exists to ensure that the OpenGL implementation processes the requested state changes. The benchmark I constructed performs 1000 configure-and-draw cycles in a loop and then repeats that loop until a second has passed. It then calculates the total number of iterations it was able to execute in one second. The measurement is taken for each attribute setup method (4 of them) and for varying numbers of attributes between 1 and 16.
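The timing loop described above can be sketched roughly as follows. This is not the benchmark’s actual source. The names measure_cycles_per_second, cycle_fn and count_cycle are hypothetical, and count_cycle is only a placeholder where the real attribute setup and draw would go. The sketch uses the standard clock() function, which reports processor time on most platforms, and divides by the measured elapsed time rather than assuming exactly one second.

```c
#include <assert.h>
#include <time.h>

#define CYCLES_PER_BATCH 1000

/* Hypothetical callback type: performs one configure-then-draw cycle. */
typedef void (*cycle_fn)(void *user);

/* Placeholder cycle that only counts invocations. A real benchmark would
   configure the vertex attributes and draw a single point here. */
static void count_cycle(void *user)
{
    ++*(unsigned long *)user;
}

/* Runs batches of CYCLES_PER_BATCH cycles until at least `seconds` have
   elapsed according to clock(), then returns the cycles per second. */
static double measure_cycles_per_second(cycle_fn cycle, void *user, double seconds)
{
    assert(cycle != NULL && seconds > 0.0);

    unsigned long total = 0;
    double elapsed = 0.0;
    const clock_t start = clock();

    do {
        for (int i = 0; i < CYCLES_PER_BATCH; i++)
            cycle(user);
        total += CYCLES_PER_BATCH;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < seconds);

    return (double)total / elapsed;
}
```

Checking the clock only once per batch keeps the cost of the timer itself out of the inner loop, which matters when each cycle is very short.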

The Results

The results are shown below. I took measurements using an NVIDIA GTX 580 and an AMD Radeon HD 7970 with the latest publicly available drivers from each vendor. The CPU used for the test was an Intel Core i7 960 running at 3.2 GHz and the machine was running 64-bit Windows 8. The first graph shows the absolute performance numbers. The X axis represents the number of vertex attributes configured and the Y axis represents the number of loop iterations executed per second; higher is better.

[Figure: Vertex Array Object Raw Performance]

As expected, with traditional attributes, the number of iterations per second decreases as the number of vertex attributes increases. This is also the case with VAOs (to a lesser degree) on the NVIDIA implementation, but on the AMD implementation, the cost of switching VAO appears roughly constant. Either way, using a VAO is consistently faster than not using one on both implementations.

The following graph shows the relative performance, which is what we are interested in (the goal here is not to compare implementations to one another, but to compare the use of a VAO to not using one). At each data point, I’ve plotted the ratio of performance using a VAO to using traditional vertex attribute setup for the same attribute configuration. Numbers higher than 1 indicate that using a VAO is faster than not using one. Again, we can consistently see that irrespective of implementation, using a VAO is better than not using one.

[Figure: Vertex Array Object Relative Performance]

In the graph above, all implementations do substantially better at switching VAO than at reconfiguring vertex attributes using independent API calls. Even in the worst case (a single attribute on the AMD implementation), switching VAO is at least twice as fast as making individual API calls. The relative performance of using a VAO over not doing so steadily increases as the number of vertex attributes increases.
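For reference, the ratio plotted in the relative performance graph is simply the VAO rate divided by the traditional rate for the same attribute configuration. A minimal sketch, in which relative_performance and its parameter names are hypothetical and no measured data is included:

```c
#include <assert.h>
#include <stddef.h>

/* For each attribute count i, stores vao_rate[i] / traditional_rate[i]
   in ratio[i]. A value above 1 means the VAO path was faster. */
static void relative_performance(const double *vao_rate,
                                 const double *traditional_rate,
                                 double *ratio,
                                 size_t count)
{
    for (size_t i = 0; i < count; i++) {
        assert(traditional_rate[i] > 0.0);
        ratio[i] = vao_rate[i] / traditional_rate[i];
    }
}
```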

Analysis and Conclusion

You can see from the results above that, at least for our small sample set, VAO is faster on all implementations. It stands to reason: there are fewer parameters to validate when calling glBindVertexArray than when calling either glBindBuffer or glVertexAttribPointer. Even when there is only a single vertex attribute, a VAO switch makes at most half as many calls into OpenGL as an explicit update of a global VAO. Besides the obvious “fewer API calls means faster execution” relationship, the VAO is a place where an OpenGL driver can stash information required to program the underlying GPU. The total amount of state changes sent to the GPU is the same either way, so the savings are on the CPU side.
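The call-count argument can be made concrete by counting the OpenGL calls that the two setup listings earlier in this post make per configuration. The helper names below (calls_interleaved, calls_separate, calls_vao) are hypothetical, and the counts assume the loops exactly as written there: an enabled attribute costs a pointer call plus an enable call, and a disabled one costs a single disable call.

```c
#include <assert.h>

/* Interleaved listing: one glBindBuffer, then for each attribute either
   glVertexAttribPointer + glEnableVertexAttribArray (2 calls) or
   glDisableVertexAttribArray (1 call). */
static int calls_interleaved(int num_attributes, int num_enabled)
{
    assert(num_enabled >= 0 && num_enabled <= num_attributes);
    return 1 + 2 * num_enabled + (num_attributes - num_enabled);
}

/* Separate-buffer listing: glBindBuffer + glVertexAttribPointer +
   glEnableVertexAttribArray (3 calls) per enabled attribute, and one
   glDisableVertexAttribArray for each of the others. */
static int calls_separate(int num_attributes, int num_enabled)
{
    assert(num_enabled >= 0 && num_enabled <= num_attributes);
    return 3 * num_enabled + (num_attributes - num_enabled);
}

/* VAO path: a single glBindVertexArray, whatever the attribute count. */
static int calls_vao(void)
{
    return 1;
}
```

With 16 enabled attributes this gives 33 calls for the interleaved case and 48 for separate buffers, against a single call for the VAO. Call count is only part of the cost, so these numbers should not be read as predicted speedups.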

Depending on the number of vertex attributes, appropriate use of a VAO can be between 2 and 15 times faster than just creating a dummy VAO, leaving it bound and mutating it on demand. On the AMD implementation, using a VAO produces roughly constant performance regardless of the number of vertex attributes. On the NVIDIA implementation, performance of the VAO drops off as the number of vertex attributes increases. Even so, switching VAO remains substantially faster than directly mutating vertex attributes. On NVIDIA, using a VAO with a single buffer produces a relatively constant performance improvement over traditional vertex attributes, whereas on AMD, the relative improvement continues to increase even for interleaved attributes. This suggests that the AMD implementation of glVertexAttribPointer is substantially more expensive than that of NVIDIA.

A common usage pattern seen in rendering engines is that several meshes share a common vertex format layout but each uses its own buffer or buffers. To implement this model with a VAO, you do need to re-specify the vertex format state (call glVertexAttribPointer) whenever the buffer bindings change, which eliminates its benefit. There are ways to do this efficiently in recent versions of OpenGL. For example, see glBindVertexBuffer and glVertexAttribFormat. I don’t have a benchmark for these functions yet; I’ll create one and share the results in a future post. However, these functions were introduced in OpenGL 4.3 and so may not have widespread support just yet, whereas VAO has been in core OpenGL since version 3.0. Even where the newer functions are available, you’d still need a call to glBindVertexBuffer for each buffer referenced by the draw, suggesting that this method may still be quite a bit more expensive than switching VAO.

Now, the benchmark I constructed for this is admittedly pretty crude. It does not check for correctness and so could quite well either contain a bug or be hitting a bug in one or both of the tested implementations. I wrote this against the sb6 framework that I used in the book, which wasn’t really designed for benchmarking. However, the results in the graphs do seem to make sense, and the application does not generate errors on either implementation. I will tidy it up, validate the results and post the source online in the OpenGL SuperBible GitHub Repository. Although the newer glBindVertexBuffer and glVertexAttribFormat functions may produce acceptable performance if your usage model is already to switch vertex buffers at high frequency (say, if you’re doing a simple port from Direct3D), I would say that the advice to simply “skip VAOs” is quite possibly misinformed.
