The Road to One Million Draws

Recently I posted this video on YouTube and tweeted the link:

This is a capture of the multidrawindirect sample from our example code package. It’s a pretty simple example that just puts a whole bunch of draws in one big buffer and blasts them at OpenGL with a single call to glMultiDrawArraysIndirect. The application achieves a rate of roughly 3.5 million independent draws per second and is limited by the ability of the tested hardware (an AMD Radeon HD 7970) to process vertices. By making each draw much smaller, I’ve seen consumption rates of 20 to 40 million draws per second, depending on the underlying hardware. One million draws per frame at a steady 60 Hz works out to 60 million draws per second, so it seems that we’re not that far off.

What ensued after I posted the video was a cavalcade of comments and questions about this feature, and about how useful it might be given that you can’t put any traditional state changes between the draws. This is true. You can’t, for example, bind new textures, change the sense of the depth test, or change the blending functions. Even if you could somehow change those states through some form of API enhancement, there are architectural hardware reasons [1] why you can’t blast through a million state changes per frame at 60 Hz. Obviously, this kind of performance feature puts a damper on some of the traditional graphics engine functionality that you might assume to be present. However, with a little out-of-the-box thinking, we can actually achieve most of what we want.

Per-Draw Data

The glMultiDrawArraysIndirect function effectively acts as if glDrawArrays had been called a whole bunch of times, with the parameters taken from an array of structures stored in GPU memory. Each element of that array looks like this:

typedef struct {
    uint count;
    uint instanceCount;
    uint first;
    uint baseInstance;
} DrawArraysIndirectCommand;

The members of the structure correspond to the similarly named parameters of glDrawArraysInstancedBaseInstance. In the sample from the book, I used the baseInstance member of the DrawArraysIndirectCommand structure as a draw index. I set up an instanced input to the vertex shader and configured this attribute to read from the same buffer that stores the indirect draw commands. That is, the instanced integer input to the vertex shader was fed from the baseInstance member of the structure, and the baseInstance member of each structure was filled with its own index in the array of structures. The result was that the vertex shader input received the index of the draw in the list. I then used that value to programmatically generate the transformation matrix and color of each asteroid.
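As a rough CPU-side sketch of that setup, the loop below fills an array of commands so that each draw carries its own index in baseInstance. The helper name fill_draw_commands and its parameters are my own illustration, not code from the sample, and the array stands in for the contents of the indirect draw buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as the DrawArraysIndirectCommand structure shown above. */
typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t baseInstance;
} DrawArraysIndirectCommand;

/* Hypothetical helper: every draw renders the same vertex range once and
 * stores its own position in the list in baseInstance. */
static void fill_draw_commands(DrawArraysIndirectCommand *cmds,
                               size_t num_draws,
                               uint32_t first_vertex,
                               uint32_t vertex_count)
{
    for (size_t i = 0; i < num_draws; ++i) {
        cmds[i].count         = vertex_count;
        cmds[i].instanceCount = 1;
        cmds[i].first         = first_vertex;
        cmds[i].baseInstance  = (uint32_t)i;  /* draw index */
    }
}
```

If the same buffer is then bound as the source of an integer attribute with a stride of sizeof(DrawArraysIndirectCommand), an offset of offsetof(DrawArraysIndirectCommand, baseInstance) and a divisor of 1, draw i should fetch element i of that attribute, which is the value written here.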

Now, there’s nothing to say that I needed to point that instanced vertex attribute at the baseInstance field of the DrawArraysIndirectCommand structure. That was just a convenient unique-per-draw value that was already in memory. I could just as easily have pointed it at another buffer containing only parameters, and those would have been fed to the vertex shader along with any of the regular vertex attributes. If all you really need is the index of the draw (with which you can do whatever you like), then you might want to check out the GL_ARB_shader_draw_parameters extension, which exposes several new parameters to the shader, including the draw index as gl_DrawIDARB. However, this extension is fairly new and it may take a while to see universal support.

Per-Draw Constants

If all you need is one or two values per draw, your best bet is probably to hook up the baseInstance field as I describe above, or to use the gl_DrawIDARB variable exposed by the GL_ARB_shader_draw_parameters extension. If you have more constant data than that (such as bone matrices) or if the data is required in the fragment shader (such as material properties), you can pack that data into a large uniform block and index into it using the per-draw parameter derived earlier. For vertex shader use, just index into the uniform block. For shader stages beyond the vertex shader (tessellation, geometry or fragment), pass the index data along the pipe and use it in the target shader stage.

If you have a seriously huge amount of data required per draw, then it might be an idea to use shader storage buffers. These are large buffers that can store unbounded arrays of structures, and may be written to as well as read. Depending on the architecture you’re running on, the performance of loading from a shader storage buffer may be lower than that of uniform buffers. If possible, use uniform buffers. Of course, you can also store data in textures or texture buffers, if that suits your purposes. You may be able, for example, to take advantage of texture compression by storing parameter data in compressed textures.
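To get a feel for when a uniform block stops being big enough, here is a small sketch of a CPU-side mirror of one per-draw record. The PerDrawData structure and its members are hypothetical; the layout assumes the matching GLSL block uses std140 and holds an array of structures containing a mat4, a vec4 and a uint, in that order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-draw record, padded to match a std140 array element. */
typedef struct {
    float    model[16];  /* mat4: four vec4 columns, offset 0      */
    float    color[4];   /* vec4, offset 64                        */
    uint32_t material;   /* uint, offset 80                        */
    uint32_t pad[3];     /* rounds the array stride up to 96 bytes */
} PerDrawData;

/* Number of per-draw records that fit into one uniform block of the given
 * size (for example, the value queried via GL_MAX_UNIFORM_BLOCK_SIZE). */
static size_t max_draws_per_block(size_t block_size_bytes)
{
    return block_size_bytes / sizeof(PerDrawData);
}
```

With the 16384-byte block size that OpenGL guarantees as a minimum, this record allows only 170 draws per block, which is the sort of arithmetic that pushes you toward shader storage buffers or textures for very large draw counts.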

Per-Draw Textures

There are essentially three ways to use a different texture or set of textures per draw. The first is to use an array of textures in the shader. Because gl_DrawIDARB and anything derived from it, even implicitly (i.e., anything you can guarantee is constant across a draw), is considered dynamically uniform, you’re safe to index into arrays of textures in any shader stage. For simple cases, you might index into an array of textures using gl_DrawIDARB as the index. For more complex cases, you could index into that array using some property of a material, which may be stored in a uniform block as described above. With this method, however, you’re going to be limited by the number of traditional texture binding points supported by the OpenGL implementation, which is typically in the range of 16 to 32 per shader stage.

To break the limit imposed by traditional texture binding points, we can move to array textures. These are textures that have several layers. The layers must all be the same size and format, but array textures can be really, really big, easily big enough to exhaust the memory of the GPU. If your standard material consists, say, of a diffuse color map, a normal map, and some other data such as specular coefficients, then you might consider using three array textures, with one layer assigned to each material. You can then just index into the array using the material index. If the textures for various materials are of different sizes, you can load the smaller ones into the lower-resolution mip levels of the array texture and apply a mip bias per material. The GL_ARB_sparse_texture extension can help mitigate the wasted virtual address space consumed by the unused high-resolution mip levels by simply making the unused levels non-resident.
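The per-material mip bias is just the number of halvings between the array texture’s base size and the material’s own texture size. A minimal sketch, assuming square power-of-two textures; the function name mip_level_for_size is made up for illustration.

```c
#include <stdint.h>

/* Returns the mip level of an array texture with a base size of base_size
 * texels whose size equals tex_size, or -1 if the texture cannot be placed
 * (not a power of two, or larger than the base level). */
static int mip_level_for_size(uint32_t base_size, uint32_t tex_size)
{
    if (tex_size == 0 || tex_size > base_size)
        return -1;
    if ((base_size & (base_size - 1)) != 0 ||
        (tex_size & (tex_size - 1)) != 0)
        return -1;

    int level = 0;
    while ((base_size >> level) > tex_size)
        ++level;
    return level;
}
```

For a 2048-texel array texture, a 512-texel material would start at level 2, so its texture data goes into level 2 and below of its layer, and a value of 2 is stored with the material for the shader to apply as a bias.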

To take it further, we can do away with traditional binding points altogether and start using bindless textures. This functionality is exposed by the GL_ARB_bindless_texture extension, which allows you to effectively use an unlimited number of different textures at a time. The same rules apply with regard to divergent behavior in accesses to bindless textures as to arrays of textures, but otherwise they appear as regular textures to the shader, except that they can be stored in uniform blocks or shader storage blocks and then indexed using per-draw parameters passed with any of the methods discussed so far.

Per-Draw Depth and Blend Functions

Now we’re getting to the tricky part. How do we enable or disable blending in the middle of a render? Well, most rendering engines will sort surfaces into buckets and render transparent materials last. If this is the case, we’re done. Just make two calls to glMultiDrawArraysIndirect, one for opaque surfaces and one for transparent surfaces, and don’t worry about it. The CPU-side cost of glMultiDrawArraysIndirect is about the same as a regular call to glDrawArrays, and you won’t be software bottlenecked here. Another alternative is to use an order-independent transparency technique, such as the per-fragment linked-list approach. Insert all fragments into a list (including the opaque ones, perhaps) and then resolve them in a final pass.

If you really must use blending, unsorted, amongst all your other rendering calls, then just leave blending on! Quite possibly, the most common blending configuration is standard alpha blending:

glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

If you simply enable blending, but ensure that all of your opaque fragments have an alpha of one, then the resulting image will be the same. If you need some other blending configuration, there may be cases where you can output ones, zeros or even other values in either the color or alpha channels which will have the same effect as a non-blended fragment. The cost of ensuring that in your shaders is likely outweighed by the savings from avoiding API-level state changes.
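To see why an alpha of one makes a blended fragment behave like an opaque one, here is a CPU-side sketch of the blend equation. It assumes the configuration in question is glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) with the default additive blend equation; the Color type and function name are illustrative only.

```c
typedef struct {
    float r, g, b, a;
} Color;

/* result = src * src.a + dst * (1 - src.a), applied to all four channels,
 * which is what the assumed blend function computes per fragment. */
static Color blend_src_alpha(Color src, Color dst)
{
    float sa  = src.a;
    float inv = 1.0f - sa;
    Color out;

    out.r = src.r * sa + dst.r * inv;
    out.g = src.g * sa + dst.g * inv;
    out.b = src.b * sa + dst.b * inv;
    out.a = src.a * sa + dst.a * inv;
    return out;
}
```

When src.a is one, the destination term is multiplied by zero and the result is exactly the source color, so opaque surfaces can be drawn with blending left enabled.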

Depth test changes might be a little trickier. In reality, most rendering is done either with the depth test off altogether, or with a single function such as GL_LEQUAL. To exclude certain parts of the scene from depth testing, it may be possible to modify the front-end transformations to push the geometry up against the near plane to ensure that it passes, or to write to gl_FragDepth to achieve a similar effect. If all else fails, break draw sequences into those that need depth testing and those that don’t.

Per-Draw Shaders

Realistically, there isn’t a way to change shaders per draw and keep the draw issue rate anywhere near the maximum possible in hardware. What we can do, though, is to use the übershader technique. The term übershader refers to a shader that encompasses a large number of use cases in one huge chunk of code. Such a shader might have a switch statement to determine its functionality and then branch to perhaps radically different paths. A relatively elegant way of implementing an übershader is to use shader subroutines, which are the GLSL equivalent of function pointers. These can be formed into arrays and different implementations of the subroutine can have entirely different behavior.

As an example, we could declare a fragment shader that looks something like this:

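// Integer fragment inputs must be flat-qualified.
flat in int material_index;

// Signature shared by every material implementation.
subroutine void mainType(void);

subroutine uniform mainType materialFunc[10];

void main(void)
{
    // Call the implementation selected for this draw.
    materialFunc[material_index]();
}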

Next, we add the implementations of main for each of the materials we want to support. The material_index input is supplied by a previous stage (typically the vertex shader stage, but it could be the geometry or tessellation stages). The downside of this technique is that the compiled and linked program may allocate the worst-case resource requirements of all possibly called subroutines for every invocation of the shader. While this does avoid the cost of switching shaders between draws, offloading workload to the GPU, you might want to consider separating draws that require extremely complex shaders from those that require very simple shaders.

Generating Draws

You can see that there are several ways to arrange or organize your code in order to allow draws to be amalgamated. The question now is what to do with this newfound power. Many rendering engines have the concept of a scene graph, where objects in the scene are represented in some form of hierarchy that is traversed by the CPU. As the CPU traverses the scene graph, it may cull objects and send the potentially visible geometry to the renderer to be drawn. To convert this into a sequence of indirect draws, we can map the indirect draw buffer, write the parameters into the buffer as they are generated and, at the end of traversal, unmap the indirect draw buffer and issue the draw. Because there are no API calls during scene traversal, it’s possible to traverse the scene in a separate thread, or even in multiple threads, and not have to worry about multiple OpenGL contexts, object sharing, mutexes or other synchronization overhead.
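A minimal sketch of that traversal might look like the following. Everything here is hypothetical: SceneObject is a flattened stand-in for a scene graph node, and the commands pointer stands in for the address you would get back from mapping the indirect draw buffer. Slots are claimed with a C11 atomic counter so that several threads could each traverse a range of objects without a mutex.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t baseInstance;
} DrawArraysIndirectCommand;

/* Hypothetical scene node: a vertex range plus the result of culling. */
typedef struct {
    uint32_t first_vertex;
    uint32_t vertex_count;
    int      visible;
} SceneObject;

/* Hypothetical draw list wrapping the (mapped) indirect draw buffer. */
typedef struct {
    DrawArraysIndirectCommand *commands;
    size_t                     capacity;
    atomic_size_t              used;
} DrawList;

static void draw_list_init(DrawList *list,
                           DrawArraysIndirectCommand *storage,
                           size_t capacity)
{
    list->commands = storage;
    list->capacity = capacity;
    atomic_init(&list->used, 0);
}

/* Claims one slot atomically and writes a command into it.
 * Returns 0 if the buffer is already full. */
static int draw_list_append(DrawList *list, uint32_t first,
                            uint32_t count, uint32_t draw_index)
{
    size_t slot = atomic_fetch_add(&list->used, 1);
    if (slot >= list->capacity)
        return 0;
    list->commands[slot].count         = count;
    list->commands[slot].instanceCount = 1;
    list->commands[slot].first         = first;
    list->commands[slot].baseInstance  = draw_index;
    return 1;
}

/* Number of valid commands, clamped because failed appends still bump
 * the counter. */
static size_t draw_list_count(DrawList *list)
{
    size_t used = atomic_load(&list->used);
    return used < list->capacity ? used : list->capacity;
}

/* Emits one draw per visible object in [begin, end). Each thread can be
 * handed its own range. */
static void traverse_range(const SceneObject *objects,
                           size_t begin, size_t end, DrawList *list)
{
    for (size_t i = begin; i < end; ++i) {
        if (!objects[i].visible)
            continue;  /* culled */
        draw_list_append(list, objects[i].first_vertex,
                         objects[i].vertex_count, (uint32_t)i);
    }
}
```

Once all ranges have been traversed, the rendering thread would unmap the buffer and pass draw_list_count() as the draw count. With more than one thread the order of the commands is not deterministic, which is fine as long as nothing depends on draw order within the list.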

Speaking of parallelism, there’s really nothing to stop you from keeping the scene graph in a GPU-visible data structure and traversing it using a compute shader, for example. You can even use transform feedback to produce draws into multiple lists. For example, if you partition required render states into four discrete buckets (remember, using the techniques here, you can achieve a lot with a single state configuration) and tag each surface with its state bucket, then traversal and culling can occur in a vertex shader, with a geometry shader used to write the surfaces out to streams representing state buckets.

Once geometry, meshes and individual surfaces have been sorted into a small number of buckets with all of their parameters in separate buffers, we can issue a handful of calls to glMultiDrawElementsIndirect with the required state changes between and get our scene rendered.
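The bucketing step itself is easy to picture on the CPU, even if you end up doing it on the GPU. The sketch below is a plain counting sort that groups commands by a state tag so that each bucket occupies one contiguous range of the output. The Surface and BucketRange types and the choice of four buckets are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { NUM_BUCKETS = 4 };

typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
} DrawElementsIndirectCommand;

/* Hypothetical surface: a draw command tagged with its state bucket. */
typedef struct {
    uint32_t                    bucket;  /* 0 .. NUM_BUCKETS - 1 */
    DrawElementsIndirectCommand cmd;
} Surface;

/* Contiguous range of commands in the output that share one state. */
typedef struct {
    size_t first;
    size_t count;
} BucketRange;

/* Groups commands by bucket, preserving order within each bucket.
 * `out` must have room for num_surfaces commands. Surfaces with an
 * out-of-range bucket are skipped. */
static void bucket_draws(const Surface *surfaces, size_t num_surfaces,
                         DrawElementsIndirectCommand *out,
                         BucketRange ranges[NUM_BUCKETS])
{
    size_t cursor[NUM_BUCKETS];
    size_t next = 0;

    /* Pass 1: count the surfaces in each bucket. */
    for (size_t b = 0; b < NUM_BUCKETS; ++b)
        ranges[b].count = 0;
    for (size_t i = 0; i < num_surfaces; ++i)
        if (surfaces[i].bucket < NUM_BUCKETS)
            ranges[surfaces[i].bucket].count++;

    /* Prefix sum: where each bucket starts in the output. */
    for (size_t b = 0; b < NUM_BUCKETS; ++b) {
        ranges[b].first = next;
        cursor[b]       = next;
        next           += ranges[b].count;
    }

    /* Pass 2: scatter each command into its bucket's range. */
    for (size_t i = 0; i < num_surfaces; ++i) {
        uint32_t b = surfaces[i].bucket;
        if (b < NUM_BUCKETS)
            out[cursor[b]++] = surfaces[i].cmd;
    }
}
```

Each non-empty range then maps to one call to glMultiDrawElementsIndirect, using ranges[b].first * sizeof(DrawElementsIndirectCommand) as the byte offset into the indirect buffer and ranges[b].count as the draw count, with the state for bucket b set beforehand.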

Vertex Data Management

One issue with this technique is that all of the required vertex data must be available and ready to render for all of the indirect draws. If you’re using traditional vertex attributes, the vertex formats for the draws must be the same too. If you can live with that limitation, then great — your problems end here. However, if you do happen to have different formats of vertex data per draw, or different sets of vertex data, there are a couple of workarounds.

First, you could declare two or more versions of the input data in your vertex shader. OpenGL supports enough vertex attributes that it may be possible to maintain two or more sets of vertex attribute bindings. For any given draw, only one will contain meaningful data, and your vertex shader can choose, per-draw, which set to process and forward on. Of course, the downside to this approach is that you’ll end up fetching more data than necessary in your vertex shader.

The second option is to forgo traditional vertex attributes and use shader storage buffers instead. Your shader is supplied with gl_VertexID and gl_InstanceID, and with gl_BaseVertexARB and gl_BaseInstanceARB if you’re using the GL_ARB_shader_draw_parameters extension. Rather than using fixed-function vertex attributes, simply put all of your vertex data into buffer objects (interleaved) and fetch from them using shader storage blocks in the vertex shader. The vertex shader can choose, per-draw, which buffers to read from. Per-draw conditions are considered dynamically uniform, so branching performance shouldn’t be an issue. You won’t have access to some of the more obscure vertex formats without some bit-packing code in your vertex shader, but otherwise, everything should behave as normal. You can even fetch your vertex attributes using a shader subroutine.

Having said all this, your indices do need to be in one buffer for an entire sequence of draws handled by a call to glMultiDrawElementsIndirect.

One approach to achieve this is to simply allocate one or two huge buffers up-front, and then use your own memory allocator to allocate space in them. The allocator manages offsets into those buffers, and OpenGL only ever sees those two bindings. You probably only want to do this with static buffers (those that are read-only from the GPU’s point of view). However, even in the absence of the indirect draw features we’ve been discussing, this can seriously reduce the number of buffer object switches in a frame, improving performance.
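The simplest allocator that fits this description is a linear (bump) allocator that only hands out offsets. This is a sketch under that assumption, with made-up names; a real engine would probably want a free list as well so that space can be recycled.

```c
#include <stddef.h>

#define ARENA_ALLOC_FAILED ((size_t)-1)

/* Hypothetical allocator state for one big buffer object. It tracks
 * offsets only; the storage itself lives in the OpenGL buffer. */
typedef struct {
    size_t capacity;  /* size of the buffer in bytes */
    size_t offset;    /* first unused byte           */
} BufferArena;

/* Reserves `size` bytes aligned to `alignment` (a power of two) and
 * returns the byte offset of the reservation, or ARENA_ALLOC_FAILED if
 * the buffer is full. A failed call leaves the arena unchanged. */
static size_t arena_alloc(BufferArena *arena, size_t size, size_t alignment)
{
    size_t start = (arena->offset + alignment - 1) & ~(alignment - 1);

    if (start > arena->capacity || size > arena->capacity - start)
        return ARENA_ALLOC_FAILED;

    arena->offset = start + size;
    return start;
}
```

The returned offset, divided by the vertex or index size, is what would end up in the first, firstIndex or baseVertex member of a draw command, so every mesh can live in the same pair of buffers.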

In Summary

The goal here is not necessarily to reduce our entire scene to a single draw command. While that would be nice, it’s not really feasible today. The real goal is to reduce the load on the CPU to negligible levels and to ensure that the bottleneck in our application is not state changes and drawing commands. Ideally, we’ll be limited by shader complexity, bandwidth requirements, or vertex or geometry processing rates. A few hundred draw commands per frame, with a handful of state changes between each, is considered pretty light from a CPU performance point of view. Now, if each of those drawing commands is a call to glMultiDrawElementsIndirect and each of those ends up producing thousands of individual draws, then we’re going to approach our goal of a million draws per frame fairly rapidly.

Imagine a scenario where every leaf on a tree could be a separate, independent draw, or where every soldier in an army could be unique but rendered with a single OpenGL command. That is what we’re trying to achieve here and is a fairly attainable goal with a modest investment of effort. Future GPU and API features will allow us to push more and more of scene graph management, state change processing and high level decision making to the GPU, improving performance further.


  1. … which I’m not at liberty to discuss

5 responses to The Road to One Million Draws

  1. Marc St-Jacques

    I just received my copy of the book and I’m really excited about it. I learned OpenGL with the first edition years ago. After dropping out of any graphics experimentation for many years, I’m glad to start over with a fresh book from a series that’s still high on quality yet not-too-high on the learning curve. Good work and good luck.

  2. Donatas

    I could not find any way to contact you, so I am writing here.
    I have downloaded but I am unable to compile it. I get a lot of undefined reference to ‘glfw******’ functions and to ‘dl***’ functions.
    I have too. And after a few bug fixes I was able to compile it. So I have all the libraries I need.
    Ubuntu 13.10 64bit
    GeForce GTX 470
    NVIDIA 319.6

    • Donatas

      I was able to build all the samples by modifying the CMakeLists.txt file and deleting a lot of new lines. But I found that the textures and objects folders are missing, so most of the examples do not work. After copying these folders from an older version of sb6code, it works.

      • That’s correct. The source code and media files (textures + objects) are separate downloads. Unpack the source and then unpack the media archive into the media directory.

  3. Johan Kjölhede

    Just wanted to say thanks for this wonderful article :).
    Your tips helped me achieve >20M draws/s on the JVM! (in scala)

    This to me proves that the JVMOpenGL is fast enough for more demanding applications – now I can code my application’s graphics with super naive code without any manual batching :) (The renderer auto-batches to achieve the performance). This also let me make my application’s graphics completely renderer agnostic, so I can swap it out for any other GL- or even non-GL backend!

    Also I recommend this for everyone to read: