OpenGL Origins

The history of OpenGL is quite well covered. This post is not about the past, however; it’s about the future. How do features make it into OpenGL? Who decides what the functions should be called? What goes in the core specification and what remains an extension? In this post, I’ll talk about Khronos, the standards body that produces the OpenGL specification: its members, the process of creating and publishing a specification, and how you, as a user, can contribute.

Khronos Group

The Khronos Group is an open consortium of over 100 companies that have an interest in graphics and compute APIs. Along with OpenGL and OpenGL ES, it also manages OpenCL, OpenVG, EGL, WebGL, COLLADA and a number of other media, sensor and vision-centric standards. There are several tiers of membership. The highest level of membership is known as a Promoter member, of which there are currently 12: AMD, Apple, ARM, Epic Games, Imagination Technologies, Intel, Nokia, NVIDIA, Qualcomm, Samsung, Sony and Vivante. Promoter members hold a seat on the board of directors at Khronos. This board has final sign-off on new specifications, and determines things such as funding for working groups and the general direction of Khronos itself.

The next level of membership is the Contributor member. The complete list of current contributors is available here. Contributor members participate in the working groups which define the standards and can vote on the features that might make it into those standards. There are many more Contributors than Promoters, but both types of member get a single working group vote. There is also a level of membership specifically for academic institutions (the Academic member), which is substantially less costly, but doesn’t come with any voting rights. However, academic members do get early access to draft specifications, are welcome to join conference calls and face-to-face meetings, and to provide feedback and advice.

In addition to marketing efforts and an online presence, Khronos organizes educational events, in-person conferences and social events. The Khronos members get together several times a year to discuss new technologies, progress specification development and build working relationships with each other.

Khronos Booth at SIGGRAPH 2013


OpenGL Features

OpenGL is overseen by the OpenGL working group at Khronos. This group is sometimes known as the ARB, or the Architecture Review Board, which is the name by which it was known when the specification was managed by SGI. Over time, hardware vendors (generally companies that implement OpenGL) ship new features as extensions. On occasion, they will work with each other or with software developers or other groups that don’t necessarily implement OpenGL to define those extensions and ensure they’re useful and well written. In addition to multi-vendor extensions, the hardware vendors may also ship each other’s extensions.

Roughly once a year (usually after a big specification update), the OpenGL working group gets together and puts existing extensions and new feature ideas together in a candidate list for the next version of OpenGL. In addition to shipping extensions from various implementations, these ideas include features that users, developers and commentators may have suggested. At this point in the process, the group straw polls its members to discover the level of motivation to push each feature idea forward, and to find volunteers to write the specification language and accompanying extension document and to be the owner for bug reports and other issues related to it.

Once volunteers have been identified to own particular features, each starts writing the specifications, creating function prototypes and so on. Each new feature is drafted as an extension at first. If the feature is entirely new, the extension document is written from scratch. If it is the promotion of an existing single- or multi-vendor extension, then the group gathers feedback on that version of the extension and updates are incorporated for the candidate for the OpenGL core. At this point, the features are still somewhat in flux and might get pulled if they’re not moving in the right direction.

Once the feature specifications are in reasonable shape, the specification editor (the person that maintains the actual OpenGL specification) will start making edits to the OpenGL specification itself, folding the new features into the core. Eventually, the extensions and features are fully fleshed out, folded into the main specifications and ready to go. At this point, the eligible members of the working group vote to submit the final specification (along with the associated extension documents) for membership review as a potential new version of the specification.

Once submitted, the members have a defined period of time to conduct an IP (Intellectual Property) review. This is a time period that all member companies have to review the specification, and to raise any objections to folding specific technologies into an open standard. Khronos has a strict reciprocal licensing agreement that states that if you don’t withdraw a patent you own from a Khronos specification as ratified, you’re not allowed to then turn around and sue any other member over any un-withdrawn patent you own for a conformant implementation of that specification. In return, all other members agree to not use their IP portfolio to sue you for implementing the specification either. This is what allows OpenGL to remain open and royalty free. During the IP review period, member companies may notify Khronos that they are withdrawing specific patents from the reciprocal license. If this happens, the other members review the patent claims and decide whether to include the feature or not. If there is concern, the feature gets pulled.

Once the specification has cleared IP review, the Khronos board of directors votes to ratify it, which makes it an official Khronos specification which is ready for release. For the past few years, this has happened in the run-up to the SIGGRAPH conference where the existence of the specification has been announced. However, Khronos has no official ties with SIGGRAPH, and may release a new specification at any time.

Contributing to OpenGL

So, how can you contribute to OpenGL? Well, if you’re an employee of an existing member company or institution, step up, find out who your Khronos representative is and find out if you can join in. If you’re an employee of a company that has an interest in graphics, compute and multimedia technology, that isn’t already a member, see if your company can join. The same goes if you’re affiliated with an academic institution — find out if your institution will join Khronos. If none of that applies to you, start contacting people. You can get in touch with Khronos on Twitter (@thekhronosgroup), file bugs on the public Khronos Bugzilla, visit the specification feedback forums at opengl.org, or get in touch with individual members. Most of the extension specifications have contact names at the top. If you have a question about a feature or a suggestion for improvement, mail the person responsible for that feature. You can also meet Khronos members and other people interested in the standards at various social events organized by Khronos. Upcoming Khronos events are listed at the Khronos events page, and include Meetups, Birds of a Feather events, and educational sessions known as DevU.

OpenGL is an open standard. It evolves through the contributions of member companies, academic researchers, software developers and interested individuals. The more that you contribute, the stronger it becomes. Get involved!

Khronos, OpenGL

Memory Bandwidth and Vertices

Memory bandwidth is a precious commodity. As far as graphics cards are concerned, this is the rate at which the GPU can transfer data to or from memory and is measured in bytes (or more likely, gigabytes) per second. The typical bandwidth of modern graphics hardware can range anywhere from 20 GB/s for integrated GPUs to over 300 GB/s for enthusiast products. However, with add-in boards, data must cross the PCI-Express bus (the connector that the board plugs into), and its bandwidth is typically around 6 GB/s. Memory bandwidth affects fill rate and texture rate, and these are often quoted as performance figures when rating GPUs. One thing that is regularly overlooked, though, is the bandwidth consumed by vertex data.

Vertex Rates

Most modern GPUs can process at least one vertex per clock cycle. GPUs are shipping from multiple vendors that can process two, three, four and even five vertices on each and every clock cycle. The core clock frequency of these high-end GPUs hovers at around the 1 GHz mark, which means that some of these beasts can process three or four billion vertices every second. Just to put that in perspective, if you had a single point represented by one vertex for every pixel on a 1080p display, you’d be able to fill it at almost 2000 frames per second. As vertex rates start getting this high, we need to consider the amount of memory bandwidth required to fill the inputs to the vertex shader.

Let’s assume a simple indexed draw (the kind produced by glDrawElements) with only a 3-element floating-point position vector per vertex. With 32-bit indices, that’s 4 bytes for each index and 12 bytes for each position vector (3 elements of 4 bytes each), making 16 bytes per vertex. Assuming an entry-level GPU with a vertex rate of one vertex per clock and a core clock of 800 MHz, the amount of memory bandwidth required works out to 12.8 GB/s (16 * 800 * 10^6). That’s roughly twice the available PCI-Express bandwidth, almost half of a typical CPU’s system memory bandwidth, and likely a measurable percentage of our hypothetical entry-level GPU’s memory bandwidth. Strategies such as vertex reuse can reduce the burden somewhat, but it remains considerable.
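The arithmetic above can be written down as a small sketch. The function name is made up for illustration, and the model assumes the worst case: every vertex is fetched from memory with no reuse or caching.

```python
def vertex_bandwidth_gb_per_s(bytes_per_vertex, vertices_per_clock, clock_hz):
    """Bytes per second needed to feed the vertex shader, in GB/s (10^9 bytes).

    Worst-case model: assumes no vertex reuse and no cache hits.
    """
    return bytes_per_vertex * vertices_per_clock * clock_hz / 1e9


INDEX_BYTES = 4          # one 32-bit index
POSITION_BYTES = 3 * 4   # three 32-bit floats

# Entry-level GPU from the text: one vertex per clock at 800 MHz.
entry_level = vertex_bandwidth_gb_per_s(INDEX_BYTES + POSITION_BYTES, 1, 800e6)
print(f"{entry_level:.1f} GB/s")
```

Plugging in other vertex sizes or clock rates gives a quick upper bound before any profiling is done.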

High Memory Bandwidth

Now, let’s scale this up to something a little more substantial. Our new hypothetical GPU runs at 1 GHz and processes four vertices per clock cycle. Again, we’re using a 32-bit index and three 32-bit components for our position vector. However, now we add a normal vector (another 3 floating-point values, or 12 bytes per vertex), a tangent vector (3 floats again), and a single texture coordinate (2 floats). In total, we have 48 bytes per vertex (4 + 12 + 12 + 12 + 8), and 4 billion vertices per second (4 vertices per clock at 1 GHz). That’s 192 GB/s of vertex data (48 * 4 * 10^9): roughly 32 times the PCI-Express bandwidth and several times the typical CPU’s system memory bandwidth (the kind of rate you’d get from memcpy). Such a GPU might have a memory bandwidth of around 320 GB/s, so that kind of vertex rate would consume about 60% of the GPU’s total memory bandwidth.
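To see where the bytes go, here is a rough sketch that itemizes the fatter vertex layout and compares the resulting traffic against the PCI-Express and GPU memory figures quoted in this post. The dictionary keys and constant names are illustrative only.

```python
# Per-vertex layout from the text; sizes in bytes.
layout = {
    "index": 4,          # one 32-bit index
    "position": 3 * 4,   # three 32-bit floats
    "normal": 3 * 4,
    "tangent": 3 * 4,
    "texcoord": 2 * 4,   # two 32-bit floats
}

VERTICES_PER_SECOND = 4 * 1e9   # 4 vertices per clock at 1 GHz
GPU_BANDWIDTH_GB_S = 320.0      # hypothetical high-end GPU from the text
PCIE_BANDWIDTH_GB_S = 6.0       # typical PCI-Express figure from the text

bytes_per_vertex = sum(layout.values())
vertex_gb_s = bytes_per_vertex * VERTICES_PER_SECOND / 1e9
share_of_gpu = vertex_gb_s / GPU_BANDWIDTH_GB_S
multiple_of_pcie = vertex_gb_s / PCIE_BANDWIDTH_GB_S

print(f"{bytes_per_vertex} bytes/vertex, {vertex_gb_s:.0f} GB/s")
print(f"{share_of_gpu:.0%} of GPU memory bandwidth, "
      f"{multiple_of_pcie:.0f}x PCI-Express")
```

Removing or shrinking an entry in the layout shows directly how much bandwidth each attribute costs.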

Clearly, if we blast the graphics pipeline with enough raw vertex data, we’re going to be sacrificing a substantial proportion of our memory bandwidth to vertex data. So, what should we do about this?

Optimization

Well, first, we should evaluate whether such a ridiculous amount of vertex data is really necessary for our application. Again, if each vertex produces a single pixel point, 4 vertices per clock at 1 GHz is enough to fill a 1080p screen at 2000 frames per second. Even at 4K (3840 * 2160), that’s still enough single points to produce 480 frames per second. You don’t need that. Really, you don’t. However, if you decide that yes, actually, you do, then there are a few tricks you can use to mitigate this.
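As a sanity check on those frame rates, a minimal sketch (the helper name is hypothetical) divides the vertex rate by the number of pixels on screen, assuming exactly one single-pixel point per vertex:

```python
def point_fill_fps(vertices_per_second, width, height):
    """Frames per second if every pixel is covered by one single-vertex point."""
    return vertices_per_second / (width * height)


VERTEX_RATE = 4 * 1e9  # 4 vertices per clock at 1 GHz

fps_1080p = point_fill_fps(VERTEX_RATE, 1920, 1080)
fps_4k = point_fill_fps(VERTEX_RATE, 3840, 2160)
print(f"1080p: {fps_1080p:.0f} fps, 4K: {fps_4k:.0f} fps")
```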

  • Use indexed triangles (either GL_TRIANGLE_STRIP or GL_TRIANGLES). If you do use strips, keep them as long as possible and try to avoid using restart indices, as this can hurt parallelization.
  • If you’re using independent triangles (not strips), try to order your vertices such that the same vertex is referenced several times in short succession if it is shared by more than one triangle. There are tools that will re-order vertices in your mesh to make better use of GPU caches.
  • Pick vertex data formats that make sense for you. Full 32-bit floating point is often overkill. 16-bit unsigned normalized texture coordinates will likely work great assuming all of your texture coordinates are between 0.0 and 1.0, for example. Packed data formats also work really well. You might choose GL_INT_2_10_10_10_REV for normal and tangent vectors, for example.
  • If your renderer has a depth only pass, disable any vertex attributes that don’t contribute to the vertices’ position.
  • Use features such as tessellation to amplify geometry, or techniques such as parallax occlusion mapping to give the appearance of higher geometric detail.
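To make the vertex format suggestion concrete, here is a rough sketch of packing the earlier 44-byte set of attributes (position, normal, tangent, texture coordinate) into 24 bytes. The helper names are hypothetical. The bit layout puts x in the low 10 bits and w in the top 2 bits, as GL_INT_2_10_10_10_REV does, and the float-to-integer rule is the symmetric signed-normalized mapping; check the specification for the OpenGL version you target before relying on the exact rounding.

```python
import struct


def snorm10(f):
    """Float in [-1, 1] to a 10-bit two's complement signed-normalized value."""
    f = max(-1.0, min(1.0, f))
    return int(round(f * 511.0)) & 0x3FF


def pack_int_2_10_10_10_rev(x, y, z, w=0):
    """Pack x, y, z (10 bits each, x lowest) and a raw 2-bit w into 32 bits."""
    return snorm10(x) | (snorm10(y) << 10) | (snorm10(z) << 20) | ((w & 0x3) << 30)


def unorm16(f):
    """Float in [0, 1] to a 16-bit unsigned normalized value."""
    f = max(0.0, min(1.0, f))
    return int(round(f * 65535.0))


def pack_vertex(position, normal, tangent, texcoord):
    """Full-float position, packed normal and tangent, 16-bit texcoords."""
    return struct.pack(
        "<3fII2H",
        *position,
        pack_int_2_10_10_10_rev(*normal),
        pack_int_2_10_10_10_rev(*tangent),
        unorm16(texcoord[0]),
        unorm16(texcoord[1]),
    )


vertex = pack_vertex((1.0, 2.0, 3.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.25, 0.75))
print(len(vertex), "bytes per vertex instead of 44")
```

Position stays at full precision here; whether it can also be reduced depends on the size and scale of the mesh.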

In the limit, you can forgo fixed-function vertex attributes, using only position, for example. Then, if you can determine that for a particular vertex you only need a subset of the vertex attributes, fetch them explicitly from shader storage buffers. In particular, if you are using tessellation or geometry shaders, you have access to an entire primitive in one shader stage. There, you can perform simple culling (project the bounding box of the primitive into screen space) using only the position of each vertex and then only fetch the rest of the vertex data for vertices that are part of a primitive that will contribute to the final scene.

Summary

Don’t underestimate the amount of pressure that vertex data can put on the memory subsystem of your GPU. Fat vertex attributes coupled with high geometric density can quickly eat up a double-digit percentage of your graphics card’s memory bandwidth. Dynamic vertex data that is modified regularly by the CPU can burn through available PCI-Express bandwidth. Optimizing for vertex caches can be a big win with highly detailed geometry. While we often worry about shader complexity, fill rates, texture sizes and bandwidth, we often overlook the cost of vertex processing. Bandwidth is not free.
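The effect of vertex ordering on cache behavior can be estimated offline. The sketch below is a toy model, not a description of any real GPU: it assumes a 16-entry FIFO post-transform cache and counts cache misses per triangle for a grid mesh, once with triangles in row order and once in shuffled order. All function names are made up for illustration.

```python
import random
from collections import deque


def cache_misses_per_triangle(indices, cache_size=16):
    """Average misses per triangle for a FIFO cache (size is an assumption)."""
    cache = deque(maxlen=cache_size)  # appending to a full deque drops the oldest
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)
    return misses / (len(indices) // 3)


def grid_triangles(quads_per_side):
    """Two triangles per quad on a regular grid, emitted row by row."""
    stride = quads_per_side + 1
    triangles = []
    for row in range(quads_per_side):
        for col in range(quads_per_side):
            a = row * stride + col
            b, c, d = a + 1, a + stride, a + stride + 1
            triangles.append((a, b, c))
            triangles.append((b, d, c))
    return triangles


def flatten(triangles):
    return [index for triangle in triangles for index in triangle]


ordered = grid_triangles(32)
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

acmr_ordered = cache_misses_per_triangle(flatten(ordered))
acmr_shuffled = cache_misses_per_triangle(flatten(shuffled))
print(f"row order: {acmr_ordered:.2f}, shuffled: {acmr_shuffled:.2f} misses/triangle")
```

Every miss is a vertex that has to be fetched and shaded again, so a lower number translates directly into less vertex bandwidth.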

OpenGL, Optimization

Vertex Array Performance

There have been some questions recently about the performance benefits of using vertex array objects (VAOs). Vertex array objects are part of the core OpenGL specification and you must have one bound in order to draw anything. This change to the specification prompted quite a bit of complaining from fans of legacy OpenGL, and some of the experts in the field have even claimed that using vertex array objects is slower than just using traditional vertex attributes on all implementations. I was fairly certain that this wasn’t true, so I decided to construct a benchmark to test that claim.

OpenGL, Optimization

Porting Samples to Mac

On October 22, 2013, Apple released OS X Mavericks, also known as OS X version 10.9. This version of the operating system included a long awaited update to the supported version of OpenGL. The 6th edition of the OpenGL SuperBible is about OpenGL version 4.3, and unfortunately, Apple’s latest and greatest only supports version 4.1 of the API. As OpenGL 4.1 was released on July 26, 2010, this puts OS X more than three years behind. However, not all of the book’s samples make use of all of the latest features, and it’s possible to run many of them on version 4.1 of the API. I’ve ported what I can.


Mac, OpenGL

Voodoo Registers – Part 3

In the previous post on hacking on the Voodoo Registers, I got as far as rendering our first triangle. It was a single, flat-shaded triangle and was pretty uninteresting. In this post we’ll look at how the Voodoo Graphics chipset handles linear interpolation, which is necessary for smooth shading, and ultimately texture mapping. We’ll also see the first signs of divergence between the operation of the Voodoo hardware and how the Glide API exposed it.

Hardware, Retro

The Road to One Million Draws

Recently I posted a video on YouTube and tweeted the link. It is a capture of the multidrawindirect sample from our example code package. It’s a pretty simple example that just puts a whole bunch of draws in one big buffer and blasts them at OpenGL with a single call to glMultiDrawArraysIndirect. The application achieves a rate of roughly 3.5 million independent draws per second and is limited by the ability of the tested hardware (an AMD Radeon HD 7970) to process vertices. By making each draw much smaller, I’ve seen rates of consumption of 20 to 40 million draws per second, depending on the underlying hardware. It seems that we’re not that far off from being able to push through one million draws per frame at a steady 60 Hz.
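As a rough illustration of what "a whole bunch of draws in one big buffer" means, the sketch below packs indirect commands for array draws into a byte string of the kind that would be uploaded to an indirect draw buffer, and does the draws-per-frame arithmetic. It is not the sample's actual code. The four unsigned fields (count, instanceCount, first, baseInstance) follow the indirect command layout for array draws; the function names and per-draw values are made up.

```python
import struct

# One indirect command for array draws:
# count, instanceCount, first, baseInstance (four 32-bit unsigned integers).
COMMAND = struct.Struct("<4I")


def pack_draw_commands(draws):
    """draws: iterable of (first_vertex, vertex_count) pairs (hypothetical input)."""
    return b"".join(
        COMMAND.pack(count, 1, first, draw_id)  # one instance, draw id as base
        for draw_id, (first, count) in enumerate(draws)
    )


def draws_per_frame(draws_per_second, frames_per_second=60):
    return draws_per_second / frames_per_second


# 1000 tiny draws of one triangle each.
command_buffer = pack_draw_commands((i * 3, 3) for i in range(1000))
print(len(command_buffer), "bytes of commands")
print(f"{draws_per_frame(3.5e6):.0f} draws per frame at 3.5 million draws/s")
print(f"{1_000_000 * 60 / 1e6:.0f} million draws/s needed for one million per frame")
```

At 16 bytes per command, one million draws per frame needs about 16 MB of command data, which is small next to the vertex traffic those draws generate.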

OpenGL, Optimization

Voodoo Registers – Part 2

In the first part of this series, I explained how to install the sstfb framebuffer device driver on Linux, query the base address of the 3Dfx Voodoo chipset and map it into a user-mode process. Then I set about writing directly to the Voodoo registers I’d mapped and got as far as issuing a single accelerated color buffer clear. In this post, we set about drawing our first triangle.

Hardware, Retro

Voodoo Registers – Part 1

I was sucked into graphics programming by demoscene productions of the late eighties and early nineties. Back then it was all about software rendering, copper lists, mode X and other neat tricks. Later, I was exposed to hardware acceleration and OpenGL on high powered graphics workstations during my time at university. These things were massively expensive and I’d have to make trips to the graphics labs to test my programs. Owning a machine capable of accelerated 3D graphics was not realistic. Until, that is, 3Dfx started producing consumer grade hardware that slashed the cost of (basic) 3D hardware from tens of thousands of dollars to a few hundred. I rushed out and got my first ‘real’ graphics card — an Orchid Righteous 3D, which was based on the 3Dfx Voodoo 1 chipset. I still have it.

Hardware, Retro

Order Independent Transparency

This post is about order independent transparency, what it is and why it matters. Wikipedia has a stub on order independent transparency, which briefly explains what it is. Is order independent transparency necessary? Yes, sometimes, but it depends on the function used to simulate transparency effects. The most often used function is probably the one that is commonly known as alpha blending, where each pixel has an associated alpha value, which is used as its opacity. The problem is, order matters.
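A tiny sketch shows why order matters for alpha blending. It assumes the usual "over" formulation, result = alpha * source + (1 - alpha) * destination, and blends the same two half-transparent surfaces onto a black background in both orders. The names are illustrative.

```python
def blend_over(src_rgb, src_alpha, dst_rgb):
    """Classic alpha blend: src_alpha * src + (1 - src_alpha) * dst, per channel."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))


background = (0.0, 0.0, 0.0)
red = ((1.0, 0.0, 0.0), 0.5)    # (color, opacity)
blue = ((0.0, 0.0, 1.0), 0.5)

# Draw red first and blue on top, then the other way around.
red_then_blue = blend_over(*blue, blend_over(*red, background))
blue_then_red = blend_over(*red, blend_over(*blue, background))

print(red_then_blue)   # (0.25, 0.0, 0.5)
print(blue_then_red)   # (0.5, 0.0, 0.25)
```

The two results differ, so the final pixel color depends on the order in which the transparent surfaces were drawn.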


OpenGL

OpenGL SuperBible is Shipping

It’s been a busy month! We had the run-up to SIGGRAPH, then the conference itself. I saw the book in print for the first time on the SIGGRAPH bookstore shelf, front and center. Copies were shipped to the authors and reviewers during the week. Many pre-orders also went out in the last couple of weeks and many of you have received your copies. Hopefully, you’re enjoying the book so far. We’ve posted the example source code which, admittedly, should have been online a while back. If you haven’t got a copy yet, you can get it online at InformIT, Amazon and many other fine retailers. There are also e-book editions available.


OpenGL SuperBible 6th Edition

Announcements