Name

    ARB_compute_shader

Name Strings

    GL_ARB_compute_shader

Contact

    Graham Sellers, AMD (graham.sellers 'at' amd.com)

Contributors

    Pat Brown, NVIDIA
    Daniel Koch, TransGaming
    John Kessenich
    Members of the ARB working group

Notice

    Copyright (c) 2012-2014 The Khronos Group Inc. Copyright terms at
        http://www.khronos.org/registry/speccopyright.html

Specification Update Policy

    Khronos-approved extension specifications are updated in response to issues and bugs prioritized by the Khronos OpenGL Working Group. For extensions which have been promoted to a core Specification, fixes will first appear in the latest version of that core Specification, and will eventually be backported to the extension document. This policy is described in more detail at
        https://www.khronos.org/registry/OpenGL/docs/update_policy.php

Status

    Complete. Approved by the ARB on 2012/06/12.

Version

    Last Modified Date: December 10, 2018
    Revision: 28

Number

    ARB Extension #122

Dependencies

    OpenGL 4.2 is required.

    This extension is written based on the wording of the OpenGL 4.2 (Core Profile) specification, and on the wording of the OpenGL Shading Language (GLSL) Specification, version 4.20.

    This extension interacts with OpenGL 4.3 and ARB_shader_storage_buffer_object.

    This extension interacts with NV_vertex_buffer_unified_memory.

Overview

    Recent graphics hardware has become extremely powerful and a strong desire to harness this power for work (both graphics and non-graphics) that does not fit the traditional graphics pipeline well has emerged. To address this, this extension adds a new single-stage program type known as a compute program. This program may contain one or more compute shaders which may be launched in a manner that is essentially stateless. This allows arbitrary workloads to be sent to the graphics hardware with minimal disturbance to the GL state machine.

    In most respects, a compute program is identical to a traditional OpenGL program object, with similar status, uniforms, and other such properties. It has access to many of the same resources as fragment and other shader types, such as textures, image variables, atomic counters, and so on. However, it has no predefined inputs nor any fixed-function outputs. It cannot be part of a pipeline and its visible side effects are through its actions on images and atomic counters.

    OpenCL is another solution for using graphics processors as generalized compute devices. This extension addresses a different need. For example, OpenCL is designed to be usable on a wide range of devices ranging from CPUs, GPUs, and DSPs through to FPGAs. While one could implement GL on these types of devices, the target here is clearly GPUs. Another difference is that OpenCL is more full featured and includes features such as multiple devices, asynchronous queues and strict IEEE semantics for floating point operations. This extension follows the semantics of OpenGL - implicitly synchronous, in-order operation with single-device, single queue logical architecture and somewhat more relaxed numerical precision requirements. Although not as feature rich, this extension offers several advantages for applications that can tolerate the omission of these features. Compute shaders are written in GLSL, for example, and so code may be shared between compute and other shader types. Objects are created and owned by the same context as the rest of the GL, and therefore no interoperability API is required and objects may be freely used by both compute and graphics simultaneously without acquire-release semantics or object type translation.

New Procedures and Functions

    void DispatchCompute(uint num_groups_x,
                         uint num_groups_y,
                         uint num_groups_z);

    void DispatchComputeIndirect(intptr indirect);

New Tokens

    Accepted by the <type> parameter of CreateShader and returned in the <params> parameter by GetShaderiv:

        COMPUTE_SHADER                                        0x91B9

    Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev and GetInteger64v:

        MAX_COMPUTE_UNIFORM_BLOCKS                            0x91BB
        MAX_COMPUTE_TEXTURE_IMAGE_UNITS                       0x91BC
        MAX_COMPUTE_IMAGE_UNIFORMS                            0x91BD
        MAX_COMPUTE_SHARED_MEMORY_SIZE                        0x8262
        MAX_COMPUTE_UNIFORM_COMPONENTS                        0x8263
        MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS                    0x8264
        MAX_COMPUTE_ATOMIC_COUNTERS                           0x8265
        MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS               0x8266
        MAX_COMPUTE_WORK_GROUP_INVOCATIONS                    0x90EB

    Accepted by the <pname> parameter of GetIntegeri_v, GetBooleani_v, GetFloati_v, GetDoublei_v and GetInteger64i_v:

        MAX_COMPUTE_WORK_GROUP_COUNT                          0x91BE
        MAX_COMPUTE_WORK_GROUP_SIZE                           0x91BF

    Accepted by the <pname> parameter of GetProgramiv:

        COMPUTE_WORK_GROUP_SIZE                               0x8267

    Accepted by the <pname> parameter of GetActiveUniformBlockiv:

        UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER            0x90EC

    Accepted by the <pname> parameter of GetActiveAtomicCounterBufferiv:

        ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER    0x90ED

    Accepted by the <target> parameters of BindBuffer, BufferData, BufferSubData, MapBuffer, UnmapBuffer, GetBufferSubData, and GetBufferPointerv:

        DISPATCH_INDIRECT_BUFFER                              0x90EE

    Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetInteger64v, GetFloatv, and GetDoublev:

        DISPATCH_INDIRECT_BUFFER_BINDING                      0x90EF

    Accepted by the <stages> parameter of UseProgramStages:

        COMPUTE_SHADER_BIT                                    0x00000020

Additions to Chapter 2 of the OpenGL 4.2 (Core Profile) Specification (OpenGL Operation)

    In section 2.9.1, "Creating and Binding Buffer Objects", add to table 2.8 (p.43):

                                                                Described
        Target name                 Purpose                     in section(s)
        ------------------------    -------------------------   ---------------
        DISPATCH_INDIRECT_BUFFER    Indirect compute dispatch   5.5
                                    commands

    Add to the end of section 2.9.8, "Indirect Commands In Buffer Objects" (p. 53):

    Arguments to the DispatchComputeIndirect command are stored in buffer objects as a group of three unsigned integers.

    A buffer object is bound to DISPATCH_INDIRECT_BUFFER by calling BindBuffer with target set to DISPATCH_INDIRECT_BUFFER, and buffer set to the name of the buffer object. If no corresponding buffer object exists, one is initialized as defined in section 2.9.

    DispatchComputeIndirect sources its arguments from the buffer object whose name is bound to DISPATCH_INDIRECT_BUFFER, using the <indirect> parameter as an offset into the buffer object in the same fashion as described in section 2.9.6. An INVALID_OPERATION error is generated if this command sources data beyond the end of the buffer object, if zero is bound to DISPATCH_INDIRECT_BUFFER, or if <indirect> is less than zero or not a multiple of the size, in basic machine units, of uint.

    In section 2.11, "Vertex Shaders", modify the introductory text on shaders to include compute shaders (second paragraph, p. 56):

    In addition to vertex shaders, tessellation control..., geometry shaders, fragment shaders, and compute shaders can be created, compiled, and linked into program objects. .... (section 3.10). Compute shaders perform general computations for dispatched arrays of shader invocations (section 5.5), but do not operate on primitives processed by the other shader types. ...

    In section 2.11.3, "Program Objects", add to the reasons that LinkProgram may fail, p. 61:

      * The program object contains objects to form a compute shader (see section 5.5) and objects to form any other type of shader.

    In section 2.11.3, modify the description of active programs (last paragraph, p. 61, first paragraph, p. 62):

    ... geometry shader stages, those stages are ignored. If there is no active program for the compute shader stage, compute dispatches will generate an error.
    The active program for the compute shader stage has no effect on the processing of vertices, geometric primitives, and fragments, and the active program for all other shader stages has no effect on compute dispatches.

    In section 2.11.4, "Program Pipeline Objects", modify the description of UseProgramStages, p. 65:

    The executables in a program object... becomes current. These stages may include vertex, tessellation control, tessellation evaluation, geometry, fragment, or compute, indicated by VERTEX_SHADER_BIT, TESS_CONTROL_SHADER_BIT, TESS_EVALUATION_SHADER_BIT, GEOMETRY_SHADER_BIT, FRAGMENT_SHADER_BIT, or COMPUTE_SHADER_BIT, respectively. ...

    In the unnumbered "Validation" section of section 2.11.12 "Shader Execution", modify the list of validation errors, pp. 112-113:

    This error is generated by any command that transfers vertices to the GL or launches compute work if:

      * (last bullet, p. 112) One program object is active... first program object was active. The active compute shader is ignored for the purposes of this test.

      * (2nd bullet, p. 113) There is no current program specified by UseProgram, there is a current program pipeline object, and the current program for any shader stage has been relinked since...

      * (3rd bullet, p. 113) Any two active samplers in the set of active program objects are of different types but refer to the same texture image unit.

      * (4th bullet, p. 113) The sum of the number of active samplers for each active program exceeds the maximum number of texture image units allowed.

    Modify the paragraph describing ValidateProgram, p. 113:

    ... If validation succeeded, ... set to FALSE. If validation succeeded, no INVALID_OPERATION validation error will be generated if <program> were made current via UseProgram, given the current state. If validation failed, such errors will be generated under the current state.

    Modify the paragraph describing ValidateProgramPipeline, p. 114:

    ... can be queried with GetProgramPipelineiv (see section 6.1.12). If validation succeeded, no INVALID_OPERATION validation error will be generated if <pipeline> were bound and no program were made current via UseProgram, given the current state. If validation failed, such errors will be generated under the current state.

    In subsection 2.11.12, "Shader Execution":

    Add to the list of implementation dependent constants under the "Texture Access" sub-heading:

        MAX_COMPUTE_TEXTURE_IMAGE_UNITS (for compute shaders),

    Add to the list of implementation dependent constants under the "Atomic Counter Access" sub-heading:

        MAX_COMPUTE_ATOMIC_COUNTERS (for compute shaders),

    Add to the list of implementation dependent constants under the "Image Access" sub-heading:

        MAX_COMPUTE_IMAGE_UNIFORMS (for compute shaders),

    In section 2.16, "Conditional Rendering", modify the sentence describing conditional rendering, starting with "In this case"...

    In this case, all drawing commands (see section 2.8.3), as well as Clear and ClearBuffer* (see section 4.2.3), and compute dispatch through DispatchCompute* (see section 5.5), have no effect.

    In the "Shared Memory Access Synchronization" subsection of section 2.11.13, "Shader Memory Access", modify the description of COMMAND_BARRIER_BIT (p. 118):

      * COMMAND_BARRIER_BIT: Command data sourced from buffer objects by Draw*Indirect and DispatchComputeIndirect commands ... The buffer objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER and DISPATCH_INDIRECT_BUFFER bindings.

    In subsection 2.17.7, "Uniform Variables", replace the paragraph beginning "If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,"... with:

    If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER, UNIFORM_BLOCK_REFERENCED_BY_TESS_CONTROL_SHADER, UNIFORM_BLOCK_REFERENCED_BY_TESS_EVALUATION_SHADER, UNIFORM_BLOCK_REFERENCED_BY_GEOMETRY_SHADER, UNIFORM_BLOCK_REFERENCED_BY_FRAGMENT_SHADER or UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER, then a boolean value indicating whether the uniform block identified by uniformBlockIndex is referenced by the vertex, tessellation control, tessellation evaluation, geometry, fragment or compute programming stages of <program>, respectively, is returned.

    Also in subsection 2.17.7, "Uniform Variables", replace the paragraph beginning, "If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER" on p.80 with:

    If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER, ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_CONTROL_SHADER, ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_EVALUATION_SHADER, ATOMIC_COUNTER_BUFFER_REFERENCED_BY_GEOMETRY_SHADER, ATOMIC_COUNTER_BUFFER_REFERENCED_BY_FRAGMENT_SHADER or ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER, then a single boolean value indicating whether the atomic counter buffer identified by bufferIndex is referenced by the vertex, tessellation control, tessellation evaluation, geometry, fragment or compute programming stages of <program>, respectively, is returned.

    Under the sub-heading "Uniform Blocks" in subsection 2.11.17, replace the sentence beginning "The limits for vertex, tessellation ..." on p.92 with:

    The limits for vertex, tessellation, geometry, fragment and compute shaders can be obtained by calling GetIntegerv with <pname> set to MAX_VERTEX_UNIFORM_BLOCKS, MAX_TESS_CONTROL_UNIFORM_BLOCKS, MAX_TESS_EVALUATION_UNIFORM_BLOCKS, MAX_GEOMETRY_UNIFORM_BLOCKS, MAX_FRAGMENT_UNIFORM_BLOCKS and MAX_COMPUTE_UNIFORM_BLOCKS, respectively.

    Under the sub-heading "Atomic Counter Buffers" in subsection 2.11.17, replace the sentence beginning "The limits for vertex, geometry, ..." on p.96 with:

    The limits for vertex, tessellation, geometry, fragment and compute shaders can be obtained by calling GetIntegerv with <pname> set to MAX_VERTEX_ATOMIC_COUNTER_BUFFERS, MAX_TESS_CONTROL_ATOMIC_COUNTER_BUFFERS, MAX_TESS_EVALUATION_ATOMIC_COUNTER_BUFFERS, MAX_GEOMETRY_ATOMIC_COUNTER_BUFFERS, MAX_FRAGMENT_ATOMIC_COUNTER_BUFFERS and MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS, respectively.

Additions to Chapter 3 of the OpenGL 4.2 (Core Profile) Specification (Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 4.2 (Core Profile) Specification (Per-Fragment Operations and the Framebuffer)

    None.

Additions to Chapter 5 of the OpenGL 4.2 (Core Profile) Specification (Special Functions)

    Add Section 5.5, "Compute Shaders"

    In addition to graphics-oriented shading operations such as vertex, tessellation, geometry and fragment shading, generic computation may be performed by the GL through the use of compute shaders. The compute pipeline is a form of single-stage machine that runs generic shaders. Compute shaders are created as described in section 2.11.1 using a <type> parameter of COMPUTE_SHADER. They are attached to and used in program objects as described in section 2.11.3.

    Compute workloads are formed from groups of work items called _workgroups_ and processed by the executable code for a compute program. A workgroup is a collection of shader invocations that execute the same code, potentially in parallel. An invocation within a workgroup may share data with other members of the same workgroup through shared variables and issue memory and control barriers to synchronize with other members of the same workgroup. One or more workgroups are launched by calling:

        void DispatchCompute(uint num_groups_x,
                             uint num_groups_y,
                             uint num_groups_z);

    Each workgroup is processed by the active program object for the compute shader stage. The error INVALID_OPERATION will be generated if there is no active program object for the compute shader stage.
    The active program for the compute shader stage will be determined in the same manner as the active program for other pipeline stages, as described in section 2.11.3. While the individual shader invocations within a workgroup are executed as a unit, workgroups are executed completely independently and in unspecified order.

    <num_groups_x>, <num_groups_y> and <num_groups_z> specify the number of workgroups that will be dispatched in the X, Y and Z dimensions, respectively. The builtin vector variable gl_NumWorkGroups will be initialized with the contents of the <num_groups_x>, <num_groups_y> and <num_groups_z> parameters. The maximum number of workgroups that may be dispatched at one time may be determined by calling GetIntegeri_v with <target> set to MAX_COMPUTE_WORK_GROUP_COUNT and <index> set to zero, one, or two, representing the X, Y, and Z dimensions, respectively. The values of <num_groups_x>, <num_groups_y> and <num_groups_z> must be less than or equal to the maximum workgroup count for the corresponding dimension, otherwise an INVALID_VALUE error is generated. If the workgroup count in any dimension is zero, no workgroups are dispatched.

    The workgroup size in each dimension is specified at compile time using an input layout qualifier in one or more of the compute shaders attached to the program (see Section 4 of the OpenGL Shading Language Specification). After the program has been linked, the workgroup size of the program may be retrieved by calling GetProgramiv with <pname> set to COMPUTE_WORK_GROUP_SIZE. This will return an array of three integers containing the workgroup size of the compute program as specified by its input layout qualifier(s). If <program> is the name of a program that has not been successfully linked, or is the name of a linked program object that contains no compute shaders, then an INVALID_OPERATION error is generated.

    The maximum size of a workgroup may be determined by calling GetIntegeri_v with <target> set to MAX_COMPUTE_WORK_GROUP_SIZE and <index> set to 0, 1, or 2 to retrieve the maximum work size in the X, Y and Z dimension, respectively. Furthermore, the maximum number of invocations in a single workgroup (i.e., the product of the three dimensions) may be determined by calling GetIntegerv with <pname> set to MAX_COMPUTE_WORK_GROUP_INVOCATIONS.

    The command

        void DispatchComputeIndirect(intptr indirect);

    is equivalent (assuming no errors are generated) to calling DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z> initialized with the three uint values contained in the buffer currently bound to the DISPATCH_INDIRECT_BUFFER binding at an offset, in basic machine units, specified by <indirect>. The error INVALID_VALUE is generated if <indirect> is less than zero or is not a multiple of four. The error INVALID_OPERATION is generated if no buffer is bound to DISPATCH_INDIRECT_BUFFER, if the command would source data beyond the end of the buffer object, or if there is no active program for the compute shader stage. If any of <num_groups_x>, <num_groups_y> or <num_groups_z> is greater than MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension then the results are undefined.

    Add Subsection 5.5.1, "Compute Shader Variables"

    Compute shaders can access variables belonging to the current program object. The amount of storage in the default uniform block accessed by a compute shader is specified by the value of the implementation dependent constant MAX_COMPUTE_UNIFORM_COMPONENTS. The total amount of combined storage available for uniform variables in all uniform blocks accessed by a compute shader (including the default uniform block) is specified by the implementation dependent constant MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS.

    There is a limit to the total size of all variables declared as shared in a single program object. This limit, expressed in basic machine units, may be queried as the value of MAX_COMPUTE_SHARED_MEMORY_SIZE.

Additions to Chapter 6 of the OpenGL 4.2 (Core Profile) Specification (State and State Requests)

    None.
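
    The dispatch rules of Section 5.5 can be illustrated with a short C sketch. It is not part of the specification language and makes no GL calls. The type dispatch_limits and the functions groups_for, dispatch_counts_valid and indirect_offset_valid are hypothetical names introduced here for illustration; the limits are assumed to have been queried beforehand through GetIntegeri_v with MAX_COMPUTE_WORK_GROUP_COUNT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the per-dimension limits that an application
 * would query with GetIntegeri_v and MAX_COMPUTE_WORK_GROUP_COUNT. */
typedef struct {
    uint32_t max_count[3];
} dispatch_limits;

/* Smallest number of workgroups such that groups * local_size >= items. */
static uint32_t groups_for(uint32_t items, uint32_t local_size)
{
    assert(local_size > 0u);
    return items / local_size + (items % local_size != 0u ? 1u : 0u);
}

/* True when no group count exceeds its limit, i.e. when DispatchCompute
 * would not generate INVALID_VALUE. A zero count is valid and simply
 * dispatches nothing. */
static bool dispatch_counts_valid(const dispatch_limits *lim,
                                  const uint32_t groups[3])
{
    for (int i = 0; i < 3; ++i) {
        if (groups[i] > lim->max_count[i]) {
            return false;
        }
    }
    return true;
}

/* True when an indirect offset is non-negative, a multiple of four basic
 * machine units, and leaves room for three uints before the buffer end. */
static bool indirect_offset_valid(intptr_t indirect, size_t buffer_size)
{
    const size_t needed = 3u * sizeof(uint32_t);
    if (indirect < 0 || indirect % 4 != 0) {
        return false;
    }
    return buffer_size >= needed && (size_t)indirect <= buffer_size - needed;
}
```

    An application covering 1000 work items with a local size of 64 would, under these assumptions, pass groups_for(1000, 64) as <num_groups_x>. Note that the group counts read by DispatchComputeIndirect are not checked by the GL: exceeding the limits there yields undefined results rather than an error, so a check like dispatch_counts_valid would have to be applied by whatever writes the buffer.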

Additions to Chapter 2 of the OpenGL Shading Language Specification, Version 4.20 (Overview of OpenGL Shading)

    Replace the last sentence of the first paragraph of the overview with the following:

    "Currently, these processors are the vertex, tessellation control, tessellation evaluation, geometry, fragment, and compute processors."

    Replace the last sentence of the second paragraph of the overview with the following:

    "The specific languages will be referred to by the name of the processor they target: vertex, tessellation control, tessellation evaluation, geometry, fragment, or compute."

    Add a new Section 2.6 titled "Compute Processor" with the following text:

    "The compute processor is a programmable unit that operates independently from the other shader processors. Compilation units written in the OpenGL Shading Language to run on this processor are called compute shaders. When a complete set of compute shaders are compiled and linked, they result in a compute shader executable that runs on the compute processor.

    A compute shader has access to many of the same resources as fragment and other shader processors, such as textures, buffers, image variables, atomic counters, and so on. It does not have any predefined inputs nor any fixed-function outputs. It is not part of the graphics pipeline and its visible side effects are through actions on images, storage buffers, and atomic counters.

    A compute shader operates on a group of work items called a workgroup. A workgroup is a collection of shader invocations that execute the same code, potentially in parallel. An invocation within a workgroup may share data with other members of the same workgroup through shared variables and issue memory and control barriers to synchronize with other members of the same workgroup."

Additions to Chapter 4 of the OpenGL Shading Language Specification, Version 4.20 (Variables and Types)

    Modify section 4.4.1, second paragraph from

    "All shaders allow input layout qualifiers on input variable declarations."

    to

    "All shaders, except compute shaders, allow input layout location qualifiers on input variable declarations."

    Modify Section 4.3. Add to the table at the start of Section 4.3:

    +-------------------+-----------------------------------------------------------+
    | Storage Qualifier | Meaning                                                   |
    +-------------------+-----------------------------------------------------------+
    | shared            | variable storage is shared across all work items in a     |
    |                   | workgroup for compute shaders                             |
    +-------------------+-----------------------------------------------------------+

    Add the following paragraph to Section 4.3.4, "Input Variables"

    Compute shaders do not permit user-defined input variables and do not form a formal interface with any other shader stage. See section 7.1 for a description of built-in compute shader input variables. All other input to a compute shader is retrieved explicitly through image loads, texture fetches, loads from uniforms or uniform buffers, or other user supplied code. Redeclaration of built-in input variables in compute shaders is not permitted.

    Add the following paragraph to Section 4.3.6, "Output Variables"

    Compute shaders have no built-in output variables, do not support user-defined output variables and do not form a formal interface with any other shader stage. All outputs from a compute shader take the form of side effects such as image stores and operations on atomic counters.

    Add Section 4.3.7, "Shared", renumber subsequent sections

    The shared qualifier is used to declare variables that have storage shared between all work items of a compute shader workgroup. Variables declared as shared may only be used in compute shaders (see Section 5.5, "Compute Shaders"). Shared variables are implicitly coherent. That is, writes to shared variables from one shader invocation will eventually be seen by other invocations within the same workgroup.

    Variables declared as shared may not have initializers and their contents are undefined at the beginning of shader execution.
    Any data written to shared variables will be visible to other invocations executing the same shader within the same workgroup. Order of execution with regards to reads and writes to the same shared variables by different invocations of a shader is not defined. In order to achieve ordering with respect to reads and writes to shared variables, memory barriers must be employed using the barrier() function (see Section 8.15).

    There is a limit to the total size of all variables declared as shared in a single program object. This limit, expressed in basic machine units, may be determined by using the OpenGL API to query the value of MAX_COMPUTE_SHARED_MEMORY_SIZE.

    Add Section 4.4.1.4, "Compute-Shader Inputs"

    There are no layout location qualifiers for compute shader inputs. Layout qualifier identifiers for compute shader inputs are the workgroup size qualifiers:

        layout-qualifier-id
            local_size_x = integer-constant
            local_size_y = integer-constant
            local_size_z = integer-constant

    local_size_x, local_size_y, and local_size_z are used to define the local size of the kernel defined by the compute shader in the first, second, and third dimension, respectively. The default size in each dimension is 1. If a shader does not specify a size for one of the dimensions, that dimension will have a size of 1.

    For example, the following declaration in a compute shader

        layout (local_size_x = 32, local_size_y = 32) in;

    is used to declare a two-dimensional compute shader with a local size of 32 x 32 elements, which is equivalent to a three-dimensional compute shader where the third dimension is one element deep.

    As another example, the declaration

        layout (local_size_x = 8) in;

    effectively specifies that a one-dimensional compute shader is being compiled, and its size is 8 elements.

    If the local size of the shader in any dimension is greater than the maximum size supported by the implementation for that dimension, a compile-time error results. Also, if such a layout qualifier is declared more than once in the same shader, all those declarations must indicate the same workgroup size; otherwise a compile-time error results. If multiple compute shaders attached to a single program object declare the workgroup size, the declarations must be identical; otherwise a link-time error results.

    Furthermore, if a program object contains any compute shaders, at least one must contain an input layout qualifier specifying the workgroup sizes of the program, or a link-time error will occur.

Additions to Chapter 7 of the OpenGL Shading Language Specification, Version 4.20 (Built-in Variables)

    Add to the start of Section 7.1, "Built-In Language Variables", before the description of the vertex language built-in variables:

    In the compute language, the built-in variables are declared as follows:

        // workgroup dimensions
        in    uvec3 gl_NumWorkGroups;
        const uvec3 gl_WorkGroupSize;

        // workgroup and invocation IDs
        in    uvec3 gl_WorkGroupID;
        in    uvec3 gl_LocalInvocationID;

        // derived variables
        in    uvec3 gl_GlobalInvocationID;
        in    uint  gl_LocalInvocationIndex;

    Add to the end of Section 7.1, before Section 7.1.1:

    The built-in variable gl_NumWorkGroups is a compute-shader input variable containing the number of workgroups in each dimension that will execute the compute shader. Its content is equal to the values specified in the <num_groups_x>, <num_groups_y>, and <num_groups_z> parameters passed to the DispatchCompute API entry point.

    The built-in constant gl_WorkGroupSize is a compute-shader constant containing the workgroup size of the shader. The size of the workgroup in the X, Y, and Z dimensions is stored in the x, y, and z components. The values stored in gl_WorkGroupSize match those specified in the required local_size_x, local_size_y, and local_size_z layout qualifiers for the current shader. This value is constant so that it can be used to size arrays of memory that can be shared within the workgroup.
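
    Because gl_WorkGroupSize is a constant, the storage needed by a shared array sized from it can be estimated before a shader is compiled. The following C sketch shows that arithmetic. The function names are hypothetical, the limit passed in is assumed to be the value queried for MAX_COMPUTE_SHARED_MEMORY_SIZE, and any padding an implementation may add to shared variables is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invocations in one workgroup: the product of the three local sizes,
 * i.e. gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z.
 * Assumes the sizes are small enough that the product fits in 64 bits. */
static uint64_t invocations_per_group(const uint32_t local_size[3])
{
    return (uint64_t)local_size[0] * local_size[1] * local_size[2];
}

/* Basic machine units needed by a shared array that holds
 * elems_per_invocation elements of elem_bytes each for every invocation.
 * Hypothetical helper; implementation padding is not modelled. */
static uint64_t shared_bytes_needed(const uint32_t local_size[3],
                                    uint32_t elems_per_invocation,
                                    uint32_t elem_bytes)
{
    assert(elem_bytes > 0u);
    return invocations_per_group(local_size) * elems_per_invocation * elem_bytes;
}

/* True when the estimate does not exceed the queried limit. */
static bool fits_shared_limit(const uint32_t local_size[3],
                              uint32_t elems_per_invocation,
                              uint32_t elem_bytes,
                              uint64_t max_shared_bytes)
{
    return shared_bytes_needed(local_size, elems_per_invocation, elem_bytes)
           <= max_shared_bytes;
}
```

    For a 32 x 32 x 1 workgroup storing one 4-byte value per invocation, shared_bytes_needed reports 4096 basic machine units, which would then be compared against the queried limit.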

    The built-in variable gl_WorkGroupID is a compute-shader input variable containing the 3-dimensional index of the workgroup that the current invocation is executing in. The possible values range across the parameters passed into DispatchCompute, i.e., from (0, 0, 0) to (gl_NumWorkGroups.x - 1, gl_NumWorkGroups.y - 1, gl_NumWorkGroups.z - 1).

    The built-in variable gl_LocalInvocationID is a compute-shader input variable containing the 3-dimensional index of the current invocation within its workgroup. The possible values for this variable range across the workgroup size, i.e. (0,0,0) to (gl_WorkGroupSize.x - 1, gl_WorkGroupSize.y - 1, gl_WorkGroupSize.z - 1).

    The built-in variable gl_GlobalInvocationID is a compute shader input variable containing the global index of the current work item. This value uniquely identifies this invocation from all other invocations across all workgroups initiated by the current DispatchCompute call. This is computed as:

        gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;

    The built-in variable gl_LocalInvocationIndex is a compute shader input variable that contains the 1-dimensional representation of the gl_LocalInvocationID. This is useful for identifying a unique region of shared memory within the workgroup for this invocation to use. This is computed as:

        gl_LocalInvocationIndex = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
                                  gl_LocalInvocationID.y * gl_WorkGroupSize.x +
                                  gl_LocalInvocationID.x;

    Add to the list of built-in constants in Section 7.3:

        const ivec3 gl_MaxComputeWorkGroupCount = { 65535, 65535, 65535 };
        const ivec3 gl_MaxComputeWorkGroupSize = { 1024, 1024, 64 };
        const int gl_MaxComputeUniformComponents = 512;
        const int gl_MaxComputeTextureImageUnits = 16;
        const int gl_MaxComputeImageUniforms = 8;
        const int gl_MaxComputeAtomicCounters = 8;
        const int gl_MaxComputeAtomicCounterBuffers = 1;

Additions to Chapter 8 of the OpenGL Shading Language Specification, Version 4.20 (Built-in Functions)

    Insert "Atomic Memory Functions" section after Section 8.10, Atomic Counter Functions (p. 149).

    Atomic memory operations are supported on shared variables; the set of operations and their definitions are similar to those for the imageAtomic*() functions. These functions are fully documented in the ARB_shader_storage_buffer_object extension (see dependencies).

    Modify the first paragraph of Section 8.15, "Shader Invocation Control Functions" to read:

    The shader invocation control function is only available in tessellation control shaders and compute shaders. It is used to control the relative execution order of multiple shader invocations used to process a patch (in the case of tessellation control shaders) or a workgroup (in the case of compute shaders), which are otherwise executed with an undefined order.

    +----------------+--------------------------------------------------------------------------+
    | Syntax         | Description                                                              |
    +----------------+--------------------------------------------------------------------------+
    | barrier        | For any given static instance of barrier() appearing in a tessellation   |
    |                | control shader or compute shader, all invocations for a single patch     |
    |                | or workgroup, respectively, must enter it before any will continue       |
    |                | beyond it.                                                               |
    +----------------+--------------------------------------------------------------------------+

    Modify the second paragraph as follows:

    ... Because invocations may execute in an undefined order between these barrier calls, the values of a per-vertex or per-patch output variable in a tessellation control shader or shared variables for compute shaders will be undefined in a number of cases enumerated in Section 4.3.6 "Output Variables" (for tessellation control shaders) and Section 4.3.7 "Shared" (for compute shaders).

    Replace the third paragraph with the following:

    For tessellation control shaders, the barrier() function may only be placed inside the function main() of the tessellation control shader and may not be called within any control flow. Barriers are also disallowed after a return statement in the function main(). Any such misplaced barriers result in a compile-time error.

    For compute shaders, the barrier() function may be placed within flow control, but that flow control must be uniform flow control. That is, all the controlling expressions that lead to execution of the barrier must be dynamically uniform expressions. This ensures that if any shader invocation enters a conditional statement, then all invocations will enter it. While compilers are encouraged to give warnings if they can detect this might not happen, compilers cannot completely determine this. Hence, it is the author's responsibility to ensure barrier() only exists inside uniform flow control. Otherwise, some shader invocations will stall indefinitely, waiting for a barrier that is never reached by other invocations.
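
    The interplay between the derived built-in variables of Section 7.1, shared storage and barrier() can be sketched with a sequential C model of a single workgroup. This is an illustration under stated assumptions, not shader code: the workgroup size of 4 x 2 x 1 is arbitrary, the names local_index, global_id and workgroup_sum are hypothetical, and each point marked "barrier()" shows where a compute shader would need the call.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative workgroup size, as if the shader declared
 * layout (local_size_x = 4, local_size_y = 2) in; */
enum { SIZE_X = 4, SIZE_Y = 2, SIZE_Z = 1, GROUP = SIZE_X * SIZE_Y * SIZE_Z };

/* gl_LocalInvocationIndex as derived from gl_LocalInvocationID. */
static uint32_t local_index(uint32_t x, uint32_t y, uint32_t z)
{
    return z * SIZE_X * SIZE_Y + y * SIZE_X + x;
}

/* One component of gl_GlobalInvocationID. */
static uint32_t global_id(uint32_t group_id, uint32_t group_size,
                          uint32_t local_id)
{
    return group_id * group_size + local_id;
}

/* Sequential model of one workgroup summing GROUP values through an array
 * that stands in for a shared variable. Each inner loop plays the role of
 * all invocations running one barrier-separated phase. */
static uint32_t workgroup_sum(const uint32_t input[GROUP])
{
    uint32_t shared_mem[GROUP];

    assert((GROUP & (GROUP - 1)) == 0); /* halving below needs a power of two */

    /* Phase 0: every invocation writes the slot named by its local index. */
    for (uint32_t z = 0; z < SIZE_Z; ++z) {
        for (uint32_t y = 0; y < SIZE_Y; ++y) {
            for (uint32_t x = 0; x < SIZE_X; ++x) {
                uint32_t idx = local_index(x, y, z);
                shared_mem[idx] = input[idx];
            }
        }
    }
    /* barrier() would go here. */

    /* The loop bound depends only on GROUP, so every invocation reaches the
     * barrier the same number of times: this is uniform flow control. */
    for (uint32_t stride = GROUP / 2; stride > 0; stride /= 2) {
        for (uint32_t idx = 0; idx < GROUP; ++idx) {
            if (idx < stride) {
                shared_mem[idx] += shared_mem[idx + stride];
            }
        }
        /* barrier() would go here, outside the "idx < stride" test. */
    }
    return shared_mem[0];
}
```

    Placing the barrier inside the "idx < stride" test instead would be the non-uniform case warned about above, since only some invocations of the workgroup would reach it.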

    Modify the table of memory control functions on p.160,

    +-----------------------------------+----------------------------------------------------------------------------------------+
    | Syntax                            | Description                                                                            |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrier()              | Control the ordering of all memory transactions issued by a single shader invocation.  |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierAtomicCounter() | Control the ordering of accesses to atomic counter variables issued by a single shader |
    |                                   | invocation.                                                                            |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierBuffer()        | Control the ordering of memory transactions to buffer variables issued within a        |
    |                                   | single shader invocation.                                                              |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierImage()         | Control the ordering of memory transactions to images issued within a single shader    |
    |                                   | invocation.                                                                            |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierShared()        | Control the ordering of memory transactions to shared variables issued within a single |
    |                                   | shader invocation.                                                                     |
    |                                   | Only available in compute shaders.                                                     |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void groupMemoryBarrier()         | Control the ordering of all memory transactions issued within a single shader          |
    |                                   | invocation, as viewed by other invocations in the same workgroup.                      |
    |                                   | Only available in compute shaders.                                                     |
    +-----------------------------------+----------------------------------------------------------------------------------------+

    Modify the subsequent paragraph as follows:

    The memory barrier built-in functions can be used to order reads and writes to variables stored in memory accessible to other shader invocations. When called, these functions will wait for the completion of all reads and writes previously performed by the caller that access selected variable types, and then return with no other effect. The built-in functions memoryBarrierAtomicCounter(), memoryBarrierBuffer(), memoryBarrierImage(), and memoryBarrierShared() wait for the completion of accesses to atomic counter, buffer, image, and shared variables, respectively. The built-in functions memoryBarrier() and groupMemoryBarrier() wait for the completion of accesses to all of the above variable types. The functions memoryBarrierShared() and groupMemoryBarrier() are available only in compute shaders; the other functions are available in all shader types.

    When these functions return, any memory stores performed using coherent variables prior to the call will be visible to any future coherent access to the same memory performed by any other shader invocation. In particular, the values written this way in one shader stage are guaranteed to be visible to coherent memory accesses performed by shader invocations in subsequent stages when those invocations were triggered by the execution of the original shader invocation (e.g., fragment shader invocations for a primitive resulting from a particular geometry shader invocation).

    Additionally, memory barrier functions order stores performed by the calling invocation, as observed by other shader invocations. Without memory barriers, if one shader invocation performs two stores to coherent variables, a second shader invocation might see the values written by the second store prior to seeing those written by the first.
    However, if the first shader invocation calls a memory barrier function
    between the two stores, selected other shader invocations will never see
    the results of the second store before seeing those of the first. When
    using the function groupMemoryBarrier(), this ordering guarantee applies
    only to other shader invocations in the same compute shader workgroup;
    all other memory barrier functions provide the guarantee to all other
    shader invocations. No memory barrier is required to guarantee the order
    of memory stores as observed by the invocation performing the stores; an
    invocation reading from a variable that it previously wrote will always
    see the most recently written value unless another shader invocation
    also wrote to the same memory.

Dependencies on OpenGL 4.3 and ARB_shader_storage_buffer_object

    If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported,
    the spec language adding the built-in functions atomicAdd(),
    atomicMin(), atomicMax(), atomicAnd(), atomicOr(), atomicXor(),
    atomicExchange(), and atomicCompSwap() should be considered to be
    incorporated into this extension as-is, except that buffer variables
    will not be supported and thus cannot be used with these functions. No
    "#extension" directive is necessary to use these functions in compute
    shaders.

    If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported,
    references to the GLSL built-in function memoryBarrierBuffer() should be
    removed.

Dependencies on NV_vertex_buffer_unified_memory

    If NV_vertex_buffer_unified_memory is supported, a new buffer address
    range and enable are provided to permit the use of
    DispatchComputeIndirect with a resident buffer object without requiring
    that it be bound to the DISPATCH_INDIRECT_BUFFER target.
    The following additional edits apply:

    Accepted by the target parameter of GetBufferParameterui64vNV:

        DISPATCH_INDIRECT_BUFFER                        (defined above)

    Accepted by the cap parameter of Disable, Enable, and IsEnabled, and by
    the pname parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev
    and GetInteger64v:

        DISPATCH_INDIRECT_UNIFIED_NV                    0x90FD

    Accepted by the pname parameter of BufferAddressRangeNV and the value
    parameter of GetIntegerui64vNV:

        DISPATCH_INDIRECT_ADDRESS_NV                    0x90FE

    Accepted by the pname parameter of GetIntegerv:

        DISPATCH_INDIRECT_LENGTH_NV                     0x90FF

    Add to the end of Section 5.5, after discussion of
    DispatchComputeIndirect:

    If DISPATCH_INDIRECT_UNIFIED_NV is enabled, DispatchComputeIndirect does
    not use the buffer bound to DISPATCH_INDIRECT_BUFFER. Instead, it
    sources its arguments from the GPU address range specified by calling
    BufferAddressRangeNV with a pname of DISPATCH_INDIRECT_ADDRESS_NV and an
    index of zero. The address is obtained by adding the indirect parameter
    to the base address of the range, specified by the address parameter of
    BufferAddressRangeNV. If the command sources data outside the specified
    address range, the error INVALID_OPERATION will be generated.

    The DISPATCH_INDIRECT_BUFFER binding will be ignored in this case, and
    no errors will be generated due to the use of this binding. The error
    INVALID_VALUE will still be generated if indirect is negative. No
    INVALID_VALUE error will be generated if indirect is not a multiple of
    four, but INVALID_OPERATION will be generated if the effective address
    is not a multiple of four.

    If the indirect dispatch address range does not belong to a buffer
    object that is resident at the time of the DispatchComputeIndirect call,
    undefined results, possibly including program termination, may occur.

    Add the following to the "Compute Dispatch State" table defined in this
    extension:

    Get Value                     Type  Get Command        Initial Value  Sec  Attribute
    ---------                     ----  -----------        -------------  ---  ---------
    DISPATCH_INDIRECT_UNIFIED_NV  B     IsEnabled          FALSE          5.5  none
    DISPATCH_INDIRECT_ADDRESS_NV  Z64+  GetIntegerui64vNV  0              5.5  none
    DISPATCH_INDIRECT_LENGTH_NV   Z+    GetIntegerv        0              5.5  none

Errors

    INVALID_OPERATION is generated by DispatchCompute or
    DispatchComputeIndirect if there is no active program for the compute
    shader stage.

    INVALID_VALUE is generated by DispatchCompute if any of num_groups_x,
    num_groups_y or num_groups_z is greater than the value of
    MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension.

    INVALID_VALUE is generated by DispatchComputeIndirect if indirect is
    less than zero or not a multiple of four.

    INVALID_OPERATION is generated by DispatchComputeIndirect if no buffer
    is bound to DISPATCH_INDIRECT_BUFFER or if the command would source data
    beyond the end of the bound buffer object.

    INVALID_OPERATION is generated by GetProgramiv if pname is
    COMPUTE_WORK_GROUP_SIZE and either the program has not been linked
    successfully, or has been linked but contains no compute shaders.

    LinkProgram will fail if program contains a combination of compute and
    non-compute shaders.

New State

    None.
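    The dispatch error conditions listed in the Errors section above can be
    summarized with a small, non-normative C model. Everything named
    "sketch_" below is hypothetical and exists only for this illustration;
    the order in which the checks run is an assumption, since the text does
    not say which error wins when several conditions hold at once. The model
    assumes the indirect command consists of three tightly packed unsigned
    integers (the three workgroup counts) and that the
    NV_vertex_buffer_unified_memory path is not in use.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the GL error codes. */
typedef enum {
    SKETCH_NO_ERROR,
    SKETCH_INVALID_VALUE,
    SKETCH_INVALID_OPERATION
} sketch_error;

/* Hypothetical snapshot of the state the checks depend on. */
typedef struct {
    int      has_active_compute_program; /* non-zero if the compute stage has an active program */
    uint32_t max_group_count[3];         /* MAX_COMPUTE_WORK_GROUP_COUNT, per dimension */
    int      indirect_buffer_bound;      /* non-zero if DISPATCH_INDIRECT_BUFFER has a buffer */
    int64_t  indirect_buffer_size;       /* size of that buffer, in bytes */
} sketch_state;

/* Models the checks described for DispatchCompute. */
sketch_error sketch_check_dispatch(const sketch_state *s,
                                   const uint32_t num_groups[3])
{
    if (!s->has_active_compute_program)
        return SKETCH_INVALID_OPERATION;
    for (int i = 0; i < 3; ++i) {
        if (num_groups[i] > s->max_group_count[i])
            return SKETCH_INVALID_VALUE;
    }
    return SKETCH_NO_ERROR;
}

/* Models the checks described for DispatchComputeIndirect. */
sketch_error sketch_check_dispatch_indirect(const sketch_state *s,
                                            int64_t indirect)
{
    /* Assumed command layout: three unsigned integers, 12 bytes total. */
    const int64_t args_size = 3 * (int64_t)sizeof(uint32_t);

    if (!s->has_active_compute_program)
        return SKETCH_INVALID_OPERATION;
    if (indirect < 0 || (indirect % 4) != 0)
        return SKETCH_INVALID_VALUE;
    if (!s->indirect_buffer_bound)
        return SKETCH_INVALID_OPERATION;
    /* The command must not read past the end of the bound buffer. */
    if (indirect > s->indirect_buffer_size - args_size)
        return SKETCH_INVALID_OPERATION;
    return SKETCH_NO_ERROR;
}
```

    For instance, with a 64-byte buffer bound, an offset of 52 passes in
    this model (52 + 12 = 64), while an offset of 56 would read beyond the
    end of the buffer and maps to INVALID_OPERATION.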
New Implementation Dependent State

    Add to Table 6.31, "Program Pipeline Object State"

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | COMPUTE_SHADER                                     | Z+        | GetProgramPipelineiv    | 0             | Name of current compute shader program object                         | 2.11.4  |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Add to Table 6.32, "Program Object State"

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | COMPUTE_WORK_GROUP_SIZE                            | 3 x Z+    | GetProgramiv            | { 0, ... }    | Workgroup size of a linked compute program                            | 5.5     |
    | UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER         | B         | GetActiveUniformBlockiv | FALSE         | True if uniform block is referenced by the compute stage              | 2.17.7  |
    | ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER | B         | GetActiveAtomicCounter- | FALSE         | AACB has a counter used by compute shaders                            | 2.17.7  |
    |                                                    |           | Bufferiv                |               |                                                                       |         |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Insert new table named "Compute Dispatch State", after Table 6.46
    "Hints":

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | DISPATCH_INDIRECT_BUFFER_BINDING                   | Z+        | GetIntegerv             | 0             | Indirect dispatch buffer binding                                      | 5.5     |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Insert Table 6.50, "Implementation Dependent Compute Shader Limits",
    renumber subsequent tables.

    +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
    | Get Value                               | Type      | Get Command   | Minimum Value       | Description                                                           | Sec.    |
    +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
    | MAX_COMPUTE_WORK_GROUP_COUNT            | 3 x Z+    | GetIntegeri_v | 65535               | Maximum number of workgroups that may be dispatched by a single       | 5.5     |
    |                                         |           |               |                     | dispatch command (per dimension)                                      |         |
    | MAX_COMPUTE_WORK_GROUP_SIZE             | 3 x Z+    | GetIntegeri_v | 1024 (x, y), 64 (z) | Maximum local size of a compute workgroup (per dimension)             | 5.5     |
    | MAX_COMPUTE_WORK_GROUP_INVOCATIONS      | Z+        | GetIntegerv   | 1024                | Maximum total compute shader invocations in a single workgroup        | 5.5     |
    | MAX_COMPUTE_UNIFORM_BLOCKS              | Z+        | GetIntegerv   | 12                  | Maximum number of uniform blocks per compute program                  | 2.11.7  |
    | MAX_COMPUTE_TEXTURE_IMAGE_UNITS         | Z+        | GetIntegerv   | 16                  | Maximum number of texture image units accessible by a compute shader  | 2.11.12 |
    | MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS      | Z+        | GetIntegerv   | 8                   | Number of atomic counter buffers accessed by a compute shader         | 2.11.17 |
    | MAX_COMPUTE_ATOMIC_COUNTERS             | Z+        | GetIntegerv   | 8                   | Number of atomic counters accessed by a compute shader                | 2.11.12 |
    | MAX_COMPUTE_SHARED_MEMORY_SIZE          | Z+        | GetIntegerv   | 32768               | Maximum total storage size of all variables declared as shared in     |         |
    |                                         |           |               |                     | all compute shaders linked into a single program object               |         |
    | MAX_COMPUTE_UNIFORM_COMPONENTS          | Z+        | GetIntegerv   | 512                 | Number of components for compute shader uniform variables             | 5.5.1   |
    | MAX_COMPUTE_IMAGE_UNIFORMS              | Z+        | GetIntegerv   | 8                   | Number of image variables in compute shaders                          | 2.11.12 |
    | MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS | Z+        | GetIntegerv   | *                   | Number of words for compute shader uniform variables in all uniform   | 5.5.1   |
    |                                         |           |               |                     | blocks, including the default                                         |         |
    +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+

    Modify Table 6.55, increasing the following minimum values:

        MAX_COMBINED_TEXTURE_IMAGE_UNITS    96 (6*16), was 80
        MAX_UNIFORM_BUFFER_BINDINGS         72 (6*12), was 60

Issues

    1) Should shared variables be usable only in compute shaders, or in
       other stages too?

       RESOLVED: Support only in compute shaders. While some hardware may be
       able to support shared variables in shader stages other than compute,
       it is difficult to clearly define what the semantics are as far as
       sharing. For example, what is the equivalent for a workgroup for
       vertex shaders?

    2) Can we expose atomics on shared variables?

       RESOLVED: Yes. The existing atomics in OpenGL 4.2 (via image
       variables) don't map well to the shared declaration. Instead, we've
       defined new atomic functions that take a shared variable as a first
       input. These functions are specified in the
       ARB_shader_storage_buffer_object extension and are incorporated into
       this extension via the interaction described above.

       We could have also chosen to define operators +=, &=, etc. to be
       atomic when applied to shared variables, but shaders may want to use
       such variables in cases where atomic access (and the related
       overhead) is not required.

    3) Should the local size and dimensions of the workgroup be specified at
       compile time? What are the default local dimensions?

       RESOLVED: Dimension is always 3 and a workgroup size declaration is
       compulsory at compile time. There is no default. The value used is
       queryable. To use a 1- or 2-dimensional workgroup, the extra
       dimension(s) can be set to 1.

    4) Do we need the local_work_size parameter in dispatch if the local
       size may be specified at compile time in the shader?

       RESOLVED: The specification of the workgroup size is now mandatory in
       the shader source at compile time and the local_work_size may no
       longer be specified at dispatch time.

    5) How do multiple shaders attached to a single program object work?

       RESOLVED: Just as with any other shader stage. Exactly one of the
       shaders must provide the 'main' entry point.
       All shaders attached to a program object effectively get compiled
       into a single, large program at link time. The program is dispatched
       as one big entity. Über shader type functionality can be achieved
       through the use of subroutine uniforms, which also work exactly as
       for other shader stages.

    6) Should compute dispatch honor conditional rendering?

       RESOLVED: Yes, it does honor conditional rendering.

    7) Is it possible to pass compute programs to UseProgram, etc.?

       RESOLVED: Yes, compute programs can be made current via UseProgram
       and can be made current in a program pipeline object via
       UseProgramStages. Note that a compute program must be linked with
       PROGRAM_SEPARABLE set to TRUE to be passed to UseProgramStages, even
       though the compute pipeline has only a single shader stage.

       The active compute program that will be used by DispatchCompute will
       be determined in the same manner as the active program for any other
       program stage:

         * If there is a current program specified via UseProgram, that
           program is considered current for all stages, including compute.

         * Otherwise, if there is a current program pipeline object, the
           program current for the compute stage of the pipeline object is
           considered current for the compute stage.

         * If neither of the former apply, no program is current for the
           compute stage.

       The program that is current for the compute stage is considered to be
       active if and only if it has a compute shader executable. For
       example, if a non-compute program is made current via UseProgram, it
       will also be considered "current" for the compute stage, but won't be
       considered active.

       When using program pipeline objects, it's possible to switch between
       graphics and compute work without switching programs.
       For example, in:

         glBindProgramPipeline(pipeline);
         glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, programA);
         glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, programB);
         glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
         glDrawArrays(GL_TRIANGLES, 0, 900);
         glDispatchCompute(5, 5, 5);

       the triangles will be processed by programA and programB, while the
       compute dispatch will be processed by programC. Similarly,

         glUseProgramStages(pipeline, ~GL_COMPUTE_SHADER_BIT, programAB);
         glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
         glDrawArrays(GL_TRIANGLES, 0, 900);
         glDispatchCompute(5, 5, 5);

       will have the triangles processed by the multi-stage programAB.

    8) What happens if you try to draw with no active compute program?

       RESOLVED: An INVALID_OPERATION error is generated if there is no
       active program for the compute shader stage.

    9) Should we increase minimums on certain replicated state bindings
       (texture image units, uniform buffer bindings) to reflect the
       addition of a sixth shader stage?

       RESOLVED: Yes, for MAX_COMBINED_TEXTURE_IMAGE_UNITS and
       MAX_UNIFORM_BUFFER_BINDINGS. These limits permit applications to
       statically partition the shared set of texture bindings into six
       separate sets, one per shader stage.

       The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it
       reflects the sum of the number of uniform blocks used in each stage
       of a single program. Since no single program can have more than five
       stages, this limit doesn't need to be increased.

    10) How do the shader built-in variables relate to DirectCompute's
        built-in system values (SV_*)?

        OpenGL Compute             DirectCompute
        --------------------------------------------------
        gl_NumWorkGroups           --
        gl_WorkGroupSize           --
        gl_WorkGroupID             SV_GroupID
        gl_LocalInvocationID       SV_GroupThreadID
        gl_GlobalInvocationID      SV_DispatchThreadID
        gl_LocalInvocationIndex    SV_GroupIndex

    11) How does "program validation" (checking the active programs against
        the current state) apply to DispatchCompute?
        RESOLVED: The same program validation logic will be applied to both
        graphics primitives (e.g., DrawArrays) and compute dispatches.
        Conditions that will cause validation errors for graphics primitives
        will also cause validation errors for compute dispatch, even if the
        conditions wouldn't otherwise affect compute, for example:

          * Mis-configured program pipeline objects (e.g., inserting a
            geometry program A between the linked vertex and fragment
            shaders of program B).

          * A graphics program has a vertex shader that uses a 2D texture
            from texture image unit 0 and a fragment shader that uses a 3D
            texture from texture image unit 0.

        Similarly, validation errors specific to the compute shader
        executable (e.g., using different targets on a single texture image
        unit in a compute program) will generate validation errors for
        graphics Draw* calls.

        We chose to specify this behavior for several reasons. First, using
        the same logic in both places ensures a single result for
        ValidateProgram and ValidateProgramPipeline (a single
        VALIDATE_STATUS value wouldn't be good enough if the result could be
        different for compute and graphics). Additionally, a single test
        allows implementations to set up state and perform validation tests
        for compute and graphics operations at the same time, without
        requiring additional irregular graphics- or compute-specific logic.

    12) We specify an INVALID_OPERATION error for DispatchCompute when there
        is no active program on the compute stage. Should we specify
        similar errors for Draw* calls if the current program specified by
        UseProgram is a compute program?

        RESOLVED: Not in the current spec. If a compute shader is made
        current with UseProgram, there will be no active program for either
        the vertex or fragment stages. In this case, the results of vertex
        and fragment processing are undefined, but no error is generated.
        This behavior is already specified in unextended OpenGL 4.2.
        We don't generate errors in this case for several reasons:

          * For the compatibility profile, fixed-function vertex and
            fragment processing is available, and INVALID_OPERATION wouldn't
            make sense there.

          * Even in the core profile, there are cases where no active
            fragment shader is needed (e.g., primitives with
            RASTERIZER_DISCARD enabled).

        While there is no case where having only a compute program makes
        sense, at least in the core profile, we chose to keep the same
        undefined behavior that's already in place.

    13) Should we provide any additional support extending the
        memoryBarrier() GLSL built-in function provided by
        ARB_shader_image_load_store and GLSL 4.20?

        RESOLVED: Yes. The memoryBarrier() function provided by GLSL 4.20
        requires (a) synchronizing all memory transactions that might be
        visible to other shader invocations and (b) ordering memory
        transactions so that all other shader invocations never see stores
        issued after the barrier before seeing stores issued before the
        barrier.

        Hardware implementations of GLSL 4.20 may have a high degree of
        parallelism, where the memory subsystem servicing shader loads and
        stores may have multiple independent sub-units, and where the shader
        invocations themselves may be executed in parallel on many shader
        cores. The memoryBarrier() command may be fairly heavyweight,
        requiring synchronization with all memory sub-units and shader
        cores.

        We provide new functions in two different directions that might
        serve as lighter weight alternatives to memoryBarrier().
        In particular, we provide four new functions

          void memoryBarrierAtomicCounter();
          void memoryBarrierBuffer();
          void memoryBarrierImage();
          void memoryBarrierShared();

        that order transactions of only a specific memory type and might
        require synchronization with fewer sub-units of the memory
        subsystem, and a new function:

          void groupMemoryBarrier();

        that only orders transactions as viewed by other threads in the same
        workgroup, which might not require synchronization with other shader
        cores. Since shared memory is only accessible to threads within a
        single workgroup, memoryBarrierShared() also only requires
        synchronization with other threads in the same workgroup.

Revision History

    Rev.  Date      Author     Changes
    ----  --------  ---------  -----------------------------------------
      28  12/10/18  Jon Leech  Use 'workgroup' consistently throughout
                               (Bug 11723, internal API issue 87).

      27  07/24/14  Jon Leech  Change value of GLSL limit
                               gl_MaxComputeUniformComponents to 512 for
                               consistency with the API (Bug 12370).

      26  01/30/14  Jon Leech  Add table 6.31 COMPUTE_SHADER entry for
                               program pipeline objects (Bug 11539).

      25  10/23/12  pbrown     Remove the restriction forbidding the use
                               of barrier() inside potentially divergent
                               flow control. Instead, we will allow
                               barrier() to be executed anywhere, but
                               specify undefined results (including hangs
                               or program termination) if the flow
                               control is divergent (bug 9367).

      24  07/01/12  Jon Leech  Fix typo (bug 8984).

      23  06/28/12  johnk      Remove two other references to "thread",
                               add "Only available in compute shaders" to
                               the table for memoryBarrierShared() and
                               groupMemoryBarrier(), fixed a typo.

      22  06/22/12  pbrown     Add a new built-in memoryBarrierBuffer() as
                               an interaction with
                               ARB_shader_storage_buffer. Add a new
                               built-in groupMemoryBarrier() that orders
                               memory transactions only as observed by
                               other shader invocations in the same work
                               group. Enhance the description of the GLSL
                               memory barrier functions. Add issue 13
                               about the new memory barrier functions
                               added in this extension (bug 9199).

                               Mark issues 11 and 12 as resolved.

                               Add NV_vertex_buffer_unified_memory
                               interaction allowing
                               DispatchComputeIndirect to read its
                               arguments from any resident buffer object
                               instead of the single bound indirect
                               dispatch buffer.

      21  06/21/12  gsellers   Clarify that there are no built-in inputs
                               or outputs in compute shaders (bug 9200).

      20  06/21/12  gsellers   Throw INVALID_OPERATION if querying
                               COMPUTE_WORK_GROUP_SIZE from unlinked
                               program or program with no compute shader
                               (bug 9117).

      19  06/18/12  pbrown     DispatchComputeIndirect throws
                               INVALID_VALUE if the offset is negative or
                               misaligned (bug 9181).

      18  06/17/12  pbrown     Clarify that compute-only programs can be
                               used by both UseProgram and
                               UseProgramStages, and add a
                               COMPUTE_SHADER_BIT for UseProgramStages
                               (bug 9155). Specify that validation errors
                               checking programs against each other and
                               the GL state apply equally to graphics
                               primitives (Draw*) and compute dispatches.
                               Update issue 7; add new issues 11 and 12.

                               Clarify that compute shader invocations in
                               a workgroup are run "potentially in
                               parallel", but not "in lockstep"
                               (bug 9151).

                               Other minor wording improvements.

      17  06/15/12  johnk      Don't allow location layout qualifiers for
                               compute shader inputs.

      16  06/15/12  johnk      In the intro material, allow work groups to
                               only potentially execute in parallel, and
                               use control barriers to synchronize. Other
                               minor fixes.

      15  06/15/12  dgkoch     Added Additions to Ch.2 of Shading
                               Language. Renamed shader built-in
                               variables, explained them better, made
                               them uvec3 instead of int[3]. Added
                               derived shading language variables.
                               Renamed and changed built-in constants for
                               consistency with the variables. Removed
                               gl_MaxComputeWorkDimensions since it is no
                               longer necessary. Renamed API constants to
                               be consistent with shading language
                               terminology. Remove a few rogue references
                               to variable number of dispatch arguments.
                               Added Issue 10. (bugs 9151, 9167)

      14  06/14/12  pbrown     Modify DispatchComputeIndirect to accept an
                               "intptr"-typed offset instead of a
                               "void *", since it doesn't accept pointers
                               to client memory. Modify
                               DispatchComputeIndirect to use a new
                               buffer binding (DISPATCH_INDIRECT_BUFFER)
                               instead of sharing the binding used by
                               Draw*Indirect. Add missing entries in the
                               "New Tokens" section and assign values.
                               Update documentation of
                               COMMAND_BARRIER_BIT to reflect the new
                               dispatch indirect binding. Document
                               DispatchComputeIndirect errors for offsets
                               that are negative, misaligned, or run off
                               the end of the bound buffer. Increase
                               minimums for combined texture image units
                               and uniform buffer bindings to reflect the
                               new stage. Update various issues, add new
                               issue 9 (bug 9130).

      13  06/14/12  Jon Leech  Copy description of
                               MAX_COMPUTE_SHARED_MEMORY_SIZE into API
                               spec from GLSL spec (bug 9069).

      12  05/14/12  pbrown     Add interaction with
                               ARB_shader_storage_buffer_object. The
                               built-in functions provided there for
                               atomic memory operations on buffer
                               variables are also supported for the
                               shared variables provided here. The
                               functions themselves are documented fully
                               in the other specification.

      11  05/14/12  johnk      Keep the previous logical contents of the
                               last paragraph of the memory shader
                               control functions.

      10  04/26/12  gsellers   Count max compute shared variable size in
                               bytes. Make shared variables implicitly
                               coherent. Add
                               MAX_COMPUTE_UNIFORM_COMPONENTS. Clean up
                               MAX_COMPUTE_IMAGE_UNIFORMS.

       9  04/25/12  gsellers   Add
                               UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER
                               and
                               ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER.
                               Remove from dispatch APIs. Add
                               memoryBarrier{Image,Shared,AtomicCounter}().

       8  04/05/12  gsellers   Remove ARB suffixes.

       7  02/02/12  gsellers   Require OpenGL 4.2. Add issue 8. Up various
                               minimums. Remove variable dimensionality.

       6  01/24/12  gsellers   Require OpenGL 3.0. Incorporate feedback
                               from bmerry. Add compute shader constants
                               to sec. 7.7. Add modifications to sec.
                               8.15 of the GLSL spec. Add issue 7.

       5  01/20/12  gsellers   Make compute dispatch honor conditional
                               rendering. Add indirect dispatch. Change
                               'global work size' to 'num work groups',
                               make global size in multiples of work
                               group size.

       4  01/10/12  gsellers   Fix typos and other small corrections. Make
                               specification of work group size at
                               compile time compulsory. Add
                               COMPUTE_WORK_DIMENSION_ARB and
                               COMPUTE_LOCAL_WORK_SIZE_ARB queries. Add
                               issue (5), resolve issues (3) and (4).

       3  01/09/12  gsellers   Change from AMD to ARB. Update to be
                               relative to OpenGL 4.2 (+GLSL 4.20). Add
                               shared variables. Add issues (1) - (4).
                               Add link failure for programs that contain
                               compute and non-compute shaders.

       2  06/10/11  gsellers   Add error behavior. Shading language
                               changes. Add global_offset parameter. Add
                               implementation dependent limits.

       1  09/24/10  gsellers   Initial revision