Vulkan Logo

9. Shaders

A shader specifies programmable operations that execute for each vertex, control point, tessellated vertex, primitive, fragment, or workgroup in the corresponding stage(s) of the graphics and compute pipelines.

Graphics pipelines include vertex shader execution as a result of primitive assembly, followed, if enabled, by tessellation control and evaluation shaders operating on patches, geometry shaders, if enabled, operating on primitives, and fragment shaders, if present, operating on fragments generated by Rasterization. In this specification, vertex, tessellation control, tessellation evaluation and geometry shaders are collectively referred to as pre-rasterization shader stages and occur in the logical pipeline before rasterization. The fragment shader occurs logically after rasterization.

Only the compute shader stage is included in a compute pipeline. Compute shaders operate on compute invocations in a workgroup.

Shaders can read from input variables, and read from and write to output variables. Input and output variables can be used to transfer data between shader stages, or to allow the shader to interact with values that exist in the execution environment. Similarly, the execution environment provides constants describing capabilities.

Shader variables are associated with execution environment-provided inputs and outputs using built-in decorations in the shader. The available decorations for each stage are documented in the following subsections.

9.1. Shader Modules

Shader modules contain shader code and one or more entry points. Shaders are selected from a shader module by specifying an entry point as part of pipeline creation. The stages of a pipeline can use shaders that come from different modules. The shader code defining a shader module must be in the SPIR-V format, as described by the Vulkan Environment for SPIR-V appendix.

Shader modules are represented by VkShaderModule handles:

// Provided by VK_VERSION_1_0
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkShaderModule)

To create a shader module, call:

// Provided by VK_VERSION_1_0
VkResult vkCreateShaderModule(
    VkDevice                                    device,
    const VkShaderModuleCreateInfo*             pCreateInfo,
    const VkAllocationCallbacks*                pAllocator,
    VkShaderModule*                             pShaderModule);
  • device is the logical device that creates the shader module.

  • pCreateInfo is a pointer to a VkShaderModuleCreateInfo structure.

  • pAllocator controls host memory allocation as described in the Memory Allocation chapter.

  • pShaderModule is a pointer to a VkShaderModule handle in which the resulting shader module object is returned.

Once a shader module has been created, any entry points it contains can be used in pipeline shader stages as described in Compute Pipelines and Graphics Pipelines.

Valid Usage
  • VUID-vkCreateShaderModule-pCreateInfo-06904
    If pCreateInfo is not NULL, pCreateInfo->pNext must be NULL

Valid Usage (Implicit)
  • VUID-vkCreateShaderModule-device-parameter
    device must be a valid VkDevice handle

  • VUID-vkCreateShaderModule-pCreateInfo-parameter
    pCreateInfo must be a valid pointer to a valid VkShaderModuleCreateInfo structure

  • VUID-vkCreateShaderModule-pAllocator-parameter
    If pAllocator is not NULL, pAllocator must be a valid pointer to a valid VkAllocationCallbacks structure

  • VUID-vkCreateShaderModule-pShaderModule-parameter
    pShaderModule must be a valid pointer to a VkShaderModule handle

Return Codes
Success
  • VK_SUCCESS

Failure
  • VK_ERROR_OUT_OF_HOST_MEMORY

  • VK_ERROR_OUT_OF_DEVICE_MEMORY

The VkShaderModuleCreateInfo structure is defined as:

// Provided by VK_VERSION_1_0
typedef struct VkShaderModuleCreateInfo {
    VkStructureType              sType;
    const void*                  pNext;
    VkShaderModuleCreateFlags    flags;
    size_t                       codeSize;
    const uint32_t*              pCode;
} VkShaderModuleCreateInfo;
  • sType is a VkStructureType value identifying this structure.

  • pNext is NULL or a pointer to a structure extending this structure.

  • flags is reserved for future use.

  • codeSize is the size, in bytes, of the code pointed to by pCode.

  • pCode is a pointer to code that is used to create the shader module. The type and format of the code is determined from the content of the memory addressed by pCode.

Valid Usage
  • VUID-VkShaderModuleCreateInfo-codeSize-08735
    codeSize must be a multiple of 4

  • VUID-VkShaderModuleCreateInfo-pCode-08736
    pCode must point to valid SPIR-V code, formatted and packed as described by the Khronos SPIR-V Specification

  • VUID-VkShaderModuleCreateInfo-pCode-08737
    pCode must adhere to the validation rules described by the Validation Rules within a Module section of the SPIR-V Environment appendix

  • VUID-VkShaderModuleCreateInfo-pCode-08738
    pCode must declare the Shader capability for SPIR-V code

  • VUID-VkShaderModuleCreateInfo-pCode-08739
    pCode must not declare any capability that is not supported by the API, as described by the Capabilities section of the SPIR-V Environment appendix

  • VUID-VkShaderModuleCreateInfo-pCode-08740
    and pCode declares any of the capabilities listed in the SPIR-V Environment appendix, one of the corresponding requirements must be satisfied

  • VUID-VkShaderModuleCreateInfo-pCode-08741
    pCode must not declare any SPIR-V extension that is not supported by the API, as described by the Extension section of the SPIR-V Environment appendix

  • VUID-VkShaderModuleCreateInfo-pCode-08742
    and pCode declares any of the SPIR-V extensions listed in the SPIR-V Environment appendix, one of the corresponding requirements must be satisfied

  • VUID-VkShaderModuleCreateInfo-codeSize-01085
    codeSize must be greater than 0

Valid Usage (Implicit)
  • VUID-VkShaderModuleCreateInfo-sType-sType
    sType must be VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO

  • VUID-VkShaderModuleCreateInfo-flags-zerobitmask
    flags must be 0

  • VUID-VkShaderModuleCreateInfo-pCode-parameter
    pCode must be a valid pointer to an array of uint32_t values

// Provided by VK_VERSION_1_0
typedef VkFlags VkShaderModuleCreateFlags;

VkShaderModuleCreateFlags is a bitmask type for setting a mask, but is currently reserved for future use.

To destroy a shader module, call:

// Provided by VK_VERSION_1_0
void vkDestroyShaderModule(
    VkDevice                                    device,
    VkShaderModule                              shaderModule,
    const VkAllocationCallbacks*                pAllocator);
  • device is the logical device that destroys the shader module.

  • shaderModule is the handle of the shader module to destroy.

  • pAllocator controls host memory allocation as described in the Memory Allocation chapter.

A shader module can be destroyed while pipelines created using its shaders are still in use.

Valid Usage
  • VUID-vkDestroyShaderModule-shaderModule-01092
    If VkAllocationCallbacks were provided when shaderModule was created, a compatible set of callbacks must be provided here

  • VUID-vkDestroyShaderModule-shaderModule-01093
    If no VkAllocationCallbacks were provided when shaderModule was created, pAllocator must be NULL

Valid Usage (Implicit)
  • VUID-vkDestroyShaderModule-device-parameter
    device must be a valid VkDevice handle

  • VUID-vkDestroyShaderModule-shaderModule-parameter
    If shaderModule is not VK_NULL_HANDLE, shaderModule must be a valid VkShaderModule handle

  • VUID-vkDestroyShaderModule-pAllocator-parameter
    If pAllocator is not NULL, pAllocator must be a valid pointer to a valid VkAllocationCallbacks structure

  • VUID-vkDestroyShaderModule-shaderModule-parent
    If shaderModule is a valid handle, it must have been created, allocated, or retrieved from device

Host Synchronization
  • Host access to shaderModule must be externally synchronized

9.2. Binding Shaders

Before a shader can be used it must be first bound to the command buffer.

Calling vkCmdBindPipeline binds all stages corresponding to the VkPipelineBindPoint.

The following table describes the relationship between shader stages and pipeline bind points:

Shader stage Pipeline bind point behavior controlled
  • VK_SHADER_STAGE_VERTEX_BIT

  • VK_SHADER_STAGE_TESSELLATION_CONTROL_BIT

  • VK_SHADER_STAGE_TESSELLATION_EVALUATION_BIT

  • VK_SHADER_STAGE_GEOMETRY_BIT

  • VK_SHADER_STAGE_FRAGMENT_BIT

VK_PIPELINE_BIND_POINT_GRAPHICS

all drawing commands

  • VK_SHADER_STAGE_COMPUTE_BIT

VK_PIPELINE_BIND_POINT_COMPUTE

all dispatch commands

9.3. Shader Execution

At each stage of the pipeline, multiple invocations of a shader may execute simultaneously. Further, invocations of a single shader produced as the result of different commands may execute simultaneously. The relative execution order of invocations of the same shader type is undefined. Shader invocations may complete in a different order than that in which the primitives they originated from were drawn or dispatched by the application. However, fragment shader outputs are written to attachments in rasterization order.

The relative execution order of invocations of different shader types is largely undefined. However, when invoking a shader whose inputs are generated from a previous pipeline stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate input values for all required inputs.

9.3.1. Shader Termination

A shader invocation that is terminated has finished executing instructions.

Executing OpReturn in the entry point, or executing OpTerminateInvocation in any function will terminate an invocation. Implementations may also terminate a shader invocation when OpKill is executed in any function; otherwise it becomes a helper invocation.

In addition to the above conditions, helper invocations are terminated when all non-helper invocations in the same derivative group either terminate or become helper invocations via OpKill.

A shader stage for a given command completes execution when all invocations for that stage have terminated.

9.4. Shader Memory Access Ordering

The order in which image or buffer memory is read or written by shaders is largely undefined. For some shader types (vertex, tessellation evaluation, and in some cases, fragment), even the number of shader invocations that may perform loads and stores is undefined.

In particular, the following rules apply:

  • Vertex and tessellation evaluation shaders will be invoked at least once for each unique vertex, as defined in those sections.

  • Fragment shaders will be invoked zero or more times, as defined in that section.

  • The relative execution order of invocations of the same shader type is undefined. A store issued by a shader when working on primitive B might complete prior to a store for primitive A, even if primitive A is specified prior to primitive B. This applies even to fragment shaders; while fragment shader outputs are always written to the framebuffer in rasterization order, stores executed by fragment shader invocations are not.

  • The relative execution order of invocations of different shader types is largely undefined.

Note

The above limitations on shader invocation order make some forms of synchronization between shader invocations within a single set of primitives unimplementable. For example, having one invocation poll memory written by another invocation assumes that the other invocation has been launched and will complete its writes in finite time.

The Memory Model appendix defines the terminology and rules for how to correctly communicate between shader invocations, such as when a write is Visible-To a read, and what constitutes a Data Race.

Applications must not cause a data race.

The SPIR-V SubgroupMemory, CrossWorkgroupMemory, and AtomicCounterMemory memory semantics are ignored. Sequentially consistent atomics and barriers are not supported and SequentiallyConsistent is treated as AcquireRelease. SequentiallyConsistent should not be used.

9.5. Shader Inputs and Outputs

Data is passed into and out of shaders using variables with input or output storage class, respectively. User-defined inputs and outputs are connected between stages by matching their Location decorations. Additionally, data can be provided by or communicated to special functions provided by the execution environment using BuiltIn decorations.

In many cases, the same BuiltIn decoration can be used in multiple shader stages with similar meaning. The specific behavior of variables decorated as BuiltIn is documented in the following sections.

9.6. Vertex Shaders

Each vertex shader invocation operates on one vertex and its associated vertex attribute data, and outputs one vertex and associated data. Graphics pipelines must include a vertex shader, and the vertex shader stage is always the first shader stage in the graphics pipeline.

9.6.1. Vertex Shader Execution

A vertex shader must be executed at least once for each vertex specified by a drawing command. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view. During execution, the shader is presented with the index of the vertex and instance for which it has been invoked. Input variables declared in the vertex shader are filled by the implementation with the values of vertex attributes associated with the invocation being executed.

If the same vertex is specified multiple times in a drawing command (e.g. by including the same index value multiple times in an index buffer) the implementation may reuse the results of vertex shading if it can statically determine that the vertex shader invocations will produce identical results.

Note

It is implementation-dependent when and if results of vertex shading are reused, and thus how many times the vertex shader will be executed. This is true also if the vertex shader contains stores or atomic operations (see vertexPipelineStoresAndAtomics).

9.7. Tessellation Control Shaders

The tessellation control shader is used to read an input patch provided by the application and to produce an output patch. Each tessellation control shader invocation operates on an input patch (after all control points in the patch are processed by a vertex shader) and its associated data, and outputs a single control point of the output patch and its associated data, and can also output additional per-patch data. The input patch is sized according to the patchControlPoints member of VkPipelineTessellationStateCreateInfo, as part of input assembly.

The size of the output patch is controlled by the OpExecutionMode OutputVertices specified in the tessellation control or tessellation evaluation shaders, which must be specified in at least one of the shaders. The size of the input and output patches must each be greater than zero and less than or equal to VkPhysicalDeviceLimits::maxTessellationPatchSize.

9.7.1. Tessellation Control Shader Execution

A tessellation control shader is invoked at least once for each output vertex in a patch. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.

Inputs to the tessellation control shader are generated by the vertex shader. Each invocation of the tessellation control shader can read the attributes of any incoming vertices and their associated data. The invocations corresponding to a given patch execute logically in parallel, with undefined relative execution order. However, the OpControlBarrier instruction can be used to provide limited control of the execution order by synchronizing invocations within a patch, effectively dividing tessellation control shader execution into a set of phases. Tessellation control shaders will read undefined values if one invocation reads a per-vertex or per-patch output written by another invocation at any point during the same phase, or if two invocations attempt to write different values to the same per-patch output in a single phase.

9.8. Tessellation Evaluation Shaders

The Tessellation Evaluation Shader operates on an input patch of control points and their associated data, and a single input barycentric coordinate indicating the invocation’s relative position within the subdivided patch, and outputs a single vertex and its associated data.

9.8.1. Tessellation Evaluation Shader Execution

A tessellation evaluation shader is invoked at least once for each unique vertex generated by the tessellator. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.

9.9. Geometry Shaders

The geometry shader operates on a group of vertices and their associated data assembled from a single input primitive, and emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.

9.9.1. Geometry Shader Execution

A geometry shader is invoked at least once for each primitive produced by the tessellation stages, or at least once for each primitive generated by primitive assembly when tessellation is not in use. A shader can request that the geometry shader runs multiple instances. A geometry shader is invoked at least once for each instance. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.

9.10. Fragment Shaders

Fragment shaders are invoked as a fragment operation in a graphics pipeline. Each fragment shader invocation operates on a single fragment and its associated data. With few exceptions, fragment shaders do not have access to any data associated with other fragments and are considered to execute in isolation of fragment shader invocations associated with other fragments.

9.11. Compute Shaders

Compute shaders are invoked via vkCmdDispatch and vkCmdDispatchIndirect commands. In general, they have access to similar resources as shader stages executing as part of a graphics pipeline.

Compute workloads are formed from groups of work items called workgroups and processed by the compute shader in the current compute pipeline. A workgroup is a collection of shader invocations that execute the same shader, potentially in parallel. Compute shaders execute in global workgroups which are divided into a number of local workgroups with a size that can be set by assigning a value to the LocalSize or LocalSizeId execution mode or via an object decorated by the WorkgroupSize decoration. An invocation within a local workgroup can share data with other members of the local workgroup through shared variables and issue memory and control flow barriers to synchronize with other members of the local workgroup.

9.12. Interpolation Decorations

Variables in the Input storage class in a fragment shader’s interface are interpolated from the values specified by the primitive being rasterized.

Note

Interpolation decorations can be present on input and output variables in pre-rasterization shaders but have no effect on the interpolation performed.

An undecorated input variable will be interpolated with perspective-correct interpolation according to the primitive type being rasterized. Lines and polygons are interpolated in the same way as the primitive’s clip coordinates. If the NoPerspective decoration is present, linear interpolation is instead used for lines and polygons. For points, as there is only a single vertex, input values are never interpolated and instead take the value written for the single vertex.

If the Flat decoration is present on an input variable, the value is not interpolated, and instead takes its value directly from the provoking vertex. Fragment shader inputs that are signed or unsigned integers, integer vectors, or any double-precision floating-point type must be decorated with Flat.

Interpolation of input variables is performed at an implementation-defined position within the fragment area being shaded. The position is further constrained as follows:

  • If the Centroid decoration is used, the interpolation position used for the variable must also fall within the bounds of the primitive being rasterized.

  • If the Sample decoration is used, the interpolation position used for the variable must be at the position of the sample being shaded by the current fragment shader invocation.

  • If a sample count of 1 is used, the interpolation position must be at the center of the fragment area.

Note

As Centroid restricts the possible interpolation position to the covered area of the primitive, the position can be forced to vary between neighboring fragments when it otherwise would not. Derivatives calculated based on these differing locations can produce inconsistent results compared to undecorated inputs. It is recommended that input variables used in derivative calculations are not decorated with Centroid.

9.13. Static Use

A SPIR-V module declares a global object in memory using the OpVariable instruction, which results in a pointer x to that object. A specific entry point in a SPIR-V module is said to statically use that object if that entry point’s call tree contains a function containing a instruction with x as an id operand. A shader entry point also statically uses any variables explicitly declared in its interface.

9.14. Scope

A scope describes a set of shader invocations, where each such set is a scope instance. Each invocation belongs to one or more scope instances, but belongs to no more than one scope instance for each scope.

The operations available between invocations in a given scope instance vary, with smaller scopes generally able to perform more operations, and with greater efficiency.

9.14.1. Cross Device

All invocations executed in a Vulkan instance fall into a single cross device scope instance.

Whilst the CrossDevice scope is defined in SPIR-V, it is disallowed in Vulkan. API synchronization commands can be used to communicate between devices.

9.14.2. Device

All invocations executed on a single device form a device scope instance.

If the vulkanMemoryModel and vulkanMemoryModelDeviceScope features are enabled, this scope is represented in SPIR-V by the Device Scope, which can be used as a Memory Scope for barrier and atomic operations.

There is no method to synchronize the execution of these invocations within SPIR-V, and this can only be done with API synchronization primitives.

Invocations executing on different devices in a device group operate in separate device scope instances.

9.14.3. Queue Family

Invocations executed by queues in a given queue family form a queue family scope instance.

This scope is identified in SPIR-V as the QueueFamily Scope if the vulkanMemoryModel feature is enabled, or if not, the Device Scope, which can be used as a Memory Scope for barrier and atomic operations.

There is no method to synchronize the execution of these invocations within SPIR-V, and this can only be done with API synchronization primitives.

Each invocation in a queue family scope instance must be in the same device scope instance.

9.14.4. Command

Any shader invocations executed as the result of a single command such as vkCmdDispatch or vkCmdDraw form a command scope instance. For indirect drawing commands with drawCount greater than one, invocations from separate draws are in separate command scope instances.

There is no specific Scope for communication across invocations in a command scope instance. As this has a clear boundary at the API level, coordination here can be performed in the API, rather than in SPIR-V.

Each invocation in a command scope instance must be in the same queue-family scope instance.

For shaders without defined workgroups, this set of invocations forms an invocation group as defined in the SPIR-V specification.

9.14.5. Primitive

Any fragment shader invocations executed as the result of rasterization of a single primitive form a primitive scope instance.

There is no specific Scope for communication across invocations in a primitive scope instance.

Any generated helper invocations are included in this scope instance.

Each invocation in a primitive scope instance must be in the same command scope instance.

Any input variables decorated with Flat are uniform within a primitive scope instance.

9.14.6. Workgroup

A local workgroup is a set of invocations that can synchronize and share data with each other using memory in the Workgroup storage class.

The Workgroup Scope can be used as both an Execution Scope and Memory Scope for barrier and atomic operations.

Each invocation in a local workgroup must be in the same command scope instance.

Only compute shaders have defined workgroups - other shader types cannot use workgroup functionality. For shaders that have defined workgroups, this set of invocations forms an invocation group as defined in the SPIR-V specification.

The amount of storage consumed by the variables declared with the Workgroup storage class is implementation-dependent. However, the amount of storage consumed may not exceed the largest block size that would be obtained if all active variables declared with Workgroup storage class were assigned offsets in an arbitrary order by successively taking the smallest valid offset according to the Standard Storage Buffer Layout rules, and with Boolean values considered as 32-bit integer values for the purpose of this calculation. (This is equivalent to using the GLSL std430 layout rules.)

9.14.7. Subgroup

A subgroup (see the subsection “Control Flow” of section 2 of the SPIR-V 1.3 Revision 1 specification) is a set of invocations that can synchronize and share data with each other efficiently.

The Subgroup Scope can be used as both an Execution Scope and Memory Scope for barrier and atomic operations. Other subgroup features allow the use of group operations with subgroup scope.

For shaders that have defined workgroups, each invocation in a subgroup must be in the same local workgroup.

In other shader stages, each invocation in a subgroup must be in the same device scope instance.

Only shader stages that support subgroup operations have defined subgroups.

Note

In shaders, there are two kinds of uniformity that are of primary interest to applications: uniform within an invocation group (a.k.a. dynamically uniform), and uniform within a subgroup scope.

While one could make the assumption that being uniform in invocation group implies being uniform in subgroup scope, it is not necessarily the case for shader stages without defined workgroups.

For shader stages with defined workgroups however, the relationship between invocation group and subgroup scope is well defined as a subgroup is a subset of the workgroup, and the workgroup is the invocation group. If a value is uniform in invocation group, it is by definition also uniform in subgroup scope. This is important if writing code like:

uniform texture2D Textures[];
uint dynamicallyUniformValue = gl_WorkGroupID.x;
vec4 value = texelFetch(Textures[dynamicallyUniformValue], coord, 0);

// subgroupUniformValue is guaranteed to be uniform within the subgroup.
// This value also happens to be dynamically uniform.
vec4 subgroupUniformValue = subgroupBroadcastFirst(dynamicallyUniformValue);

In shader stages without defined workgroups, this gets complicated. Due to scoping rules, there is no guarantee that a subgroup is a subset of the invocation group, which in turn defines the scope for dynamically uniform. In graphics, the invocation group is a single draw command, except for multi-draw situations, and indirect draws with drawCount > 1, where there are multiple invocation groups, one per DrawIndex.

// Assume SubgroupSize = 8, where 3 draws are packed together.
// Two subgroups were generated.
uniform texture2D Textures[];

// DrawIndex builtin is dynamically uniform
uint dynamicallyUniformValue = gl_DrawID;
//              | gl_DrawID = 0 | gl_DrawID = 1 | }
// Subgroup 0: { 0, 0, 0, 0,      1, 1, 1, 1 }
//              | DrawID = 2 | DrawID = 1 | }
// Subgroup 1: { 2, 2, 2, 2,      1, 1, 1, 1 }

uint notActuallyDynamicallyUniformAnymore =
    subgroupBroadcastFirst(dynamicallyUniformValue);
//              | gl_DrawID = 0 | gl_DrawID = 1 | }
// Subgroup 0: { 0, 0, 0, 0,      0, 0, 0, 0 }
//              | gl_DrawID = 2 | gl_DrawID = 1 | }
// Subgroup 1: { 2, 2, 2, 2,      2, 2, 2, 2 }

// Bug. gl_DrawID = 1's invocation group observes both index 0 and 2.
vec4 value = texelFetch(Textures[notActuallyDynamicallyUniformAnymore],
                        coord, 0);

Another problematic scenario is when a shader attempts to help the compiler notice that a value is uniform in subgroup scope to potentially improve performance.

layout(location = 0) flat in dynamicallyUniformIndex;
// Vertex shader might have emitted a value that depends only on gl_DrawID,
// making it dynamically uniform.
// Give knowledge to compiler that the flat input is dynamically uniform,
// as this is not a guarantee otherwise.

uint uniformIndex = subgroupBroadcastFirst(dynamicallyUniformIndex);
// Hazard: If different draw commands are packed into one subgroup, the uniformIndex is wrong.

DrawData d = UBO.perDrawData[uniformIndex];

For implementations where subgroups are packed across draws, the implementation must make sure to handle descriptor indexing correctly. From the specification’s point of view, a dynamically uniform index does not require NonUniform decoration, and such an implementation will likely either promote descriptor indexing into NonUniform on its own, or handle non-uniformity implicitly.

9.14.8. Quad

A quad scope instance is formed of four shader invocations.

In a fragment shader, each invocation in a quad scope instance is formed of invocations in neighboring framebuffer locations (xi, yi), where:

  • i is the index of the invocation within the scope instance.

  • w and h are the number of pixels the fragment covers in the x and y axes.

  • w and h are identical for all participating invocations.

  • (x0) = (x1 - w) = (x2) = (x3 - w)

  • (y0) = (y1) = (y2 - h) = (y3 - h)

  • Each invocation has the same layer and sample indices.

In all shaders, each invocation in a quad scope instance is formed of invocations in adjacent subgroup invocation indices (si), where:

  • i is the index of the invocation within the quad scope instance.

  • (s0) = (s1 - 1) = (s2 - 2) = (s3 - 3)

  • s0 is an integer multiple of 4.

Each invocation in a quad scope instance must be in the same subgroup.

In a fragment shader, each invocation in a quad scope instance must be in the same primitive scope instance.

Fragment and compute shaders have defined quad scope instances. If the quadOperationsInAllStages limit is supported, any shader stages that support subgroup operations also have defined quad scope instances.

9.14.9. Invocation

The smallest scope is a single invocation; this is represented by the Invocation Scope in SPIR-V.

Fragment shader invocations must be in a primitive scope instance.

All invocations in all stages must be in a command scope instance.

9.15. Group Operations

Group operations are executed by multiple invocations within a scope instance; with each invocation involved in calculating the result. This provides a mechanism for efficient communication between invocations in a particular scope instance.

Group operations all take a Scope defining the desired scope instance to operate within. Only the Subgroup scope can be used for these operations; the subgroupSupportedOperations limit defines which types of operation can be used.

9.15.1. Basic Group Operations

Basic group operations include the use of OpGroupNonUniformElect, OpControlBarrier, OpMemoryBarrier, and atomic operations.

OpGroupNonUniformElect can be used to choose a single invocation to perform a task for the whole group. Only the invocation with the lowest id in the group will return true.

The Memory Model appendix defines the operation of barriers and atomics.

9.15.2. Vote Group Operations

The vote group operations allow invocations within a group to compare values across a group. The types of votes enabled are:

  • Do all active group invocations agree that an expression is true?

  • Do any active group invocations evaluate an expression to true?

  • Do all active group invocations have the same value of an expression?

Note

These operations are useful in combination with control flow in that they allow for developers to check whether conditions match across the group and choose potentially faster code-paths in these cases.

9.15.3. Arithmetic Group Operations

The arithmetic group operations allow invocations to perform scans and reductions across a group. The operators supported are add, mul, min, max, and, or, xor.

For reductions, every invocation in a group will obtain the cumulative result of these operators applied to all values in the group. For exclusive scans, each invocation in a group will obtain the cumulative result of these operators applied to all values in invocations with a lower index in the group. Inclusive scans are identical to exclusive scans, except the cumulative result includes the operator applied to the value in the current invocation.

The order in which these operators are applied is implementation-dependent.

9.15.4. Ballot Group Operations

The ballot group operations allow invocations to perform more complex votes across the group. The ballot functionality allows all invocations within a group to provide a boolean value and get as a result what each invocation provided as their boolean value. The broadcast functionality allows values to be broadcast from an invocation to all other invocations within the group.

9.15.5. Shuffle Group Operations

The shuffle group operations allow invocations to read values from other invocations within a group.

9.15.6. Shuffle Relative Group Operations

The shuffle relative group operations allow invocations to read values from other invocations within the group relative to the current invocation in the group. The relative operations supported allow data to be shifted up and down through the invocations within a group.

9.15.7. Clustered Group Operations

The clustered group operations allow invocations to perform an operation among partitions of a group, such that the operation is only performed within the group invocations within a partition. The partitions for clustered group operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time. The operations supported are add, mul, min, max, and, or, xor.

9.15.8. Rotate Group Operations

The rotate group operations allow invocations to read values from other invocations within the group relative to the current invocation and modulo the size of the group. Clustered rotate group operations perform the same operation within individual partitions of a group.

The partitions for clustered rotate group operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time.

9.16. Quad Group Operations

Quad group operations (OpGroupNonUniformQuad*) are a specialized type of group operations that only operate on quad scope instances. Whilst these instructions do include a Scope parameter, this scope is always overridden; only the quad scope instance is included in its execution scope.

Fragment shaders that statically execute either OpGroupNonUniformQuadBroadcast or OpGroupNonUniformQuadSwap must launch sufficient invocations to ensure their correct operation; additional helper invocations are launched for framebuffer locations not covered by rasterized fragments if necessary.

The index used to select participating invocations is i, as described for a quad scope instance, defined as the quad index in the SPIR-V specification.

For OpGroupNonUniformQuadBroadcast this value is equal to Index. For OpGroupNonUniformQuadSwap, it is equal to the implicit Index used by each participating invocation.

9.17. Derivative Operations

Derivative operations calculate the partial derivative for an expression P as a function of an invocation’s x and y coordinates.

Derivative operations operate on a set of invocations known as a derivative group as defined in the SPIR-V specification.

A derivative group in a fragment shader is equivalent to the primitive scope instance.

Derivatives are calculated assuming that P is piecewise linear and continuous within the derivative group.

The following control-flow restrictions apply to derivative operations:

  • dynamic instances of explicit derivative instructions (OpDPdx*, OpDPdy*, and OpFwidth*) must be executed in control flow that is uniform within a derivative group.

  • dynamic instances of implicit derivative operations can be executed in control flow that is not uniform within the derivative group, but results are undefined.

Fragment shaders that statically execute derivative operations must launch sufficient invocations to ensure their correct operation; additional helper invocations are launched for framebuffer locations not covered by rasterized fragments if necessary.

Derivative operations calculate their results as the difference between the result of P across invocations in the quad. For fine derivative operations (OpDPdxFine and OpDPdyFine), the values of DPdx(Pi) are calculated as

DPdx(P0) = DPdx(P1) = P1 - P0

DPdx(P2) = DPdx(P3) = P3 - P2

and the values of DPdy(Pi) are calculated as

DPdy(P0) = DPdy(P2) = P2 - P0

DPdy(P1) = DPdy(P3) = P3 - P1

where i is the index of each invocation as described in Quad.

Coarse derivative operations (OpDPdxCoarse and OpDPdyCoarse), calculate their results in roughly the same manner, but may only calculate two values instead of four (one for each of DPdx and DPdy), reusing the same result no matter the originating invocation. If an implementation does this, it should use the fine derivative calculations described for P0.

Note

Derivative values are calculated between fragments rather than pixels. If the fragment shader invocations involved in the calculation cover multiple pixels, these operations cover a wider area, resulting in larger derivative values. This in turn will result in a coarser LOD being selected for image sampling operations using derivatives.

Applications may want to account for this when using multi-pixel fragments; if pixel derivatives are desired, applications should use explicit derivative operations and divide the results by the size of the fragment in each dimension as follows:

DPdx(Pn)' = DPdx(Pn) / w

DPdy(Pn)' = DPdy(Pn) / h

where w and h are the size of the fragments in the quad, and DPdx(Pn)' and DPdy(Pn)' are the pixel derivatives.

The results for OpDPdx and OpDPdy may be calculated as either fine or coarse derivatives, with implementations favouring the most efficient approach. Implementations must choose coarse or fine consistently between the two.

Executing OpFwidthFine, OpFwidthCoarse, or OpFwidth is equivalent to executing the corresponding OpDPdx* and OpDPdy* instructions, taking the absolute value of the results, and summing them.

Executing an OpImage*Sample*ImplicitLod instruction is equivalent to executing OpDPdx(Coordinate) and OpDPdy(Coordinate), and passing the results as the Grad operands dx and dy.

Note

It is expected that using the ImplicitLod variants of sampling functions will be substantially more efficient than using the ExplicitLod variants with explicitly generated derivatives.

9.18. Helper Invocations

When performing derivative or quad group operations in a fragment shader, additional invocations may be spawned in order to ensure correct results. These additional invocations are known as helper invocations and can be identified by a non-zero value in the HelperInvocation built-in. Stores and atomics performed by helper invocations must not have any effect on memory except for the Function, Private and Output storage classes, and values returned by atomic instructions in helper invocations are undefined.

Note

While storage to Output storage class has an effect even in helper invocations, it does not mean that helper invocations have an effect on the framebuffer. Output variables in fragment shaders can be read from as well, and they behave more like Private variables for the duration of the shader invocation.

Helper invocations may be considered inactive for group operations other than derivative and quad group operations. All invocations in a quad scope instance may become permanently inactive at any point once the only remaining invocations in that quad scope instance are helper invocations.