## Name Strings

cl_intel_subgroups

## Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

## Contributors

Ben Ashbaugh, Intel
Allen Hux, Intel
Pranayini Gudali, Intel
Dawid Dominiak, Intel
Biju George, Intel

## Status

Final Draft

## Version

Built On: 2019-10-23
Revision: 8

## Dependencies

OpenCL 1.2 is required. Some features (get_enqueued_num_sub_groups() and the sub_group_barrier() variant that accepts a memory scope) require OpenCL 2.0.

This extension is written against revision 24 of the OpenCL 2.0 API specification, against revision 24 of the OpenCL 2.0 OpenCL C specification, and against revision 24 of the OpenCL 2.0 extension specification.

## Overview

The goal of this extension is to allow programmers to improve the performance of their applications by taking advantage of the fact that some work items in a work group execute together as a group (a "subgroup"), and that work items in a subgroup can take advantage of hardware features that are not available to work items in a work group. Specifically, this extension is designed to allow work items in a subgroup to share data without the use of local memory and work group barriers, and to utilize specialized hardware to load and store blocks of data.

There is a large amount of overlap between the functionality in this extension and the functionality in the Khronos subgroups extension cl_khr_subgroups, so this extension reuses many of the names, concepts, and functions already described by the cl_khr_subgroups extension. The key differences between the Intel subgroups extension and the Khronos subgroups extension are:

• The Khronos subgroups extension requires OpenCL 2.0, but the Intel subgroups extension may be available on OpenCL 1.2 devices.

• The Khronos subgroups extension guarantees that subgroups in a work group will make independent forward progress, but the Intel extension does not guarantee that subgroups in a work group will make independent forward progress.

• The Intel extension adds a rich set of subgroup "shuffle" functions to allow work items within a subgroup to interchange data without the use of local memory and work group barriers.

• The Intel extension adds a set of subgroup "block read and write" functions to take advantage of specialized hardware to read or write blocks of data from or to buffers or images.

• The Intel subgroups extension does not include the subgroup pipes functions that are included as part of the Khronos subgroups extension.

• The Intel subgroups extension does not include the device-side kernel query functions for subgroups that are included as part of the Khronos subgroups extension.

## New API Functions

This function is copied unchanged from the Khronos subgroups extension:
cl_int clGetKernelSubGroupInfoKHR(
cl_kernel kernel,
cl_device_id device,
cl_kernel_sub_group_info param_name,
size_t input_value_size,
const void* input_value,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret)

## New API Enums

These enums are copied unchanged from the Khronos subgroups extension:

Accepted as the param_name parameter of clGetKernelSubGroupInfoKHR:

CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR    0x2033
CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR       0x2034

## New OpenCL C Functions

These built-in functions are copied unchanged from the Khronos subgroups extension:
uint    get_sub_group_size( void );
uint    get_max_sub_group_size( void );
uint    get_num_sub_groups( void );

uint    get_sub_group_id( void );
uint    get_sub_group_local_id( void );

void    sub_group_barrier( cl_mem_fence_flags flags );

int     sub_group_all( int predicate );
int     sub_group_any( int predicate );

If OpenCL 2.0 is supported:

uint    get_enqueued_num_sub_groups( void );
void    sub_group_barrier( cl_mem_fence_flags flags, memory_scope scope );

For the sub_group_broadcast functions, gentype is int, uint, long, ulong, or float.

If cl_khr_fp16 is supported, gentype also includes half.

If cl_khr_fp64 or doubles are supported, gentype also includes double.

gentype sub_group_broadcast( gentype x, uint sub_group_local_id );

For the sub_group_reduce, sub_group_scan_exclusive, and sub_group_scan_inclusive functions, gentype is int, uint, long, ulong, or float.

If cl_khr_fp16 is supported, gentype also includes half.

If cl_khr_fp64 or doubles are supported, gentype also includes double.

gentype sub_group_reduce_add( gentype x )
gentype sub_group_reduce_min( gentype x )
gentype sub_group_reduce_max( gentype x )

gentype sub_group_scan_exclusive_add( gentype x )
gentype sub_group_scan_exclusive_min( gentype x )
gentype sub_group_scan_exclusive_max( gentype x )

gentype sub_group_scan_inclusive_add( gentype x )
gentype sub_group_scan_inclusive_min( gentype x )
gentype sub_group_scan_inclusive_max( gentype x )

These built-in functions are unique to the Intel subgroups extension and are not part of the Khronos subgroups extension:

For the intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up, and intel_sub_group_shuffle_xor functions, gentype is float, float2, float3, float4, float8, float16, int, int2, int3, int4, int8, int16, uint, uint2, uint3, uint4, uint8, uint16, long, or ulong.

If cl_khr_fp16 is supported, gentype also includes half.

If cl_khr_fp64 or doubles are supported, gentype also includes double.

gentype intel_sub_group_shuffle( gentype data, uint c );
gentype intel_sub_group_shuffle_down(
gentype current, gentype next, uint delta );
gentype intel_sub_group_shuffle_up(
gentype previous, gentype current, uint delta );
gentype intel_sub_group_shuffle_xor( gentype data, uint value );

uint    intel_sub_group_block_read( const __global uint* p );
uint2   intel_sub_group_block_read2( const __global uint* p );
uint4   intel_sub_group_block_read4( const __global uint* p );
uint8   intel_sub_group_block_read8( const __global uint* p );

uint    intel_sub_group_block_read( image2d_t image, int2 byte_coord );
uint2   intel_sub_group_block_read2( image2d_t image, int2 byte_coord );
uint4   intel_sub_group_block_read4( image2d_t image, int2 byte_coord );
uint8   intel_sub_group_block_read8( image2d_t image, int2 byte_coord );

void    intel_sub_group_block_write( __global uint* p, uint data );
void    intel_sub_group_block_write2( __global uint* p, uint2 data );
void    intel_sub_group_block_write4( __global uint* p, uint4 data );
void    intel_sub_group_block_write8( __global uint* p, uint8 data );

void    intel_sub_group_block_write( image2d_t image, int2 byte_coord, uint data );
void    intel_sub_group_block_write2( image2d_t image, int2 byte_coord, uint2 data );
void    intel_sub_group_block_write4( image2d_t image, int2 byte_coord, uint4 data );
void    intel_sub_group_block_write8( image2d_t image, int2 byte_coord, uint8 data );

## Modifications to the OpenCL API Specification

### Modifications to Section 2 - "Glossary"

Add memory_scope_sub_group to the description of Memory Scopes:
Memory Scopes

Memory scopes define a hierarchy of visibilities when analyzing the ordering constraints of memory operations. They are defined by the values of the memory_scope enumeration constant. Current values are memory_scope_work_item (memory constraints only apply to a single work item and in practice only apply to image operations), memory_scope_sub_group (memory-ordering constraints only apply to work items executing in a subgroup), memory_scope_work_group …​

Add memory_scope_sub_group to the description of Scope inclusion:
Scope inclusion

Two actions A and B are defined to have an inclusive scope if they have the same scope P such that: (1) if P is memory_scope_sub_group, and A and B are executed by work items within the same subgroup, or (2) if P is memory_scope_work_group, and A and B are executed by work items within the same workgroup …​

Change the description for Subgroups to:
Subgroup

Subgroups are an implementation-dependent grouping of work items within a work group. The size and number of subgroups is implementation-defined and not exposed in the core OpenCL 2.0 feature set. Subgroups execute concurrently within a work group, but are not guaranteed to make independent forward progress. Subgroups may synchronize internally using subgroup barrier operations without synchronizing with other subgroups.

### Modifications to Section 3.2.1 - "Execution Model: Mapping Work Items Onto an NDRange"

Change the paragraph describing subgroups to:

An implementation of OpenCL may divide each work group into one or more subgroups. The size and number of subgroups is implementation-defined and not exposed in the core OpenCL 2.0 feature set.

### Modifications to Section 3.2.2 - "Execution Model: Execution of Kernel Instances"

Remove the last paragraph describing subgroups and independent forward progress.

### Additions to Section 3.2 - "Execution Model"

This text is largely the same as the text in the Khronos subgroups extension. Only the sentence about independent forward progress has been modified:

Within a work group, work items may be divided into subgroups in an implementation-defined fashion. The mapping of work items to subgroups is implementation-defined and may be queried at runtime. While subgroups may be used in multi-dimensional work groups, each subgroup is 1-dimensional and any given work item may query which subgroup it is a member of.

Work items are mapped into subgroups through a combination of compile-time decisions and the parameters of the dispatch. The mapping to subgroups is invariant for the duration of a kernel’s execution, across dispatches of a given kernel with the same launch parameters, and from one work group to another within the dispatch (excluding the trailing edge work groups in the presence of non-uniform work group sizes). In addition, all subgroups within a work group will be the same size, apart from the subgroup with the maximum index, which may be smaller if the size of the work group is not evenly divisible by the size of the subgroups.
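The sizing rule in the paragraph above can be sketched as a small host-side model in plain C. This is only an illustration: the helper names `model_num_sub_groups` and `model_sub_group_size` are hypothetical, and the maximum subgroup size is passed in as a parameter because the real value is implementation-defined and has to be queried.

```c
#include <stddef.h>

/* Hypothetical helper: number of subgroups in a work group when every
 * subgroup except possibly the last holds max_sub_group_size work items.
 * Returns 0 for invalid input. */
size_t model_num_sub_groups(size_t work_group_size, size_t max_sub_group_size)
{
    if (work_group_size == 0 || max_sub_group_size == 0)
        return 0;
    return (work_group_size + max_sub_group_size - 1) / max_sub_group_size;
}

/* Hypothetical helper: size of the subgroup with index sub_group_id.
 * Only the subgroup with the maximum index may be smaller.
 * Returns 0 if sub_group_id does not exist. */
size_t model_sub_group_size(size_t work_group_size,
                            size_t max_sub_group_size,
                            size_t sub_group_id)
{
    size_t n = model_num_sub_groups(work_group_size, max_sub_group_size);
    if (sub_group_id >= n)
        return 0;
    if (sub_group_id + 1 < n)
        return max_sub_group_size;
    return work_group_size - (n - 1) * max_sub_group_size;
}
```

For example, under this model a work group of 20 work items with a maximum subgroup size of 8 is split into three subgroups of 8, 8, and 4 work items.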

Subgroups execute concurrently within a given work group. Similar to work items within a work group, subgroups executing within a work group are not guaranteed to make independent forward progress. Work items in a subgroup can internally synchronize using subgroup barrier operations without synchronizing with other subgroups.

### Additions to Section 3.3.4 - "Memory Model: Memory Consistency Model"

Add memory_scope_sub_group to the bulleted descriptions of memory scopes:
• memory_scope_sub_group: memory-ordering constraints only apply to work items executing within a single subgroup.

• memory_scope_work_group: …​

In the paragraph after the bulleted descriptions of memory scopes, include memory_scope_sub_group as a valid memory scope for local memory:

... For local memory, memory_scope_sub_group and memory_scope_work_group are valid, and may constrain visibility to the subgroup or workgroup.

### Additions to Section 3.3.5 - "Memory Model: Overview of atomic and fence operations"

Add memory_scope_sub_group to the definition of inclusive scope:
• P is memory_scope_sub_group and A and B are executed by work items within the same subgroup.

• P is memory_scope_work_group …​

### Additions to Section 5.9.3 - "Kernel Object Queries"

This addition is copied unchanged from the Khronos subgroups extension:

The function

cl_int clGetKernelSubGroupInfoKHR(cl_kernel kernel,
cl_device_id device,
cl_kernel_sub_group_info param_name,
size_t input_value_size,
const void *input_value,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret)

returns information about the kernel object.

kernel specifies the kernel object being queried.

device identifies a specific device in the list of devices associated with kernel. The list of devices is the list of devices in the OpenCL context that is associated with kernel. If the list of devices associated with kernel is a single device, device can be a NULL value.

param_name specifies the information to query. The list of supported param_name types and the information returned in param_value by clGetKernelSubGroupInfoKHR is described in the table below.

input_value_size is used to specify the size in bytes of memory pointed to by input_value. This size must be equal to the size of input type as described in the table below.

input_value is a pointer to memory where the appropriate parameterization of the query is passed from. If input_value is NULL, it is ignored.

param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is NULL, it is ignored.

param_value_size is used to specify the size in bytes of memory pointed to by param_value. This size must be greater than or equal to the size of the return type as described in the table below.

param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is NULL, it is ignored.

Table 1. clGetKernelSubGroupInfoKHR parameter queries

| cl_kernel_sub_group_info | Input Type | Return Type | Info. returned in param_value |
|---|---|---|---|
| CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR | size_t * | size_t | Returns the maximum sub-group size for this kernel. All sub-groups must be the same size, while the last subgroup in any work-group (i.e. the subgroup with the maximum index) could be the same or smaller size. The input_value must be an array of size_t values corresponding to the local work size parameter of the intended dispatch. The number of dimensions in the ND-range will be inferred from the value specified for input_value_size. |
| CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR | size_t * | size_t | Returns the number of sub-groups that will be present in each work-group for a given local work size. All workgroups, apart from the last work-group in each dimension in the presence of non-uniform work-group sizes, will have the same number of sub-groups. The input_value must be an array of size_t values corresponding to the local work size parameter of the intended dispatch. The number of dimensions in the ND-range will be inferred from the value specified for input_value_size. |

clGetKernelSubGroupInfoKHR returns CL_SUCCESS if the function is executed successfully. Otherwise, it returns one of the following errors:

• CL_INVALID_DEVICE if device is not in the list of devices associated with kernel or if device is NULL but there is more than one device associated with kernel.

• CL_INVALID_VALUE if param_name is not valid, or if size in bytes specified by param_value_size is less than the size of return type as described in the table above and param_value is not NULL.

• CL_INVALID_VALUE if param_name is CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR and the size in bytes specified by input_value_size is not valid, or if input_value is NULL.

• CL_INVALID_KERNEL if kernel is not a valid kernel object.

• CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.

• CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.
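The rule that the number of ND-range dimensions is inferred from input_value_size can be sketched in plain C as follows. This is a hedged illustration, not the validation any particular implementation performs: the helper name `model_infer_dimensions` is hypothetical, and the upper limit of three dimensions is an assumption of the sketch (a real device reports its limit through CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS).

```c
#include <stddef.h>

/* Hypothetical helper: number of ND-range dimensions implied by
 * input_value_size, or 0 if the size would not be considered valid
 * in this sketch. */
unsigned model_infer_dimensions(size_t input_value_size)
{
    const size_t max_dims = 3; /* assumed limit, see lead-in */
    size_t dims;

    /* The input is an array of size_t, so the size must be a non-zero
     * multiple of sizeof(size_t). */
    if (input_value_size == 0 || input_value_size % sizeof(size_t) != 0)
        return 0;

    dims = input_value_size / sizeof(size_t);
    return dims <= max_dims ? (unsigned)dims : 0;
}
```

A caller passing a two-element local work size array would therefore pass 2 * sizeof(size_t) as input_value_size.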

## Modifications to the OpenCL C Specification

### Additions to section 6.13.1 - "Work Item Functions"

These additions are copied unchanged from the Khronos subgroups extension:
Function Description
uint get_sub_group_size( void )

Returns the number of work items in the subgroup. This value is no more than the maximum subgroup size and is implementation-defined based on a combination of the compiled kernel and the dispatch dimensions. This will be a constant value for the lifetime of the subgroup.

uint get_max_sub_group_size( void )

Returns the maximum size of a subgroup with the dispatch. This value will be invariant for a given set of dispatch dimensions and a kernel object compiled for a given device.

uint get_num_sub_groups( void )

Returns the number of subgroups that the current work group is divided into.

This number will be constant for the duration of a work group’s execution. If the kernel is executed with a non-uniform work group size in any dimension, calls to this built-in may return different values for some work groups than for other work groups.

uint get_sub_group_id( void )

Returns the subgroup ID, which is a number from zero to get_num_sub_groups() - 1.

uint get_sub_group_local_id( void )

Returns the unique work item ID within the current subgroup. The mapping from get_local_id to get_sub_group_local_id will be invariant for the lifetime of the work group.

If OpenCL 2.0 is supported:

Function Description
uint get_enqueued_num_sub_groups( void )

Returns the same value as that returned by get_num_sub_groups if the kernel is executed with a uniform work group size. This value will be constant for the entire NDRange.

If the kernel is executed with a non-uniform work group size, returns the number of subgroups in a work group that makes up the uniform region of the global NDRange.
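The difference between get_num_sub_groups and get_enqueued_num_sub_groups for a non-uniform work group size can be illustrated with a host-side model in plain C. The sketch assumes a 1-dimensional NDRange and takes the maximum subgroup size as a parameter; all function names are hypothetical.

```c
#include <stddef.h>

/* ceil(a / b); b must be greater than zero. */
static size_t model_ceil_div(size_t a, size_t b)
{
    return (a + b - 1) / b;
}

/* Hypothetical model of get_enqueued_num_sub_groups(): the subgroup
 * count of a full-size work group in the uniform region. */
size_t model_enqueued_num_sub_groups(size_t local_size,
                                     size_t max_sub_group_size)
{
    return model_ceil_div(local_size, max_sub_group_size);
}

/* Hypothetical model of get_num_sub_groups() for the work group with
 * index group_id. The trailing work group may hold fewer work items
 * when global_size is not a multiple of local_size.
 * Returns 0 if group_id is past the end of the NDRange. */
size_t model_num_sub_groups_in_group(size_t global_size,
                                     size_t local_size,
                                     size_t max_sub_group_size,
                                     size_t group_id)
{
    size_t first = group_id * local_size;
    size_t items;

    if (first >= global_size)
        return 0;
    items = global_size - first;
    if (items > local_size)
        items = local_size;
    return model_ceil_div(items, max_sub_group_size);
}
```

With a global size of 100, a local size of 32, and a maximum subgroup size of 8, the model gives four subgroups for each full work group and one subgroup for the trailing work group of four work items.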

### Additions to Section 6.13.8 - "Synchronization Functions"

These additions are mostly unchanged from the Khronos subgroups extension, with only minor edits for clarity:
Function Description
void sub_group_barrier(
cl_mem_fence_flags flags )

All work items in a subgroup executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the subgroup barrier. This function must be encountered by all work items in a subgroup executing the kernel. These rules apply to NDRanges implemented with uniform and non-uniform work groups.

If sub_group_barrier is inside a conditional statement then all work items within the subgroup must enter the conditional if any work item in the subgroup enters the conditional statement and executes the sub_group_barrier.

If sub_group_barrier is inside a loop, all work items within the subgroup must execute the sub_group_barrier for each iteration of the loop before any are allowed to continue execution beyond the sub_group_barrier.

The sub_group_barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations to local or global memory.

The flags argument specifies the memory address space and can be set to a combination of the following values:

CLK_LOCAL_MEM_FENCE - The sub_group_barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory.

CLK_GLOBAL_MEM_FENCE - The sub_group_barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work items, for example, write to buffer objects and then want to read the updated data from these buffer objects.

If OpenCL 2.0 is supported, add the following to the table above:

Function Description
void sub_group_barrier(
cl_mem_fence_flags flags,
memory_scope scope )

…​

The sub_group_barrier function also supports a variant that specifies the memory scope. For the sub_group_barrier variant that does not take a memory scope, the scope is memory_scope_sub_group.

The scope argument specifies whether the memory accesses of work items in the subgroup to memory address space(s) identified by flags become visible to all work items in the subgroup, the work group, the device, or all SVM devices.

…​

CLK_IMAGE_MEM_FENCE - The sub_group_barrier function will queue a memory fence to ensure correct ordering of memory operations to image objects. This can be useful when work items, for example, write to image objects and then want to read the updated data from these image objects.

### Additions to Section 6.13.11 - "Atomic Functions"

Modify the bullet describing behavior for functions that do not have a memory_scope argument to say:
• The subgroup functions that do not have a memory_scope argument have the same semantics as the corresponding functions with the memory_scope argument set to memory_scope_sub_group. Other functions that do not have a memory_scope argument have the same semantics as the corresponding functions with the memory_scope argument set to memory_scope_device.

The following addition is copied unchanged from the Khronos subgroups extension:
Add the following new value to the enumerated type memory_scope defined in Section 6.13.11.4:
memory_scope_sub_group

The memory_scope_sub_group specifies that the memory ordering constraints given by memory_order apply to work items in a subgroup. This memory scope can be used when performing atomic operations to global or local memory.

### Additions to Section 6.13.15 - "Work Group Functions"

These additions are copied from the Khronos subgroups extension:

The OpenCL C programming language implements the following built-in functions that operate on a subgroup level. These built-in functions must be encountered by all work items in a subgroup executing the kernel. We use the generic type name gentype to indicate the built-in data types int, uint, long, ulong, or float as the type for the arguments.

If cl_khr_fp16 is supported, gentype also includes half.

If cl_khr_fp64 or doubles are supported, gentype also includes double.

Function Description
int sub_group_all( int predicate )

Evaluates predicate for all work items in the subgroup and returns a non-zero value if predicate evaluates to non-zero for all work items in the subgroup.

int sub_group_any( int predicate )

Evaluates predicate for all work items in the subgroup and returns a non-zero value if predicate evaluates to non-zero for any work items in the subgroup.

gentype sub_group_broadcast(
gentype x,
uint sub_group_local_id )

Broadcasts the value of x for work item identified by sub_group_local_id (value returned by get_sub_group_local_id) to all work items in the subgroup. sub_group_local_id must be the same value for all work items in the subgroup.

gentype sub_group_reduce_add( gentype x )
gentype sub_group_reduce_min( gentype x )
gentype sub_group_reduce_max( gentype x )

Returns the result of the specified reduction operation for all values of x specified by work items in a subgroup.

gentype sub_group_scan_exclusive_add( gentype x )
gentype sub_group_scan_exclusive_min( gentype x )
gentype sub_group_scan_exclusive_max( gentype x )

Performs the specified exclusive scan operation of all values x specified by work items in a subgroup. The scan results are returned for each work item.

The scan order is defined by increasing subgroup local ID within the subgroup.

gentype sub_group_scan_inclusive_add( gentype x)
gentype sub_group_scan_inclusive_min( gentype x)
gentype sub_group_scan_inclusive_max( gentype x)

Performs the specified inclusive scan operation of all values x specified by work items in a subgroup. The scan results are returned for each work item.

The scan order is defined by increasing subgroup local ID within the subgroup.
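The relationship between the reduction and the two scan flavors can be sketched with a sequential reference model in plain C, shown here for the add operation on int. The function names are hypothetical, the array index stands in for the subgroup local ID, and the sketch assumes that the first work item of an exclusive add scan receives the identity value 0, as for the core work group scan functions.

```c
#include <stddef.h>

/* Model of sub_group_reduce_add: every work item receives this sum. */
int model_reduce_add(const int *x, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Model of sub_group_scan_inclusive_add:
 * out[i] = x[0] + ... + x[i]. */
void model_scan_inclusive_add(const int *x, int *out, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        running += x[i];
        out[i] = running;
    }
}

/* Model of sub_group_scan_exclusive_add:
 * out[i] = x[0] + ... + x[i-1], with out[0] = 0 (assumed identity). */
void model_scan_exclusive_add(const int *x, int *out, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = running;
        running += x[i];
    }
}
```

For inputs 3, 1, 4, 1 in subgroup local ID order, the model yields 9 for the reduction, 3, 4, 8, 9 for the inclusive scan, and 0, 3, 4, 8 for the exclusive scan.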

### Add a new Section 6.13.X - "Sub Group Shuffle Functions"

These are new functions:

The OpenCL C programming language implements the following built-in functions to allow data to be exchanged among work items in a subgroup. These built-in functions need not be encountered by all work items in a subgroup executing the kernel; however, data may only be shuffled among work items encountering the subgroup shuffle function. Shuffling data from a work item that does not encounter the subgroup shuffle function will produce undefined results. For these functions, gentype is float, float2, float3, float4, float8, float16, int, int2, int3, int4, int8, int16, uint, uint2, uint3, uint4, uint8, uint16, long, or ulong.

If cl_khr_fp16 is supported, gentype also includes half.

If cl_khr_fp64 or doubles are supported, gentype also includes double.

Function Description
gentype intel_sub_group_shuffle(
gentype data,
uint sub_group_local_id )

Allows data to be arbitrarily transferred between work items in a subgroup. The data that is returned for this work item is the value of data for the work item identified by sub_group_local_id.

sub_group_local_id need not be the same value for all work items in the subgroup. There is no defined behavior for out-of-range sub_group_local_ids.

gentype intel_sub_group_shuffle_down(
gentype current,
gentype next,
uint delta )

Allows data to be transferred from a work item in the subgroup with a higher sub_group_local_id down to a work item in the subgroup with a lower sub_group_local_id.

There are two data sources to this built-in function: current and next. To determine the result of this built-in function, first let the unsigned shuffle index be equivalent to the sum of this work item’s sub_group_local_id plus the specified delta:

If the shuffle index is less than the max_sub_group_size, the result of this built-in function is the value of the current data source for the work item with sub_group_local_id equal to the shuffle index.

If the shuffle index is greater than or equal to the max_sub_group_size but less than twice the max_sub_group_size, the result of this built-in function is the value of the next data source for the work item with sub_group_local_id equal to the shuffle index minus the max_sub_group_size.

All other values of the shuffle index are considered to be out-of-range. There is no defined behavior for out-of-range indices.

delta need not be the same value for all work items in the subgroup.
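The index selection for intel_sub_group_shuffle_down can be sketched as a host-side reference model in plain C, shown here for uint data. The name `model_shuffle_down` and the convention of returning 0 for an out-of-range index are assumptions of the sketch; the extension itself leaves out-of-range behavior undefined. The arrays `current` and `next` hold the per-work-item values of the two data sources, indexed by sub_group_local_id.

```c
/* Computes the result for work item `lane`. Returns 1 and stores the
 * value in *result, or returns 0 if the shuffle index is out of range. */
int model_shuffle_down(const unsigned *current,
                       const unsigned *next,
                       unsigned max_sub_group_size,
                       unsigned lane,
                       unsigned delta,
                       unsigned *result)
{
    /* Unsigned shuffle index: this work item's ID plus delta.
     * A wider type avoids wrap-around in the sum. */
    unsigned long long index = (unsigned long long)lane + delta;

    if (index < max_sub_group_size) {
        *result = current[index];
        return 1;
    }
    if (index < 2ull * max_sub_group_size) {
        *result = next[index - max_sub_group_size];
        return 1;
    }
    return 0; /* out of range: undefined in the extension */
}
```

With a subgroup of four work items and a delta of 1, work items 0 to 2 receive `current` from their higher neighbor, while work item 3 receives `next` from work item 0.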

gentype intel_sub_group_shuffle_up(
gentype previous,
gentype current,
uint delta )

Allows data to be transferred from a work item in the subgroup with a lower sub_group_local_id up to a work item in the subgroup with a higher sub_group_local_id.

There are two data sources to this built-in function: previous and current. To determine the result of this built-in function, first let the signed shuffle index be equivalent to this work item’s sub_group_local_id minus the specified delta:

If the shuffle index is greater than or equal to zero and less than the max_sub_group_size, the result of this built-in function is the value of the current data source for the work item with sub_group_local_id equal to the shuffle index.

If the shuffle index is less than zero but greater than or equal to the negative max_sub_group_size, the result of this built-in function is the value of the previous data source for the work item with sub_group_local_id equal to the shuffle index plus the max_sub_group_size.

All other values of the shuffle index are considered to be out-of-range. There is no defined behavior for out-of-range indices.

delta need not be the same value for all work items in the subgroup.
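The signed index selection for intel_sub_group_shuffle_up can be sketched the same way in plain C. As before, the name `model_shuffle_up` and the 0 return value for out-of-range indices are assumptions of the sketch, and the arrays hold the per-work-item values of `previous` and `current`, indexed by sub_group_local_id.

```c
/* Computes the result for work item `lane`. Returns 1 and stores the
 * value in *result, or returns 0 if the shuffle index is out of range. */
int model_shuffle_up(const unsigned *previous,
                     const unsigned *current,
                     unsigned max_sub_group_size,
                     unsigned lane,
                     unsigned delta,
                     unsigned *result)
{
    /* Signed shuffle index: this work item's ID minus delta. */
    long long index = (long long)lane - (long long)delta;
    long long size = (long long)max_sub_group_size;

    if (index >= 0 && index < size) {
        *result = current[index];
        return 1;
    }
    if (index < 0 && index >= -size) {
        *result = previous[index + size];
        return 1;
    }
    return 0; /* out of range: undefined in the extension */
}
```

With a subgroup of four work items and a delta of 1, work items 1 to 3 receive `current` from their lower neighbor, while work item 0 receives `previous` from work item 3.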

gentype intel_sub_group_shuffle_xor(
gentype data,
uint value )

Allows data to be transferred between work items in a subgroup as a function of the work item’s sub_group_local_id. The data that is returned for this work item is the value of data for the work item with sub_group_local_id equal to this work item’s sub_group_local_id XOR’d with the specified value. If the result of the XOR is greater than or equal to max_sub_group_size then it is considered out-of-range.

value need not be the same for all work items in the subgroup. There is no defined behavior for out-of-range indices.
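A common use of an XOR shuffle is a butterfly exchange, which can be sketched on the host in plain C. The names `model_shuffle_xor` and `model_butterfly_sum` are hypothetical. The sketch treats any index that does not name a work item (including an index equal to max_sub_group_size) as out of range, and the butterfly helper assumes a power-of-two subgroup size of at most 64.

```c
/* Model of intel_sub_group_shuffle_xor for work item `lane`.
 * Returns 1 and stores the value in *result, or 0 if out of range. */
int model_shuffle_xor(const unsigned *data,
                      unsigned max_sub_group_size,
                      unsigned lane,
                      unsigned value,
                      unsigned *result)
{
    unsigned index = lane ^ value;
    if (index >= max_sub_group_size)
        return 0; /* no work item has this ID */
    *result = data[index];
    return 1;
}

/* Butterfly all-reduce built on the XOR model. On success (return 1)
 * every entry of data holds the sum of the original entries. */
int model_butterfly_sum(unsigned *data, unsigned n)
{
    unsigned tmp[64];

    if (n == 0 || n > 64 || (n & (n - 1)) != 0)
        return 0; /* sketch only handles powers of two up to 64 */

    for (unsigned mask = n / 2; mask > 0; mask /= 2) {
        /* Each lane adds the value held by its XOR partner. */
        for (unsigned lane = 0; lane < n; ++lane) {
            unsigned other;
            if (!model_shuffle_xor(data, n, lane, mask, &other))
                return 0;
            tmp[lane] = data[lane] + other;
        }
        for (unsigned lane = 0; lane < n; ++lane)
            data[lane] = tmp[lane];
    }
    return 1;
}
```

The butterfly needs log2(n) exchange steps, and because every partner index stays below n for a power-of-two n, no step produces an out-of-range index.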

### Add a new Section 6.13.X - "Sub Group Read and Write Functions"

These are new functions:

The OpenCL C programming language implements the following built-in functions to allow data to be read or written as a block by all work items in a subgroup. These built-in functions must be encountered by all work items in a subgroup executing the kernel. Furthermore, since these are block operations, the pointer, image, and coordinate arguments to these built-in functions must be the same for all work items in the subgroup (when applicable, only the data argument may be different).

Function Description
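uint  intel_sub_group_block_read(
const __global uint* p )
uint2 intel_sub_group_block_read2(
const __global uint* p )
uint4 intel_sub_group_block_read4(
const __global uint* p )
uint8 intel_sub_group_block_read8(
const __global uint* p )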

Reads 1, 2, 4, or 8 uints of data for each work item in the subgroup from the specified pointer as a block operation. The data is read strided, so the first value read is:

p[ sub_group_local_id ]

and the second value read is:

p[ sub_group_local_id + max_sub_group_size ]

etc.

p must be aligned to a 32-bit (4-byte) boundary.

There is no defined out-of-range behavior for these functions.
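The strided access pattern described above can be sketched as a host-side reference model in plain C. The name `model_block_read` is hypothetical, `count` stands for the 1, 2, 4, or 8 variant, and the sketch performs no alignment or range checking.

```c
#include <stddef.h>

/* Fills out[0..count-1] with the values that the work item with
 * sub_group_local_id `lane` would receive from a block read at p. */
void model_block_read(const unsigned *p,
                      unsigned max_sub_group_size,
                      unsigned lane,
                      unsigned count,
                      unsigned *out)
{
    for (unsigned k = 0; k < count; ++k) {
        /* Component k is one full subgroup stride further along. */
        out[k] = p[lane + (size_t)k * max_sub_group_size];
    }
}
```

With a maximum subgroup size of 8, the two components returned to work item 3 by the read2 variant come from p[3] and p[11].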

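uint  intel_sub_group_block_read(
image2d_t image,
int2 byte_coord )
uint2 intel_sub_group_block_read2(
image2d_t image,
int2 byte_coord )
uint4 intel_sub_group_block_read4(
image2d_t image,
int2 byte_coord )
uint8 intel_sub_group_block_read8(
image2d_t image,
int2 byte_coord )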

Reads 1, 2, 4, or 8 uints of data for each work item in the subgroup from the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Also note that the image data is read without format conversion, so each work item may read multiple image elements (for images with element size smaller than 32 bits).

The data is read row-by-row, so the first value read is from the row specified in the y-component of the provided byte_coord, the second value is read from the y-component of the provided byte_coord plus one, etc.

Please see the note below describing out-of-bounds behavior for these functions.
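The row-by-row addressing can be sketched in plain C by treating the image as raw bytes with a row pitch. Two points are assumptions of this sketch and are not stated in the text above: that the work item with sub_group_local_id `lane` reads the uint starting `4 * lane` bytes after the x byte coordinate, and that the image can be modeled as a linear byte array. The name `model_image_block_read` is hypothetical, and no bounds checking is modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills out[0..count-1] with the raw 32-bit values that work item
 * `lane` would receive. Component k comes from row y_row + k. */
void model_image_block_read(const unsigned char *image_bytes,
                            size_t row_pitch,
                            int x_byte,
                            int y_row,
                            unsigned lane,
                            unsigned count,
                            uint32_t *out)
{
    for (unsigned k = 0; k < count; ++k) {
        const unsigned char *src = image_bytes
            + (size_t)(y_row + (int)k) * row_pitch /* next row per component */
            + (size_t)x_byte                       /* byte coordinate */
            + 4u * (size_t)lane;                   /* assumed per-lane offset */
        /* Copy the bytes without format conversion. */
        memcpy(&out[k], src, sizeof(uint32_t));
    }
}
```

Because the bytes are copied unconverted, an image with one-byte elements delivers four image elements per uint in this model.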

void  intel_sub_group_block_write(
__global uint* p, uint data )
void  intel_sub_group_block_write2(
__global uint* p, uint2 data )
void  intel_sub_group_block_write4(
__global uint* p, uint4 data )
void  intel_sub_group_block_write8(
__global uint* p, uint8 data )

Writes 1, 2, 4, or 8 uints of data for each work item in the subgroup to the specified pointer as a block operation. The data is written strided, so the first value is written to:

p[ sub_group_local_id ]

and the second value is written to:

p[ sub_group_local_id + max_sub_group_size ]

etc.

p must be aligned to a 128-bit (16-byte) boundary.

There is no defined out-of-range behavior for these functions.
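The strided store pattern described above can be sketched as a host-side reference model in plain C. The name `model_block_write` is hypothetical, `count` stands for the 1, 2, 4, or 8 variant, and the sketch performs no alignment or range checking.

```c
#include <stddef.h>

/* Stores data[0..count-1] where the work item with
 * sub_group_local_id `lane` would write them for a block write at p. */
void model_block_write(unsigned *p,
                       unsigned max_sub_group_size,
                       unsigned lane,
                       unsigned count,
                       const unsigned *data)
{
    for (unsigned k = 0; k < count; ++k) {
        /* Component k lands one full subgroup stride further along. */
        p[lane + (size_t)k * max_sub_group_size] = data[k];
    }
}
```

With a maximum subgroup size of 8, the two components written by work item 2 through the write2 variant land in p[2] and p[10].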

void  intel_sub_group_block_write(
image2d_t image,
int2 byte_coord, uint data )
void  intel_sub_group_block_write2(
image2d_t image,
int2 byte_coord, uint2 data )
void  intel_sub_group_block_write4(
image2d_t image,
int2 byte_coord, uint4 data )
void  intel_sub_group_block_write8(
image2d_t image,
int2 byte_coord, uint8 data )

Writes 1, 2, 4, or 8 uints of data for each work item in the subgroup to the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Unlike the image block read function, which may read from any arbitrary byte offset, the x-component of the byte coordinate for the image block write functions must be a multiple of four; in other words, the write must begin at a 32-bit boundary. There is no restriction on the y-component of the coordinate. Also, note that the image data is written without format conversion, so each work item may write multiple image elements (for images with element size smaller than 32 bits).

The data is written row-by-row, so the first value is written to the row specified by the y-component of the provided byte_coord, the second value is written to the row at the y-component of the provided byte_coord plus one, etc.

Please see the note below describing out-of-bounds behavior for these functions.

Note: The subgroup image block read and write built-ins do support bounds checking; however, these built-ins bounds-check to the image width in units of uints, not in units of image elements. This means:

• If the image has an element size equal to the size of a uint (four bytes, for example CL_RGBA + CL_UNORM_INT8), the image will be correctly bounds-checked. In this case, out-of-bounds reads will return the edge image element (the equivalent of CLK_ADDRESS_CLAMP_TO_EDGE), and out-of-bounds writes will be ignored.

• If the image has element size less than the size of a uint (such as CL_R + CL_UNSIGNED_INT8), the entire image is addressable; however, bounds checking will occur too late. For this reason, extra care should be taken to avoid out-of-bounds reads and writes, since out-of-bounds reads may return invalid data and out-of-bounds writes may corrupt other images or buffers unpredictably.

Add a new sub-section 6.13.X.1 - Restrictions:

The following restrictions apply to the subgroup buffer block read and write functions:

• The pointer p must be 32-bit (4-byte) aligned for reads, and must be 128-bit (16-byte) aligned for writes.

• If the pointer p is computed from a kernel argument that is a cl_mem that was created with CL_MEM_USE_HOST_PTR, then the host_ptr must be 32-bit (4-byte) aligned for reads, and must be 128-bit (16-byte) aligned for writes.

• If the pointer p is computed from a kernel argument that is a cl_mem that is a sub-buffer, then the origin defining the sub-buffer offset into the buffer must be a multiple of 4 bytes for reads, and must be a multiple of 16 bytes for writes, in addition to the CL_DEVICE_MEM_BASE_ADDR_ALIGN requirements. Additionally, if the buffer that the sub-buffer is created from was created with CL_MEM_USE_HOST_PTR, then the host_ptr for the buffer must be 32-bit (4-byte) aligned for reads, and must be 128-bit (16-byte) aligned for writes.

• If the pointer p is computed from an SVM pointer kernel argument, then the SVM pointer kernel argument must be 32-bit (4-byte) aligned for reads, and must be 128-bit (16-byte) aligned for writes.

• Behavior is undefined if the subgroup size is smaller than the maximum subgroup size; in other words, if this is a partial subgroup.
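The alignment restrictions for the buffer block functions lend themselves to simple host-side checks before a kernel is enqueued. The following plain C sketch uses hypothetical helper names and covers only the 4-byte and 16-byte rules listed above; the CL_DEVICE_MEM_BASE_ADDR_ALIGN requirement for sub-buffers is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if an address satisfies the block read alignment (4 bytes). */
int model_block_read_address_ok(uintptr_t address)
{
    return address % 4u == 0;
}

/* Nonzero if an address satisfies the block write alignment (16 bytes). */
int model_block_write_address_ok(uintptr_t address)
{
    return address % 16u == 0;
}

/* Nonzero if a sub-buffer origin (byte offset into the parent buffer)
 * satisfies the rule for reads, or for writes when used_for_writes
 * is nonzero. */
int model_sub_buffer_origin_ok(size_t origin, int used_for_writes)
{
    size_t required = used_for_writes ? 16u : 4u;
    return origin % required == 0;
}
```

A host_ptr passed with CL_MEM_USE_HOST_PTR could be checked by casting it to uintptr_t and calling the matching helper.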

The following restrictions apply to the subgroup image block read and write functions:

• The behavior of the subgroup image block read and write built-ins is undefined for images with an element size greater than four bytes (such as CL_RGBA + CL_FLOAT).

• When reading or writing a 2D image created from a buffer with the subgroup block read and write built-ins, the image row pitch is required to be a multiple of 64-bytes, in addition to the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirements.

• When reading or writing a 2D image created from a buffer with the subgroup block read and write built-ins, if the buffer is a cl_mem that was created with CL_MEM_USE_HOST_PTR, then the host_ptr must be 256-bit (32-byte) aligned.

• When reading or writing a 2D image created from a buffer with the subgroup block read and write built-ins, if the buffer is a cl_mem that is a sub-buffer, then the origin must be a multiple of 32-bytes. Additionally, if the buffer that the sub-buffer is created from was created with CL_MEM_USE_HOST_PTR, then the host_ptr for the buffer must be 256-bit (32-byte) aligned.

• Behavior is undefined if the subgroup size is smaller than the maximum subgroup size; in other words, if this is a partial subgroup.
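The image restrictions can be collected into a similar host-side check. The structure `model_image_desc` and the function `model_image_block_access_ok` are hypothetical names for this plain C sketch. It models the element size, row pitch, and host_ptr rules only; the sub-buffer origin rule, the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirement, and the partial subgroup rule are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an image used with the block built-ins. */
struct model_image_desc {
    size_t element_size;  /* bytes per image element */
    int from_buffer;      /* nonzero: 2D image created from a buffer */
    size_t row_pitch;     /* bytes; only checked when from_buffer */
    int uses_host_ptr;    /* nonzero: buffer used CL_MEM_USE_HOST_PTR */
    uintptr_t host_ptr;   /* address; only checked when uses_host_ptr */
};

/* Nonzero if the image passes the modeled restrictions. */
int model_image_block_access_ok(const struct model_image_desc *d)
{
    /* Element sizes above four bytes give undefined behavior. */
    if (d->element_size == 0 || d->element_size > 4)
        return 0;

    if (d->from_buffer) {
        /* Row pitch must be a multiple of 64 bytes. */
        if (d->row_pitch % 64u != 0)
            return 0;
        /* host_ptr must be 32-byte aligned. */
        if (d->uses_host_ptr && d->host_ptr % 32u != 0)
            return 0;
    }
    return 1;
}
```

For instance, a CL_RGBA + CL_FLOAT image has 16-byte elements and fails the first check regardless of how it was created.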

## Issues

None.

## Revision History

| Rev | Date | Author | Changes |
|---|---|---|---|
| 1 | 2014-12-01 | Ben Ashbaugh | First public revision. |
| 2 | 2015-03-12 | Ben Ashbaugh | Fixed minor formatting errors, added restriction for subgroup image block read and write built-ins with large image formats. |
| 3 | 2016-02-12 | Ben Ashbaugh | Fixed a small bug in the shuffle up and shuffle down descriptions. |
| 4 | 2016-08-28 | Ben Ashbaugh | |
| 5 | 2018-11-15 | Ben Ashbaugh | Converted to asciidoc. |
| 6 | 2018-12-02 | Ben Ashbaugh | Added back a section that was inadvertently removed during conversion to asciidoc. |
| 7 | 2019-01-15 | Ben Ashbaugh | Fixed a typo in the summary section of new built-in functions. |
| 8 | 2019-09-17 | Ben Ashbaugh | Added vec3 types for shuffles, restriction for block reads and writes and partial subgroups, and asciidoctor formatting fixes. |