cl_intel_subgroups

Name Strings

cl_intel_subgroups_long

Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

Contributors

Ben Ashbaugh, Intel
Konrad Trifunovic, Intel

Notice

Status

Final Draft

Version

Built On: 2026-01-09
Revision: 1

Dependencies

Support for cl_intel_subgroups is required.

This extension is written against the OpenCL API Specification Version 2.2 (revision v2.2-7), against the OpenCL C Language Specification Version 2.0 (revision v2.2-7), and against version 4 of the cl_intel_subgroups specification.

This extension requires OpenCL support for SPIR-V, either via OpenCL 2.1 or via the cl_khr_il_program extension.

This extension interacts with the cl_intel_spirv_subgroups extension.

Overview

The goal of this extension is to allow programmers to improve the performance of applications operating on 64-bit data types by extending the sub-group functions described in the cl_intel_subgroups extension to support 64-bit integer data types (longs and ulongs). Specifically, the extension:

Extends the Intel sub-group block read and write functions to allow reading and writing 64-bit integer data from images and buffers.

Note that cl_intel_subgroups and cl_khr_subgroups already support broadcasts, scans, and reductions for 64-bit integer types, and that cl_intel_subgroups already supports shuffles for 64-bit integer types.

New API Functions

None.

New API Enums

None.

New OpenCL C Functions

Add ulong variants of the sub-group block read and write functions:

ulong   intel_sub_group_block_read_ul( const __global ulong* p )
ulong2  intel_sub_group_block_read_ul2( const __global ulong* p )
ulong4  intel_sub_group_block_read_ul4( const __global ulong* p )
ulong8  intel_sub_group_block_read_ul8( const __global ulong* p )
ulong   intel_sub_group_block_read_ul( image2d_t image, int2 byte_coord )
ulong2  intel_sub_group_block_read_ul2( image2d_t image, int2 byte_coord )
ulong4  intel_sub_group_block_read_ul4( image2d_t image, int2 byte_coord )
ulong8  intel_sub_group_block_read_ul8( image2d_t image, int2 byte_coord )

void  intel_sub_group_block_write_ul( __global ulong* p, ulong data )
void  intel_sub_group_block_write_ul2( __global ulong* p, ulong2 data )
void  intel_sub_group_block_write_ul4( __global ulong* p, ulong4 data )
void  intel_sub_group_block_write_ul8( __global ulong* p, ulong8 data )
void  intel_sub_group_block_write_ul( image2d_t image, int2 byte_coord, ulong data )
void  intel_sub_group_block_write_ul2( image2d_t image, int2 byte_coord, ulong2 data )
void  intel_sub_group_block_write_ul4( image2d_t image, int2 byte_coord, ulong4 data )
void  intel_sub_group_block_write_ul8( image2d_t image, int2 byte_coord, ulong8 data )

For naming consistency, also add suffixed aliases of the uint sub-group block read and write functions described in the cl_intel_subgroups extension:

uint  intel_sub_group_block_read_ui( const __global uint* p )
uint2 intel_sub_group_block_read_ui2( const __global uint* p )
uint4 intel_sub_group_block_read_ui4( const __global uint* p )
uint8 intel_sub_group_block_read_ui8( const __global uint* p )
uint  intel_sub_group_block_read_ui( image2d_t image, int2 byte_coord )
uint2 intel_sub_group_block_read_ui2( image2d_t image, int2 byte_coord )
uint4 intel_sub_group_block_read_ui4( image2d_t image, int2 byte_coord )
uint8 intel_sub_group_block_read_ui8( image2d_t image, int2 byte_coord )

void  intel_sub_group_block_write_ui( __global uint* p, uint data )
void  intel_sub_group_block_write_ui2( __global uint* p, uint2 data )
void  intel_sub_group_block_write_ui4( __global uint* p, uint4 data )
void  intel_sub_group_block_write_ui8( __global uint* p, uint8 data )
void  intel_sub_group_block_write_ui( image2d_t image, int2 byte_coord, uint data )
void  intel_sub_group_block_write_ui2( image2d_t image, int2 byte_coord, uint2 data )
void  intel_sub_group_block_write_ui4( image2d_t image, int2 byte_coord, uint4 data )
void  intel_sub_group_block_write_ui8( image2d_t image, int2 byte_coord, uint8 data )

Modifications to the OpenCL C Specification

Modifications to Section 6.13.X "Sub-group Read and Write Functions"

This section was added by the cl_intel_subgroups extension.

Add suffixed aliases of the previously un-suffixed 32-bit block read and write functions. There is no change to the description or behavior of these functions:

Function Description

Function	Description
`uint intel_sub_group_block_read( const __global uint* p ) uint2 intel_sub_group_block_read2( const __global uint* p ) uint4 intel_sub_group_block_read4( const __global uint* p ) uint8 intel_sub_group_block_read8( const __global uint* p ) uint intel_sub_group_block_read_ui( const __global uint* p ) uint2 intel_sub_group_block_read_ui2( const __global uint* p ) uint4 intel_sub_group_block_read_ui4( const __global uint* p ) uint8 intel_sub_group_block_read_ui8( const __global uint* p )`	Reads 1, 2, 4, or 8 uints of data for each work item in the sub-group from the specified pointer as a block operation…
uint intel_sub_group_block_read( image2d_t image, int2 byte_coord ) uint2 intel_sub_group_block_read2( image2d_t image, int2 byte_coord ) uint4 intel_sub_group_block_read4( image2d_t image, int2 byte_coord ) uint8 intel_sub_group_block_read8( image2d_t image, int2 byte_coord ) uint intel_sub_group_block_read_ui( image2d_t image, int2 byte_coord ) uint2 intel_sub_group_block_read_ui2( image2d_t image, int2 byte_coord ) uint4 intel_sub_group_block_read_ui4( image2d_t image, int2 byte_coord ) uint8 intel_sub_group_block_read_ui8( image2d_t image, int2 byte_coord )	Reads 1, 2, 4, or 8 uints of data for each work item in the sub-group from the specified image at the specified coordinate as a block operation…
void intel_sub_group_block_write( __global uint* p, uint data ) void intel_sub_group_block_write2( __global uint* p, uint2 data ) void intel_sub_group_block_write4( __global uint* p, uint4 data ) void intel_sub_group_block_write8( __global uint* p, uint8 data ) void intel_sub_group_block_write_ui( __global uint* p, uint data ) void intel_sub_group_block_write_ui2( __global uint* p, uint2 data ) void intel_sub_group_block_write_ui4( __global uint* p, uint4 data ) void intel_sub_group_block_write_ui8( __global uint* p, uint8 data )	Writes 1, 2, 4, or 8 uints of data for each work item in the sub-group to the specified pointer as a block operation…
void intel_sub_group_block_write( image2d_t image, int2 byte_coord, uint data ) void intel_sub_group_block_write2( image2d_t image, int2 byte_coord, uint2 data ) void intel_sub_group_block_write4( image2d_t image, int2 byte_coord, uint4 data ) void intel_sub_group_block_write8( image2d_t image, int2 byte_coord, uint8 data ) void intel_sub_group_block_write_ui( image2d_t image, int2 byte_coord, uint data ) void intel_sub_group_block_write_ui2( image2d_t image, int2 byte_coord, uint2 data ) void intel_sub_group_block_write_ui4( image2d_t image, int2 byte_coord, uint4 data ) void intel_sub_group_block_write_ui8( image2d_t image, int2 byte_coord, uint8 data )	Writes 1, 2, 4, or 8 uints of data for each work item in the sub-group to the specified image at the specified coordinate as a block operation…

uint  intel_sub_group_block_read(
        const __global uint* p )
uint2 intel_sub_group_block_read2(
        const __global uint* p )
uint4 intel_sub_group_block_read4(
        const __global uint* p )
uint8 intel_sub_group_block_read8(
        const __global uint* p )

uint  intel_sub_group_block_read_ui(
        const __global uint* p )
uint2 intel_sub_group_block_read_ui2(
        const __global uint* p )
uint4 intel_sub_group_block_read_ui4(
        const __global uint* p )
uint8 intel_sub_group_block_read_ui8(
        const __global uint* p )

Reads 1, 2, 4, or 8 uints of data for each work item in the sub-group from the specified pointer as a block operation…

uint  intel_sub_group_block_read(
        image2d_t image,
        int2 byte_coord )
uint2 intel_sub_group_block_read2(
        image2d_t image,
        int2 byte_coord )
uint4 intel_sub_group_block_read4(
        image2d_t image,
        int2 byte_coord )
uint8 intel_sub_group_block_read8(
        image2d_t image,
        int2 byte_coord )

uint  intel_sub_group_block_read_ui(
        image2d_t image,
        int2 byte_coord )
uint2 intel_sub_group_block_read_ui2(
        image2d_t image,
        int2 byte_coord )
uint4 intel_sub_group_block_read_ui4(
        image2d_t image,
        int2 byte_coord )
uint8 intel_sub_group_block_read_ui8(
        image2d_t image,
        int2 byte_coord )

Reads 1, 2, 4, or 8 uints of data for each work item in the sub-group from the specified image at the specified coordinate as a block operation…

void  intel_sub_group_block_write(
        __global uint* p, uint data )
void  intel_sub_group_block_write2(
        __global uint* p, uint2 data )
void  intel_sub_group_block_write4(
        __global uint* p, uint4 data )
void  intel_sub_group_block_write8(
        __global uint* p, uint8 data )

void  intel_sub_group_block_write_ui(
        __global uint* p, uint data )
void  intel_sub_group_block_write_ui2(
        __global uint* p, uint2 data )
void  intel_sub_group_block_write_ui4(
        __global uint* p, uint4 data )
void  intel_sub_group_block_write_ui8(
        __global uint* p, uint8 data )

Writes 1, 2, 4, or 8 uints of data for each work item in the sub-group to the specified pointer as a block operation…

void  intel_sub_group_block_write(
        image2d_t image,
        int2 byte_coord, uint data )
void  intel_sub_group_block_write2(
        image2d_t image,
        int2 byte_coord, uint2 data )
void  intel_sub_group_block_write4(
        image2d_t image,
        int2 byte_coord, uint4 data )
void  intel_sub_group_block_write8(
        image2d_t image,
        int2 byte_coord, uint8 data )

void  intel_sub_group_block_write_ui(
        image2d_t image,
        int2 byte_coord, uint data )
void  intel_sub_group_block_write_ui2(
        image2d_t image,
        int2 byte_coord, uint2 data )
void  intel_sub_group_block_write_ui4(
        image2d_t image,
        int2 byte_coord, uint4 data )
void  intel_sub_group_block_write_ui8(
        image2d_t image,
        int2 byte_coord, uint8 data )

Writes 1, 2, 4, or 8 uints of data for each work item in the sub-group to the specified image at the specified coordinate as a block operation…

Also, add ulong variants of the block read and write functions. In the descriptions of these functions, the "note below describing out-of-bounds behavior" is in the cl_intel_subgroups extension specification:

Function Description

Function	Description
`ulong intel_sub_group_block_read_ul( const __global ulong* p ) ulong2 intel_sub_group_block_read_ul2( const __global ulong* p ) ulong4 intel_sub_group_block_read_ul4( const __global ulong* p ) ulong8 intel_sub_group_block_read_ul8( const __global ulong* p )`	Reads 1, 2, 4, or 8 ulongs of data for each work item in the sub-group from the specified pointer as a block operation. The data is read strided, so the first value read is: `p[ sub_group_local_id ]` and the second value read is: `p[ sub_group_local_id + max_sub_group_size ]` etc. There is no defined out-of-range behavior for these functions.
`ulong intel_sub_group_block_read_ul( image2d_t image, int2 byte_coord ) ulong2 intel_sub_group_block_read_ul2( image2d_t image, int2 byte_coord ) ulong4 intel_sub_group_block_read_ul4( image2d_t image, int2 byte_coord ) ulong8 intel_sub_group_block_read_ul8( image2d_t image, int2 byte_coord )`	Reads 1, 2, 4, or 8 ulongs of data for each work item in the sub-group from the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Also note that the image data is read without format conversion, so each work item may read multiple image elements (for images with element size smaller than 64-bits). The data is read row-by-row, so the first value read is from the row specified in the y-component of the provided byte_coord, the second value is read from the y-component of the provided byte_coord plus one, etc. Please see the note below describing out-of-bounds behavior for these functions.
`void intel_sub_group_block_write_ul( __global ulong* p, ulong data ) void intel_sub_group_block_write_ul2( __global ulong* p, ulong2 data ) void intel_sub_group_block_write_ul4( __global ulong* p, ulong4 data ) void intel_sub_group_block_write_ul8( __global ulong* p, ulong8 data )`	Writes 1, 2, 4, 8, or 16 ulongs of data for each work item in the sub-group to the specified pointer as a block operation. The data is written strided, so the first value is written to: `p[ sub_group_local_id ]` and the second value is written to: `p[ sub_group_local_id + max_sub_group_size ]` etc. p must be aligned to a 128-bit (16-byte) boundary. There is no defined out-of-range behavior for these functions.
`void intel_sub_group_block_write_ul( image2d_t image, int2 byte_coord, ulong data ) void intel_sub_group_block_write_ul2( image2d_t image, int2 byte_coord, ulong2 data ) void intel_sub_group_block_write_ul4( image2d_t image, int2 byte_coord, ulong4 data ) void intel_sub_group_block_write_ul8( image2d_t image, int2 byte_coord, ulong8 data )`	Writes 1, 2, 4, or 8 ulongs of data for each work item in the sub-group to the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Unlike the image block read function, which may read from any arbitrary byte offset, the x-component of the byte coordinate for the image block write functions must be a multiple of four; in other words, the write must begin at 32-bit boundary. There is no restriction on the y-component of the coordinate. Also, note that the image data is written without format conversion, so each work item may write multiple image elements (for images with element size smaller than 64-bits). The data is written row-by-row, so the first value written is from the row specified by the y-component of the provided byte_coord, the second value is written from the y-component of the provided byte_coord plus one, etc. Please see the note below describing out-of-bounds behavior for these functions.

ulong   intel_sub_group_block_read_ul(
          const __global ulong* p )
ulong2  intel_sub_group_block_read_ul2(
          const __global ulong* p )
ulong4  intel_sub_group_block_read_ul4(
          const __global ulong* p )
ulong8  intel_sub_group_block_read_ul8(
          const __global ulong* p )

Reads 1, 2, 4, or 8 ulongs of data for each work item in the sub-group from the specified pointer as a block operation. The data is read strided, so the first value read is:

p[ sub_group_local_id ]

and the second value read is:

p[ sub_group_local_id + max_sub_group_size ]

etc.

There is no defined out-of-range behavior for these functions.

ulong   intel_sub_group_block_read_ul(
          image2d_t image,
          int2 byte_coord )
ulong2  intel_sub_group_block_read_ul2(
          image2d_t image,
          int2 byte_coord )
ulong4  intel_sub_group_block_read_ul4(
          image2d_t image,
          int2 byte_coord )
ulong8  intel_sub_group_block_read_ul8(
          image2d_t image,
          int2 byte_coord )

Reads 1, 2, 4, or 8 ulongs of data for each work item in the sub-group from the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Also note that the image data is read without format conversion, so each work item may read multiple image elements (for images with element size smaller than 64-bits).

The data is read row-by-row, so the first value read is from the row specified in the y-component of the provided byte_coord, the second value is read from the y-component of the provided byte_coord plus one, etc.

Please see the note below describing out-of-bounds behavior for these functions.

void  intel_sub_group_block_write_ul(
        __global ulong* p, ulong data )
void  intel_sub_group_block_write_ul2(
        __global ulong* p, ulong2 data )
void  intel_sub_group_block_write_ul4(
        __global ulong* p, ulong4 data )
void  intel_sub_group_block_write_ul8(
        __global ulong* p, ulong8 data )

Writes 1, 2, 4, 8, or 16 ulongs of data for each work item in the sub-group to the specified pointer as a block operation. The data is written strided, so the first value is written to:

p[ sub_group_local_id ]

and the second value is written to:

p[ sub_group_local_id + max_sub_group_size ]

etc.

p must be aligned to a 128-bit (16-byte) boundary.

There is no defined out-of-range behavior for these functions.

void  intel_sub_group_block_write_ul(
        image2d_t image,
        int2 byte_coord, ulong data )
void  intel_sub_group_block_write_ul2(
        image2d_t image,
        int2 byte_coord, ulong2 data )
void  intel_sub_group_block_write_ul4(
        image2d_t image,
        int2 byte_coord, ulong4 data )
void  intel_sub_group_block_write_ul8(
        image2d_t image,
        int2 byte_coord, ulong8 data )

Writes 1, 2, 4, or 8 ulongs of data for each work item in the sub-group to the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Unlike the image block read function, which may read from any arbitrary byte offset, the x-component of the byte coordinate for the image block write functions must be a multiple of four; in other words, the write must begin at 32-bit boundary. There is no restriction on the y-component of the coordinate. Also, note that the image data is written without format conversion, so each work item may write multiple image elements (for images with element size smaller than 64-bits).

The data is written row-by-row, so the first value written is from the row specified by the y-component of the provided byte_coord, the second value is written from the y-component of the provided byte_coord plus one, etc.

Please see the note below describing out-of-bounds behavior for these functions.

Modifications to the OpenCL SPIR-V Environment Specification

The section numbers below refer to sections added by the cl_intel_spirv_subgroups extension.

Note that the restrictions described in Section 7.1.X.3 - Notes and Restrictions in the cl_intel_spirv_subgroups extension are unchanged and continue to apply for this extension.

Add to Section 7.1.X.2 - Block IO Instructions

Add to the description of supported types in this section:

Additionally, if the OpenCL environment supports the extension cl_intel_subgroups_long:

Scalars and OpTypeVectors with 2, 4, or 8 Component Count components of the following Component Type types:
- OpTypeInt with a Width of 64 bits and Signedness of 0 (equivalent to long and ulong)

Issues

None.

Revision History

Rev	Date	Author	Changes
1	2020-03-13	Ben Ashbaugh	First public revision.

Rev

Date

Author

Changes

2020-03-13

Ben Ashbaugh

First public revision.