Name Strings
cl_intel_subgroups_short
Contact
Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)
Contributors
Ben Ashbaugh, Intel
Felix J Degrood, Intel
Biju George, Intel
Raun M Krisch, Intel
Konstantin A Pyjov, Intel
Insoo Woo, Intel
Notice
Copyright (c) 2018-2019 Intel Corporation. All rights reserved.
Status
Final Draft
Version
Built On: 2019-10-23
Revision: 3
Dependencies
OpenCL 1.2 and support for cl_intel_subgroups are required.
This extension is written against the OpenCL API Specification Version 2.2 (revision v2.2-7), against the OpenCL C Language Specification Version 2.0 (revision v2.2-7), and against version 4 of the cl_intel_subgroups specification.
Overview
The goal of this extension is to allow programmers to improve the performance of applications operating on 16-bit data types by extending the subgroup functions described in the cl_intel_subgroups extension to support 16-bit integer data types (shorts and ushorts).
Specifically, the extension:
- Extends the subgroup broadcast function to allow 16-bit integer values to be broadcast from one work item to all other work items in the subgroup.
- Extends the subgroup scan and reduction functions to operate on 16-bit integer data types.
- Extends the Intel subgroup shuffle functions to allow arbitrarily exchanging 16-bit integer values among work items in the subgroup.
- Extends the Intel subgroup block read and write functions to allow reading and writing 16-bit integer data from images and buffers.
New API Functions
None.
New API Enums
None.
New OpenCL C Functions
- Add short and ushort to the list of supported data types for the subgroup broadcast, scan, and reduction functions:
    short   intel_sub_group_broadcast( short x, uint sub_group_local_id )
    ushort  intel_sub_group_broadcast( ushort x, uint sub_group_local_id )
    short   intel_sub_group_reduce_add( short x )
    ushort  intel_sub_group_reduce_add( ushort x )
    short   intel_sub_group_reduce_min( short x )
    ushort  intel_sub_group_reduce_min( ushort x )
    short   intel_sub_group_reduce_max( short x )
    ushort  intel_sub_group_reduce_max( ushort x )
    short   intel_sub_group_scan_exclusive_add( short x )
    ushort  intel_sub_group_scan_exclusive_add( ushort x )
    short   intel_sub_group_scan_exclusive_min( short x )
    ushort  intel_sub_group_scan_exclusive_min( ushort x )
    short   intel_sub_group_scan_exclusive_max( short x )
    ushort  intel_sub_group_scan_exclusive_max( ushort x )
    short   intel_sub_group_scan_inclusive_add( short x )
    ushort  intel_sub_group_scan_inclusive_add( ushort x )
    short   intel_sub_group_scan_inclusive_min( short x )
    ushort  intel_sub_group_scan_inclusive_min( ushort x )
    short   intel_sub_group_scan_inclusive_max( short x )
    ushort  intel_sub_group_scan_inclusive_max( ushort x )
- Add short, short2, short3, short4, short8, short16, ushort, ushort2, ushort3, ushort4, ushort8, and ushort16 to the list of gentype data types supported by the intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up, and intel_sub_group_shuffle_xor functions:
    gentype intel_sub_group_shuffle( gentype data, uint c )
    gentype intel_sub_group_shuffle_down( gentype current, gentype next, uint delta )
    gentype intel_sub_group_shuffle_up( gentype previous, gentype current, uint delta )
    gentype intel_sub_group_shuffle_xor( gentype data, uint value )
- Add ushort variants of the subgroup block read and write functions:
    ushort  intel_sub_group_block_read_us( const __global ushort* p )
    ushort2 intel_sub_group_block_read_us2( const __global ushort* p )
    ushort4 intel_sub_group_block_read_us4( const __global ushort* p )
    ushort8 intel_sub_group_block_read_us8( const __global ushort* p )
    ushort  intel_sub_group_block_read_us( image2d_t image, int2 byte_coord )
    ushort2 intel_sub_group_block_read_us2( image2d_t image, int2 byte_coord )
    ushort4 intel_sub_group_block_read_us4( image2d_t image, int2 byte_coord )
    ushort8 intel_sub_group_block_read_us8( image2d_t image, int2 byte_coord )
    void    intel_sub_group_block_write_us( __global ushort* p, ushort data )
    void    intel_sub_group_block_write_us2( __global ushort* p, ushort2 data )
    void    intel_sub_group_block_write_us4( __global ushort* p, ushort4 data )
    void    intel_sub_group_block_write_us8( __global ushort* p, ushort8 data )
    void    intel_sub_group_block_write_us( image2d_t image, int2 byte_coord, ushort data )
    void    intel_sub_group_block_write_us2( image2d_t image, int2 byte_coord, ushort2 data )
    void    intel_sub_group_block_write_us4( image2d_t image, int2 byte_coord, ushort4 data )
    void    intel_sub_group_block_write_us8( image2d_t image, int2 byte_coord, ushort8 data )
- For naming consistency, also add suffixed aliases of the uint subgroup block read and write functions described in the cl_intel_subgroups extension:
    uint    intel_sub_group_block_read_ui( const __global uint* p )
    uint2   intel_sub_group_block_read_ui2( const __global uint* p )
    uint4   intel_sub_group_block_read_ui4( const __global uint* p )
    uint8   intel_sub_group_block_read_ui8( const __global uint* p )
    uint    intel_sub_group_block_read_ui( image2d_t image, int2 byte_coord )
    uint2   intel_sub_group_block_read_ui2( image2d_t image, int2 byte_coord )
    uint4   intel_sub_group_block_read_ui4( image2d_t image, int2 byte_coord )
    uint8   intel_sub_group_block_read_ui8( image2d_t image, int2 byte_coord )
    void    intel_sub_group_block_write_ui( __global uint* p, uint data )
    void    intel_sub_group_block_write_ui2( __global uint* p, uint2 data )
    void    intel_sub_group_block_write_ui4( __global uint* p, uint4 data )
    void    intel_sub_group_block_write_ui8( __global uint* p, uint8 data )
    void    intel_sub_group_block_write_ui( image2d_t image, int2 byte_coord, uint data )
    void    intel_sub_group_block_write_ui2( image2d_t image, int2 byte_coord, uint2 data )
    void    intel_sub_group_block_write_ui4( image2d_t image, int2 byte_coord, uint4 data )
    void    intel_sub_group_block_write_ui8( image2d_t image, int2 byte_coord, uint8 data )
Modifications to the OpenCL C Specification
Additions to Section 6.13.15 - "Work Group Functions"
- Add short and ushort to the list of supported data types for the subgroup broadcast, scan, and reduction functions:
Function:
    gentype sub_group_broadcast( gentype x, uint sub_group_local_id )
    short   intel_sub_group_broadcast( short x, uint sub_group_local_id )
    ushort  intel_sub_group_broadcast( ushort x, uint sub_group_local_id )
Description:
Broadcasts the value of x for the work item identified by sub_group_local_id (value returned by get_sub_group_local_id) to all work items in the subgroup. sub_group_local_id must be the same value for all work items in the subgroup.
Function:
    gentype sub_group_reduce_add( gentype x )
    gentype sub_group_reduce_min( gentype x )
    gentype sub_group_reduce_max( gentype x )
    short   intel_sub_group_reduce_add( short x )
    ushort  intel_sub_group_reduce_add( ushort x )
    short   intel_sub_group_reduce_min( short x )
    ushort  intel_sub_group_reduce_min( ushort x )
    short   intel_sub_group_reduce_max( short x )
    ushort  intel_sub_group_reduce_max( ushort x )
Description:
Returns the result of the specified reduction operation for all values of x specified by work items in a subgroup.
Function:
    gentype sub_group_scan_exclusive_add( gentype x )
    gentype sub_group_scan_exclusive_min( gentype x )
    gentype sub_group_scan_exclusive_max( gentype x )
    short   intel_sub_group_scan_exclusive_add( short x )
    ushort  intel_sub_group_scan_exclusive_add( ushort x )
    short   intel_sub_group_scan_exclusive_min( short x )
    ushort  intel_sub_group_scan_exclusive_min( ushort x )
    short   intel_sub_group_scan_exclusive_max( short x )
    ushort  intel_sub_group_scan_exclusive_max( ushort x )
Description:
Performs the specified exclusive scan operation of all values x specified by work items in a subgroup. The scan results are returned for each work item.
The scan order is defined by increasing subgroup local ID within the subgroup.
Function:
    gentype sub_group_scan_inclusive_add( gentype x )
    gentype sub_group_scan_inclusive_min( gentype x )
    gentype sub_group_scan_inclusive_max( gentype x )
    short   intel_sub_group_scan_inclusive_add( short x )
    ushort  intel_sub_group_scan_inclusive_add( ushort x )
    short   intel_sub_group_scan_inclusive_min( short x )
    ushort  intel_sub_group_scan_inclusive_min( ushort x )
    short   intel_sub_group_scan_inclusive_max( short x )
    ushort  intel_sub_group_scan_inclusive_max( ushort x )
Description:
Performs the specified inclusive scan operation of all values x specified by work items in a subgroup. The scan results are returned for each work item.
The scan order is defined by increasing subgroup local ID within the subgroup.
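The reduction and scan descriptions above can be illustrated with a small host-side reference model in plain C. This is only a sketch, not device code: the `model_*` function names are hypothetical, element `i` of the array `x` stands for the value passed by the work item whose subgroup local ID is `i`, and the 16-bit wraparound on overflow is an assumption of the sketch rather than something stated in this extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model (hypothetical helper names). x[i] is the value that the
   work item with subgroup local ID i passes to the built-in. */

/* Models ushort intel_sub_group_reduce_add: one sum shared by all lanes.
   Wraparound to 16 bits is an assumption of this sketch. */
uint16_t model_reduce_add_us(const uint16_t *x, size_t n)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum = (uint16_t)(sum + x[i]);
    return sum;
}

/* Models ushort intel_sub_group_scan_inclusive_add: lane i receives
   x[0] + ... + x[i], scanning in increasing subgroup local ID order. */
void model_scan_inclusive_add_us(const uint16_t *x, uint16_t *out, size_t n)
{
    uint16_t running = 0;
    for (size_t i = 0; i < n; ++i) {
        running = (uint16_t)(running + x[i]);
        out[i] = running;
    }
}

/* Models ushort intel_sub_group_scan_exclusive_add: lane i receives
   x[0] + ... + x[i-1]; lane 0 receives the additive identity 0.
   out must not alias x. */
void model_scan_exclusive_add_us(const uint16_t *x, uint16_t *out, size_t n)
{
    uint16_t running = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = running;
        running = (uint16_t)(running + x[i]);
    }
}
```

For the input values 1, 2, 3, 4 in lanes 0 through 3, the model returns 10 for the reduction, 1, 3, 6, 10 for the inclusive scan, and 0, 1, 3, 6 for the exclusive scan.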
Additions to Section 6.13.X - "Sub Group Shuffle Functions"
This section was added by the cl_intel_subgroups extension.
- Add short, short2, short3, short4, short8, short16, ushort, ushort2, ushort3, ushort4, ushort8, and ushort16 to the list of data types supported by the intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up, and intel_sub_group_shuffle_xor functions:
The OpenCL C programming language implements the following built-in functions to allow data to be exchanged among work items in a subgroup. These built-in functions need not be encountered by all work items in a subgroup executing the kernel; however, data may only be shuffled among work items encountering the subgroup shuffle function. Shuffling data from a work item that does not encounter the subgroup shuffle function will produce undefined results. For these functions, gentype is float, float2, float3, float4, float8, float16, short, short2, short3, short4, short8, short16, ushort, ushort2, ushort3, ushort4, ushort8, ushort16, int, int2, int3, int4, int8, int16, uint, uint2, uint3, uint4, uint8, uint16, long, or ulong.
If cl_khr_fp16 is supported, gentype also includes half.
If cl_khr_fp64 or doubles are supported, gentype also includes double.
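As an illustration of how 16-bit values move between work items, the following host-side C sketch models the ushort forms of intel_sub_group_shuffle and intel_sub_group_shuffle_xor. It assumes the behavior described in the cl_intel_subgroups extension: shuffle returns the data held by the work item whose subgroup local ID is c, and shuffle_xor returns the data held by the work item whose ID is the caller's ID XORed with value. The `model_*` names are hypothetical, and the XOR model simplifies the real built-in by using one value for every lane.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model (hypothetical helper names). data[i] is the value held
   by the work item with subgroup local ID i; n is the subgroup size. */

/* Models ushort intel_sub_group_shuffle( ushort data, uint c ):
   lane i receives the data held by lane c[i]. Every c[i] must be < n;
   out-of-range indices are not modeled. */
void model_shuffle_us(const uint16_t *data, const uint32_t *c,
                      uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = data[c[i]];
}

/* Models ushort intel_sub_group_shuffle_xor( ushort data, uint value )
   with the same value in every lane (a simplification of this sketch):
   lane i receives the data held by lane (i ^ value). The caller must
   ensure (i ^ value) < n, e.g. n a power of two and value < n. */
void model_shuffle_xor_us(const uint16_t *data, uint32_t value,
                          uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = data[i ^ value];
}
```

With lanes 0 through 3 holding 10, 20, 30, 40, an XOR value of 1 swaps neighboring lanes and yields 20, 10, 40, 30, while shuffle indices 3, 2, 1, 0 reverse the order.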
Modifications to Section 6.13.X "Sub Group Read and Write Functions"
This section was added by the cl_intel_subgroups extension.
- Add suffixed aliases of the previously un-suffixed 32-bit block read and write functions. There is no change to the description or behavior of these functions:
Function:
    uint    intel_sub_group_block_read( const __global uint* p )
    uint2   intel_sub_group_block_read2( const __global uint* p )
    uint4   intel_sub_group_block_read4( const __global uint* p )
    uint8   intel_sub_group_block_read8( const __global uint* p )
    uint    intel_sub_group_block_read_ui( const __global uint* p )
    uint2   intel_sub_group_block_read_ui2( const __global uint* p )
    uint4   intel_sub_group_block_read_ui4( const __global uint* p )
    uint8   intel_sub_group_block_read_ui8( const __global uint* p )
Description:
Reads 1, 2, 4, or 8 uints of data for each work item in the subgroup from the specified pointer as a block operation…
Function:
    uint    intel_sub_group_block_read( image2d_t image, int2 byte_coord )
    uint2   intel_sub_group_block_read2( image2d_t image, int2 byte_coord )
    uint4   intel_sub_group_block_read4( image2d_t image, int2 byte_coord )
    uint8   intel_sub_group_block_read8( image2d_t image, int2 byte_coord )
    uint    intel_sub_group_block_read_ui( image2d_t image, int2 byte_coord )
    uint2   intel_sub_group_block_read_ui2( image2d_t image, int2 byte_coord )
    uint4   intel_sub_group_block_read_ui4( image2d_t image, int2 byte_coord )
    uint8   intel_sub_group_block_read_ui8( image2d_t image, int2 byte_coord )
Description:
Reads 1, 2, 4, or 8 uints of data for each work item in the subgroup from the specified image at the specified coordinate as a block operation…
Function:
    void    intel_sub_group_block_write( __global uint* p, uint data )
    void    intel_sub_group_block_write2( __global uint* p, uint2 data )
    void    intel_sub_group_block_write4( __global uint* p, uint4 data )
    void    intel_sub_group_block_write8( __global uint* p, uint8 data )
    void    intel_sub_group_block_write_ui( __global uint* p, uint data )
    void    intel_sub_group_block_write_ui2( __global uint* p, uint2 data )
    void    intel_sub_group_block_write_ui4( __global uint* p, uint4 data )
    void    intel_sub_group_block_write_ui8( __global uint* p, uint8 data )
Description:
Writes 1, 2, 4, or 8 uints of data for each work item in the subgroup to the specified pointer as a block operation…
Function:
    void    intel_sub_group_block_write( image2d_t image, int2 byte_coord, uint data )
    void    intel_sub_group_block_write2( image2d_t image, int2 byte_coord, uint2 data )
    void    intel_sub_group_block_write4( image2d_t image, int2 byte_coord, uint4 data )
    void    intel_sub_group_block_write8( image2d_t image, int2 byte_coord, uint8 data )
    void    intel_sub_group_block_write_ui( image2d_t image, int2 byte_coord, uint data )
    void    intel_sub_group_block_write_ui2( image2d_t image, int2 byte_coord, uint2 data )
    void    intel_sub_group_block_write_ui4( image2d_t image, int2 byte_coord, uint4 data )
    void    intel_sub_group_block_write_ui8( image2d_t image, int2 byte_coord, uint8 data )
Description:
Writes 1, 2, 4, or 8 uints of data for each work item in the subgroup to the specified image at the specified coordinate as a block operation…
- Also, add ushort variants of the block read and write functions. In the descriptions of these functions, the "note below describing out-of-bounds behavior" is in the cl_intel_subgroups extension specification:
Function:
    ushort  intel_sub_group_block_read_us( const __global ushort* p )
    ushort2 intel_sub_group_block_read_us2( const __global ushort* p )
    ushort4 intel_sub_group_block_read_us4( const __global ushort* p )
    ushort8 intel_sub_group_block_read_us8( const __global ushort* p )
Description:
Reads 1, 2, 4, or 8 ushorts of data for each work item in the subgroup from the specified pointer as a block operation. The data is read strided, so the first value read is:
    p[ sub_group_local_id ]
and the second value read is:
    p[ sub_group_local_id + max_sub_group_size ]
etc.
p must be aligned to a 32-bit (4-byte) boundary.
There is no defined out-of-range behavior for these functions.
Function:
    ushort  intel_sub_group_block_read_us( image2d_t image, int2 byte_coord )
    ushort2 intel_sub_group_block_read_us2( image2d_t image, int2 byte_coord )
    ushort4 intel_sub_group_block_read_us4( image2d_t image, int2 byte_coord )
    ushort8 intel_sub_group_block_read_us8( image2d_t image, int2 byte_coord )
Description:
Reads 1, 2, 4, or 8 ushorts of data for each work item in the subgroup from the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Also note that the image data is read without format conversion, so each work item may read multiple image elements (for images with element size smaller than 16 bits).
The data is read row-by-row, so the first value is read from the row specified by the y-component of the provided byte_coord, the second value is read from the row at the y-component of the provided byte_coord plus one, etc.
Please see the note below describing out-of-bounds behavior for these functions.
Function:
    void    intel_sub_group_block_write_us( __global ushort* p, ushort data )
    void    intel_sub_group_block_write_us2( __global ushort* p, ushort2 data )
    void    intel_sub_group_block_write_us4( __global ushort* p, ushort4 data )
    void    intel_sub_group_block_write_us8( __global ushort* p, ushort8 data )
Description:
Writes 1, 2, 4, or 8 ushorts of data for each work item in the subgroup to the specified pointer as a block operation. The data is written strided, so the first value is written to:
    p[ sub_group_local_id ]
and the second value is written to:
    p[ sub_group_local_id + max_sub_group_size ]
etc.
p must be aligned to a 128-bit (16-byte) boundary.
There is no defined out-of-range behavior for these functions.
Function:
    void    intel_sub_group_block_write_us( image2d_t image, int2 byte_coord, ushort data )
    void    intel_sub_group_block_write_us2( image2d_t image, int2 byte_coord, ushort2 data )
    void    intel_sub_group_block_write_us4( image2d_t image, int2 byte_coord, ushort4 data )
    void    intel_sub_group_block_write_us8( image2d_t image, int2 byte_coord, ushort8 data )
Description:
Writes 1, 2, 4, or 8 ushorts of data for each work item in the subgroup to the specified image at the specified coordinate as a block operation. Note that the coordinate is a byte coordinate, not an image element coordinate. Unlike the image block read functions, which may read from any arbitrary byte offset, the x-component of the byte coordinate for the image block write functions must be a multiple of four; in other words, the write must begin at a 32-bit boundary. There is no restriction on the y-component of the coordinate. Also note that the image data is written without format conversion, so each work item may write multiple image elements (for images with element size smaller than 16 bits).
The data is written row-by-row, so the first value is written to the row specified by the y-component of the provided byte_coord, the second value is written to the row at the y-component of the provided byte_coord plus one, etc.
Please see the note below describing out-of-bounds behavior for these functions.
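The strided addressing used by the buffer forms of the ushort block read and write functions can be sketched on the host in plain C. This is an illustrative model only: the `model_*` names are hypothetical, each call handles a single work item (identified by `lane`) for the two-element variants, and the alignment helper simply checks an address against the 4-byte (read) or 16-byte (write) requirement quoted above. Out-of-range accesses are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model (hypothetical helper names). lane is the work item's
   subgroup local ID; sg_size stands for max_sub_group_size. */

/* Returns 1 if addr is a multiple of bytes (4 for block reads,
   16 for block writes, per the descriptions above). */
int model_is_aligned(uintptr_t addr, size_t bytes)
{
    return (addr % bytes) == 0;
}

/* Models one work item's share of intel_sub_group_block_write_us2:
   data[0] goes to p[lane], data[1] goes to p[lane + sg_size]. */
void model_block_write_us2(uint16_t *p, size_t lane, size_t sg_size,
                           const uint16_t data[2])
{
    p[lane] = data[0];
    p[lane + sg_size] = data[1];
}

/* Models one work item's share of intel_sub_group_block_read_us2:
   out[0] comes from p[lane], out[1] comes from p[lane + sg_size]. */
void model_block_read_us2(const uint16_t *p, size_t lane, size_t sg_size,
                          uint16_t out[2])
{
    out[0] = p[lane];
    out[1] = p[lane + sg_size];
}
```

With a subgroup size of 8, lane 3 touches elements 3 and 11 of the buffer, so the first components of all lanes form one contiguous run of 8 ushorts and the second components form the next run.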
Issues
None.
Revision History
Rev | Date | Author | Changes |
---|---|---|---|
1 | 2016-10-20 | Ben Ashbaugh | First public revision. |
2 | 2018-11-15 | Ben Ashbaugh | Conversion to asciidoc. |
3 | 2019-09-17 | Ben Ashbaugh | Added vec3 types for shuffles and asciidoctor formatting fixes. |