Description
The __kernel qualifier can be used with the keyword attribute to
declare additional information about the kernel function as described below.
The optional __attribute__((vec_type_hint(<type>)))
[1] is a hint to the compiler and is intended to be a
representation of the computational width of the __kernel, and should
serve as the basis for calculating processor bandwidth utilization when the
compiler is looking to autovectorize the code.
In the __attribute__((vec_type_hint(<type>))) qualifier <type> is one of
the built-in vector types listed in Built-in Vector Data Types or the
constituent scalar element types.
If vec_type_hint (<type>) is not specified, the kernel is assumed to have
the __attribute__((vec_type_hint(int))) qualifier.
For example, where the developer specified a width of float4, the compiler
should assume that the computation usually uses up to 4 lanes of a float
vector, and would decide to merge work-items or possibly even separate one
work-item into many threads to better match the hardware capabilities.
A conforming implementation is not required to autovectorize code, but shall
support the hint.
A compiler may autovectorize, even if no hint is provided.
If an implementation merges N work-items into one thread, it is responsible
for correctly handling cases where the number of global or local
work-items in any dimension modulo N is not zero.
Examples:
// autovectorize assuming float4 as the
// basic computation width
__kernel __attribute__((vec_type_hint(float4)))
void foo( __global float4 *p ) { ... }
// autovectorize assuming double as the
// basic computation width
__kernel __attribute__((vec_type_hint(double)))
void foo( __global float4 *p ) { ... }
// autovectorize assuming int (default)
// as the basic computation width
__kernel
void foo( __global float4 *p ) { ... }
If for example, a __kernel function is declared with
-
__attribute__(( vec_type_hint (float4)))
(meaning that most operations in the __kernel function are explicitly
vectorized using float4) and the kernel is running using Intel®
Advanced Vector Instructions (Intel® AVX) which implements a
8-float-wide vector unit, the autovectorizer might choose to merge two
work-items to one thread, running a second work-item in the high half of the
256-bit AVX register.
As another example, a Power4 machine has two scalar double-precision
floating-point units with an 6-cycle deep pipe.
An autovectorizer for the Power4 machine might choose to interleave six
kernels declared with the __attribute__(( vec_type_hint (double2)))
qualifier into one hardware thread, to ensure that there is always 12-way
parallelism available to saturate the FPUs.
It might also choose to merge 4 or 8 work-items (or some other number) if it
concludes that these are better choices, due to resource utilization
concerns or some preference for divisibility by 2.
The optional __attribute__((work_group_size_hint(X, Y, Z))) is a hint to
the compiler and is intended to specify the work-group size that may be used
i.e. value most likely to be specified by the local_work_size argument to
clEnqueueNDRangeKernel.
For example, the __attribute__((work_group_size_hint(1, 1, 1))) is a
hint to the compiler that the kernel will most likely be executed with a
work-group size of 1.
The optional __attribute__((reqd_work_group_size(X, Y, Z))) is the
work-group size that must be used as the local_work_size argument to
clEnqueueNDRangeKernel.
This allows the compiler to optimize the generated code appropriately for
this kernel.
If Z is one, the work_dim argument to clEnqueueNDRangeKernel can be 2
or 3.
If Y and Z are one, the work_dim argument to clEnqueueNDRangeKernel
can be 1, 2 or 3.
Document Notes
For more information, see the OpenCL C Specification
This page is extracted from the OpenCL C Specification. Fixes and changes should be made to the Specification, not directly.