cl_arm_scheduling

Name Strings

cl_arm_scheduling_controls

Contact

Kevin Petit (kevin.petit 'at' arm.com)

Contributors

Kevin Petit, Arm Ltd.
Radek Szymanski, Arm Ltd.

Notice

Status

Shipping

Version

Built On: 2023-08-08
Version: 0.6.0

Dependencies

This extension is written against the OpenCL Specification v3.0.3.

This extension requires OpenCL 2.0.

Overview

This extension gives applications explicit control over some aspects of work scheduling. It can be used for performance tuning or debugging.

New API Enums

Accepted value for the param_name parameter to clGetDeviceInfo:

CL_DEVICE_SCHEDULING_CONTROLS_CAPABILITIES_ARM           0x41E4

CL_DEVICE_SCHEDULING_KERNEL_BATCHING_ARM                (1 << 0)
CL_DEVICE_SCHEDULING_WORKGROUP_BATCH_SIZE_ARM           (1 << 1)
CL_DEVICE_SCHEDULING_WORKGROUP_BATCH_SIZE_MODIFIER_ARM  (1 << 2)
CL_DEVICE_SCHEDULING_DEFERRED_FLUSH_ARM                 (1 << 3)
CL_DEVICE_SCHEDULING_REGISTER_ALLOCATION_ARM            (1 << 4)
CL_DEVICE_SCHEDULING_WARP_THROTTLING_ARM                (1 << 5)
CL_DEVICE_SCHEDULING_COMPUTE_UNIT_BATCH_QUEUE_SIZE_ARM  (1 << 6)
CL_DEVICE_SCHEDULING_COMPUTE_UNIT_LIMIT_ARM             (1 << 7)

CL_DEVICE_SUPPORTED_REGISTER_ALLOCATIONS_ARM              0x41EB

CL_DEVICE_MAX_WARP_COUNT_ARM                             0x41EA

Accepted value for the param_name parameter to clSetKernelExecInfo:

CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM            0x41E5
CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_MODIFIER_ARM   0x41E6
CL_KERNEL_EXEC_INFO_WARP_COUNT_LIMIT_ARM                0x41E8
CL_KERNEL_EXEC_INFO_COMPUTE_UNIT_MAX_QUEUED_BATCHES_ARM 0x41F1

Accepted value for the properties parameter to clCreateCommandQueueWithProperties:

CL_QUEUE_KERNEL_BATCHING_ARM    0x41E7
CL_QUEUE_DEFERRED_FLUSH_ARM     0x41EC
CL_QUEUE_COMPUTE_UNIT_LIMIT_ARM 0x41F3

Accepted value for the param_name parameter to clGetKernelInfo:

CL_KERNEL_MAX_WARP_COUNT_ARM 0x41E9

New build options

This extension adds a build option to control the number of registers allocated to each thread:

-fregister-allocation=

Missing before version 0.3.0

Modifications to the OpenCL API Specification

(Modify Section 4.2, Querying Devices)

(Add the following to Table 5, Device Queries)

cl_device_info Return Type Description

cl_device_info	Return Type	Description
`CL_DEVICE_SCHEDULING_CONTROLS_CAPABILITIES_ARM`	`cl_device_scheduling_controls_capabilities_arm`	Returns a bitfield of the scheduling controls this device supports: - `CL_DEVICE_SCHEDULING_KERNEL_BATCHING_ARM` is set when the device supports `CL_QUEUE_KERNEL_BATCHING_ARM`. - `CL_DEVICE_SCHEDULING_WORKGROUP_BATCH_SIZE_ARM` is set when the device supports `CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM`. - `CL_DEVICE_SCHEDULING_WORKGROUP_BATCH_SIZE_MODIFIER_ARM` is set when the device supports `CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_MODIFIER_ARM`. - `CL_DEVICE_SCHEDULING_DEFERRED_FLUSH_ARM` is set when the device supports `CL_QUEUE_DEFERRED_FLUSH_ARM`. - `CL_DEVICE_SCHEDULING_REGISTER_ALLOCATION_ARM` is set when the device compiler supports the `-fregister-allocation` option. - `CL_DEVICE_SCHEDULING_WARP_THROTTLING_ARM` is set when the device supports `CL_KERNEL_EXEC_INFO_WARP_COUNT_LIMIT_ARM`, `CL_KERNEL_MAX_WARP_COUNT_ARM` and `CL_DEVICE_MAX_WARP_COUNT_ARM`. Missing before version 0.4. - `CL_DEVICE_SCHEDULING_COMPUTE_UNIT_BATCH_QUEUE_SIZE_ARM` is set when the device supports `CL_KERNEL_EXEC_INFO_COMPUTE_UNIT_MAX_QUEUED_BATCHES_ARM`. Missing before version 0.5. - `CL_DEVICE_SCHEDULING_COMPUTE_UNIT_LIMIT_ARM` is set when the device supports `CL_QUEUE_COMPUTE_UNIT_LIMIT_ARM`. Missing before version 0.6.
`CL_DEVICE_SUPPORTED_REGISTER_ALLOCATIONS_ARM`	`cl_int[]`	Returns an array of valid register allocations for this device. Each of the returned values can be passed to the `-fregister-allocation` build option. Missing before version 0.3.
`CL_DEVICE_MAX_WARP_COUNT_ARM`	`cl_uint`	Returns the maximum number of warps per compute unit a kernel may use. The value returned is an upper bound for any possible kernel. When `CL_DEVICE_SCHEDULING_CONTROLS_CAPABILITIES_ARM` is not set, the call to clGetDeviceInfo returns `CL_INVALID_VALUE`. Missing before version 0.4.

CL_DEVICE_SCHEDULING_CONTROLS_CAPABILITIES_ARM

cl_device_scheduling_controls_capabilities_arm

Returns a bitfield of the scheduling controls this device supports:
- CL_DEVICE_SCHEDULING_KERNEL_BATCHING_ARM is set when the device supports CL_QUEUE_KERNEL_BATCHING_ARM.
- CL_DEVICE_SCHEDULING_WORKGROUP_BATCH_SIZE_ARM is set when the device supports CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM.
- CL_DEVICE_SCHEDULING_WORKGROUP_BATCH_SIZE_MODIFIER_ARM is set when the device supports CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_MODIFIER_ARM. - CL_DEVICE_SCHEDULING_DEFERRED_FLUSH_ARM is set when the device supports CL_QUEUE_DEFERRED_FLUSH_ARM.

- CL_DEVICE_SCHEDULING_REGISTER_ALLOCATION_ARM is set when the device compiler supports the -fregister-allocation option.

- CL_DEVICE_SCHEDULING_WARP_THROTTLING_ARM is set when the device supports CL_KERNEL_EXEC_INFO_WARP_COUNT_LIMIT_ARM, CL_KERNEL_MAX_WARP_COUNT_ARM and CL_DEVICE_MAX_WARP_COUNT_ARM.
Missing before version 0.4.

- CL_DEVICE_SCHEDULING_COMPUTE_UNIT_BATCH_QUEUE_SIZE_ARM is set when the device supports CL_KERNEL_EXEC_INFO_COMPUTE_UNIT_MAX_QUEUED_BATCHES_ARM.
Missing before version 0.5.

- CL_DEVICE_SCHEDULING_COMPUTE_UNIT_LIMIT_ARM is set when the device supports CL_QUEUE_COMPUTE_UNIT_LIMIT_ARM.
Missing before version 0.6.

CL_DEVICE_SUPPORTED_REGISTER_ALLOCATIONS_ARM

cl_int[]

Returns an array of valid register allocations for this device. Each of the returned values can be passed to the -fregister-allocation build option.
Missing before version 0.3.

CL_DEVICE_MAX_WARP_COUNT_ARM

cl_uint

Returns the maximum number of warps per compute unit a kernel may use. The value returned is an upper bound for any possible kernel. When CL_DEVICE_SCHEDULING_CONTROLS_CAPABILITIES_ARM is not set, the call to clGetDeviceInfo returns CL_INVALID_VALUE.
Missing before version 0.4.

(Modify Section 5.1, Command-Queues)

(Add the following to Table 9, List of supported queue creation properties)

Queue Properties Type Description

Queue Properties	Type	Description
`CL_QUEUE_KERNEL_BATCHING_ARM`	`cl_bool`	Whether kernels enqueued to this queue should be batched for submission to the device. `CL_TRUE` means kernels will be batched, `CL_FALSE` that they will not. Defaults to `CL_TRUE`.
`CL_QUEUE_DEFERRED_FLUSH_ARM`	`cl_bool`	Whether flush operations are performed in the thread triggering the flush or deferred for execution in another thread managed by the OpenCL runtime. `CL_TRUE` means flush operations are deferred. Defaults to `CL_TRUE`. Missing before version 0.2.
`CL_QUEUE_COMPUTE_UNIT_LIMIT_ARM`	`cl_uint`	Set a limit for the number of compute units this queue is allowed to use. The limit provided must be greater than 0 and less than or equal to `CL_DEVICE_MAX_COMPUTE_UNITS`. Missing before version 0.6.

CL_QUEUE_KERNEL_BATCHING_ARM

cl_bool

Whether kernels enqueued to this queue should be batched for submission to the device. CL_TRUE means kernels will be batched, CL_FALSE that they will not. Defaults to CL_TRUE.

CL_QUEUE_DEFERRED_FLUSH_ARM

cl_bool

Whether flush operations are performed in the thread triggering the flush or deferred for execution in another thread managed by the OpenCL runtime. CL_TRUE means flush operations are deferred. Defaults to CL_TRUE.
Missing before version 0.2.

CL_QUEUE_COMPUTE_UNIT_LIMIT_ARM

cl_uint

Set a limit for the number of compute units this queue is allowed to use. The limit provided must be greater than 0 and less than or equal to CL_DEVICE_MAX_COMPUTE_UNITS.
Missing before version 0.6.

(Modify Section 5.8.6, Compiler Options)

(Add the following to Optimization Options)

The following options are supported when building programs created from source or intermediate language. Specifying these options when building a program created from a binary will result in CL_INVALID_BUILD_OPTIONS being returned by clBuildProgram.

-fregister-allocation=<number-of-registers-per-thread>: This option overrides the compiler’s selection of the number of machine registers allocated to each thread for all kernels in the program.
-fregister-allocation=<kernel-name>:<number-of-registers-per-thread>[,…]: This option overrides the compiler’s selection of the number of machine registers allocated to each thread for specific kernels in the program.

(Modify Section 5.9, Kernel Objects)

(Add the following to Table 31, List of param_values accepted by *clSetKernelExecInfo*)

cl_kernel_exec_info Type Description

cl_kernel_exec_info	Type	Description
`CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM`	`cl_uint`	Set the size of batches of work groups distributed to compute units. The value is a number of work groups. If set to 0, then the runtime will pick a suitable value automatically. Defaults to 0. If the value is greater than the number of work groups necessary to execute a given ND-range, the actual batch size will be capped at the number of work groups in the ND-range. When a value is not directly usable due to device-specific constraints, it will be rounded up to the next usable value.
`CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_MODIFIER_ARM`	`cl_int`	Modify the size of batches of work groups distributed to compute units. On devices that support `CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM`, the value is a number of work groups added to the batch size calculated by the runtime (when `CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM` set to 0) or set by the application (when `CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM` set to a value greater than 0). On devices that do not support `CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM`, the value is a number in the range [-31,+31]. When set to 0, the runtime-selected batch size is used unmodified. When set to a non-zero value, each increment of one unit in either direction around zero will either divide (negative value) or multiply (positive value) the batch size by 2. If the requested modification is not possible due to hardware constraints, the greatest possible modification will be used.
`CL_KERNEL_EXEC_INFO_WARP_COUNT_LIMIT_ARM`	`cl_uint`	Limit the number of warps allowed to run in each compute unit for this kernel. The value passed must be greater than 0 and smaller than or equal to `CL_KERNEL_MAX_WARP_COUNT_ARM`, otherwise the call to clSetKernelExecInfo will fail and return `CL_INVALID_VALUE`. Missing before version 0.4.
`CL_KERNEL_EXEC_INFO_COMPUTE_UNIT_MAX_QUEUED_BATCHES_ARM`	`cl_uint`	Limit the number of workgroup batches each compute unit can have in its queue. Acceptable values depend on the device. If a value not supported by the device is passed, the call to clSetKernelExecInfo will fail and return `CL_INVALID_VALUE`. All devices that support this functionality accept `0` to let the implementation select the max number of queued workgroup batches. Missing before version 0.5.

CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM

cl_uint

Set the size of batches of work groups distributed to compute units. The value is a number of work groups. If set to 0, then the runtime will pick a suitable value automatically. Defaults to 0. If the value is greater than the number of work groups necessary to execute a given ND-range, the actual batch size will be capped at the number of work groups in the ND-range. When a value is not directly usable due to device-specific constraints, it will be rounded up to the next usable value.

CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_MODIFIER_ARM

cl_int

Modify the size of batches of work groups distributed to compute units.

On devices that support CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM, the value is a number of work groups added to the batch size calculated by the runtime (when CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM set to 0) or set by the application (when CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM set to a value greater than 0).

On devices that do not support CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM, the value is a number in the range [-31,+31]. When set to 0, the runtime-selected batch size is used unmodified. When set to a non-zero value, each increment of one unit in either direction around zero will either divide (negative value) or multiply (positive value) the batch size by 2. If the requested modification is not possible due to hardware constraints, the greatest possible modification will be used.

CL_KERNEL_EXEC_INFO_WARP_COUNT_LIMIT_ARM

cl_uint

Limit the number of warps allowed to run in each compute unit for this kernel.
The value passed must be greater than 0 and smaller than or equal to CL_KERNEL_MAX_WARP_COUNT_ARM, otherwise the call to clSetKernelExecInfo will fail and return CL_INVALID_VALUE.
Missing before version 0.4.

CL_KERNEL_EXEC_INFO_COMPUTE_UNIT_MAX_QUEUED_BATCHES_ARM

cl_uint

Limit the number of workgroup batches each compute unit can have in its queue.
Acceptable values depend on the device. If a value not supported by the device is passed, the call to clSetKernelExecInfo will fail and return CL_INVALID_VALUE. All devices that support this functionality accept 0 to let the implementation select the max number of queued workgroup batches.
Missing before version 0.5.

(Add the following to Table 32, List of supported param_names by *clGetKernelInfo*)

Kernel Info Return type Description

Kernel Info	Return type	Description
`CL_KERNEL_MAX_WARP_COUNT_ARM`	`cl_uint`	Returns the maximum number of warps this kernel can use per compute unit. Missing before version 0.4.

CL_KERNEL_MAX_WARP_COUNT_ARM

cl_uint

Returns the maximum number of warps this kernel can use per compute unit.
Missing before version 0.4.

Interactions with Other Extensions

Some features in this extension interact with cl_arm_thread_limit_hint.

If CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_ARM or CL_KERNEL_EXEC_INFO_WORKGROUP_BATCH_SIZE_MODIFIER_ARM is set to a non-default value at the time a kernel is enqueued, then any thread limit hint specified in the kernel source using the arm_thread_limit_hint attribute will be ignored.

Issues

None.

Revision History

Version	Date	Author	Changes
0.6.0	2022-12-12	Kévin Petit	Add support for compute unit limit
0.5.0	2022-06-29	Kévin Petit	Add support for compute unit batch size queue control
0.4.0	2022-02-28	Kévin Petit	Add support for warp throttling
0.3.0	2021-01-19	Kévin Petit	Add support for register allocation control
0.2.0	2020-09-14	Kévin Petit	Add support for deferred queue flush control
0.1.0	2020-08-28	Kévin Petit	Initial version