Extension and Version Dependencies
- 
This is a experimental extension and must be used with caution. See the description of experimental header files for enablement and stability details. 
Other Extension Metadata
- Last Modified Date
- 
2024-12-13 
- IP Status
- 
No known IP claims. 
- Contributors
- 
- 
Ewan Crawford, Codeplay Software Ltd. 
- 
Gordon Brown, Codeplay Software Ltd. 
- 
Kenneth Benzie, Codeplay Software Ltd. 
- 
Alastair Murray, Codeplay Software Ltd. 
- 
Jack Frankland, Codeplay Software Ltd. 
- 
Balaji Calidas, Qualcomm Technologies Inc. 
- 
Joshua Kelly, Qualcomm Technologies, Inc. 
- 
Kevin Petit, Arm Ltd. 
- 
Aharon Abramson, Intel. 
- 
Ben Ashbaugh, Intel. 
- 
Boaz Ouriel, Intel. 
- 
Chris Gearing, Intel. 
- 
Pekka Jääskeläinen, Tampere University and Intel 
- 
Jan Solanti, Tampere University 
- 
Nikhil Joshi, NVIDIA 
- 
James Price, Google 
- 
Brice Videau, Argonne National Laboratory 
 
- 
Description
cl_khr_ adds the ability to record and replay buffers of
OpenCL commands.
Command-buffers enable a reduction in overhead when enqueuing the same workload multiple times. By separating the command-queue setup from dispatch, the ability to replay a set of previously created commands is introduced.
Background
On embedded devices where building a command stream accounts for a significant expenditure of resources and where workloads are often required to be pipelined, a solution that minimizes driver overhead can significantly improve the utilization of accelerators by removing a bottleneck in repeated command stream generation.
An additional motivator is lowering task execution latency, as devices can be kept occupied with work by repeated submissions, without having to wait on the host to construct commands again for a similar workload.
Rationale
The command-buffer abstraction over the generation of command streams is a proven approach which facilitates a significant reduction in driver overhead in existing real-world applications with repetitive pipelined workloads which are built on top of Vulkan, DirectX 12, and Metal.
A primary goal is for a command-buffer to avoid any interaction with application code after being enqueued until all recorded commands have completed. As such, any command which maps or migrates memory objects; reads or writes memory objects; or enqueues a native kernel, is not available for command-buffer recording. Finally commands recorded into a command buffer do not wait for or return event objects, these are instead replaced with device-side synchronization-point identifiers which enable out-of-order execution of the command-buffer commands.
Adding new entry-points for individual commands, rather than recording existing command-queue APIs with begin/end markers was a design decision made for the following reasons:
- 
Individually specified entry points makes it clearer to the user what’s supported, as opposed to adding a large number of error conditions throughout the specification with all the restrictions. 
- 
Prevents code forking in existing entry points for the implementer, as otherwise separate paths in each entry point need to be maintained for both the recording and normal cases. 
- 
Allows the definition of a new device-side synchronization primitive rather than overloading cl_event. As use ofcl_eventin individual commands allows host interaction from callback and user-events, as well as introducing complexities when a command-buffer is enqueued multiple times regarding profiling and execution status.
- 
New entry points facilitate returning handles to individual commands, allowing those commands to be modified between enqueues of the command buffer. Not all command handles are used in this extension, but providing them facilitates other extensions layered on top to take advantage of them to provide additional mutable functionality. 
Command Synchronization
The command-buffer object has no in-order/out-of-order property set on creation, it is out-of-order, and command ordering is defined by the dependencies set when commands are created. Command dependencies can be defined in 3 ways:
- 
Device-side cl_sync_synchronization-points, providing an explicit list of the commands to depend on.point_ khr 
- 
Appending a clCommandBarrierWithWaitListKHR barrier command. 
- 
Passing an in-order queue when creating the command, creating an implicit dependency on the previous command created in the command-buffer using the same queue. 
clEnqueueCommandBufferKHR submissions to an out-of-order queue have the same execution semantics as other operations enqueued to an out-of-order queue, such as clEnqueueFillBuffer, where execution between enqueued operations may happen concurrently unless dependencies between the operations are expressed with events.
The cl_sync_ type is defined as a cl_uint, giving a hard
upper limit on the number of commands a command-buffer can hold as
CL_UINT_, at which point CL_OUT_ will be returned. However,
it is likely an implementation will reach capacity before this threshold is
hit.
There are no gurantees made around the values of sync-points returned from adding commands to a command-buffer. Any semantics that a could be inferred from the sync-point values returned is implementation defined.
Simultaneous Use
The optional simultaneous use capability was added to the extension so that vendors can support pipelined workflows, where command-buffers are repeatedly enqueued without blocking in user code. However, simultaneous use may result in command-buffers being more expensive to enqueue than in a sequential model, so the capability is optional to enable optimizations on command-buffer recording.
Interactions With Other Extensions
The introduction of the command-buffer abstraction enables functionality
beyond what the cl_khr_ extension currently provides, i.e.
the recording of immutable commands to a single queue which can then be
executed without commands synchronizing outside the command-buffer. Extra
functionality expanding on this is provided as layered extensions on top of
cl_khr_. The layered extensions that currently exist are:
- 
cl_khr_command_ buffer_ multi_ device 
- 
cl_khr_command_ buffer_ mutable_ dispatch 
Having cl_khr_ as a minimal base specification means that the
API defines mechanisms for functionality that is not enabled by this extension,
these are described in the following sub-sections. cl_khr_ will
retain its experimental extension status until other layered extensions are
released, as these may reveal modifications needed to the base specification to
support their intended use cases.
Command Properties
The command recording entry-points allow a properties parameter of
new type cl_command_ to be passed. No properties are
defined in cl_khr_, but the parameter enables layered
extensions to define characteristics of the individual commands.
For example, cl_khr_ defines properties
that can be set when appending a kernel command with clCommandNDRangeKernelKHR.
Command Handles
All command recording entry-points define a cl_mutable_ output
parameter which provides a handle to the specific command being recorded. Use of
these output handles is not enabled by the cl_khr_ extension,
but the handles allow individual commands in a command-buffer to be
referenced by the user.
Use of these handles is enabled in cl_khr_
to give the capability for an application to use the handles to modify commands
between enqueues of a command-buffer.
List of Queues
Only a single command-queue can be associated with a command-buffer in the
cl_khr_ extension, but the API is designed so that the layered
cl_khr_ extension can relax this constraint
to allow commands to be recorded across multiple queues in the same
command-buffer, providing replay of heterogeneous task graphs.
Using multiple queue functionality will result in an error without
cl_khr_ to relax usage of the following API
features:
- 
When a command-buffer is created the API enables passing a list of queues that the command-buffer will record commands to. Only a single queue is permitted in cl_khr_.command_ buffer 
- 
Individual command recording entry-points define a cl_command_parameter for which of the queues set on command-buffer creation that command should be record to. This must be passed as NULL inqueue cl_khr_.command_ buffer 
- 
clEnqueueCommandBufferKHR takes a list of queues for command-buffer execution, correspond to those set on creation. Only a single queue is permitted in cl_khr_.command_ buffer 
New Commands
- 
clCreateCommandBufferKHR 
- 
clRetainCommandBufferKHR 
- 
clReleaseCommandBufferKHR 
- 
clFinalizeCommandBufferKHR 
- 
clEnqueueCommandBufferKHR 
- 
clCommandBarrierWithWaitListKHR 
- 
clCommandCopyBufferKHR 
- 
clCommandCopyBufferRectKHR 
- 
clCommandCopyBufferToImageKHR 
- 
clCommandCopyImageKHR 
- 
clCommandCopyImageToBufferKHR 
- 
clCommandFillBufferKHR 
- 
clCommandFillImageKHR 
- 
clCommandNDRangeKernelKHR 
- 
clGetCommandBufferInfoKHR 
- 
The following SVM entry points are supported only with at least OpenCL 2.0, and starting from version 0.9.4 of this extension - 
clCommandSVMMemcpyKHR 
- 
clCommandSVMMemFillKHR 
 
- 
New Types
- 
cl_device_command_ buffer_ capabilities_ khr 
- 
cl_command_buffer_ khr 
- 
cl_sync_point_ khr 
- 
cl_command_buffer_ info_ khr 
- 
cl_command_buffer_ state_ khr 
- 
cl_command_buffer_ properties_ khr 
- 
cl_command_buffer_ flags_ khr 
- 
cl_command_properties_ khr 
- 
cl_mutable_command_ khr 
New Enums
- 
cl_device_info - 
CL_DEVICE_COMMAND_ BUFFER_ CAPABILITIES_ KHR 
- 
CL_DEVICE_COMMAND_ BUFFER_ REQUIRED_ QUEUE_ PROPERTIES_ KHR 
- 
CL_DEVICE_COMMAND_ BUFFER_ SUPPORTED_ QUEUE_ PROPERTIES_ KHR 
 
- 
- 
cl_device_command_ buffer_ capabilities_ khr - 
CL_COMMAND_BUFFER_ CAPABILITY_ KERNEL_ PRINTF_ KHR 
- 
CL_COMMAND_BUFFER_ CAPABILITY_ DEVICE_ SIDE_ ENQUEUE_ KHR 
- 
CL_COMMAND_BUFFER_ CAPABILITY_ SIMULTANEOUS_ USE_ KHR 
 
- 
- 
cl_command_buffer_ properties_ khr - 
CL_COMMAND_BUFFER_ FLAGS_ KHR 
 
- 
- 
cl_command_buffer_ flags_ khr - 
CL_COMMAND_BUFFER_ SIMULTANEOUS_ USE_ KHR 
 
- 
- 
cl_command_buffer_ info_ khr - 
CL_COMMAND_BUFFER_ QUEUES_ KHR 
- 
CL_COMMAND_BUFFER_ NUM_ QUEUES_ KHR 
- 
CL_COMMAND_BUFFER_ REFERENCE_ COUNT_ KHR 
- 
CL_COMMAND_BUFFER_ STATE_ KHR 
- 
CL_COMMAND_BUFFER_ PROPERTIES_ ARRAY_ KHR 
- 
CL_COMMAND_BUFFER_ CONTEXT_ KHR 
 
- 
- 
cl_command_buffer_ state_ khr - 
CL_COMMAND_BUFFER_ STATE_ RECORDING_ KHR 
- 
CL_COMMAND_BUFFER_ STATE_ EXECUTABLE_ KHR 
- 
CL_COMMAND_BUFFER_ STATE_ PENDING_ KHR 
 
- 
- 
cl_command_type - 
CL_COMMAND_COMMAND_ BUFFER_ KHR 
 
- 
- 
New Error Codes - 
CL_INVALID_COMMAND_ BUFFER_ KHR 
- 
CL_INVALID_SYNC_ POINT_ WAIT_ LIST_ KHR 
- 
CL_INCOMPATIBLE_COMMAND_ QUEUE_ KHR 
 
- 
Sample Code
  #define CL_CHECK(ERROR)                             \
    if (ERROR) {                                      \
      std::cerr << "OpenCL error: " << ERROR << "\n"; \
      return ERROR;                                   \
    }
  int main() {
    cl_platform_id platform;
    CL_CHECK(clGetPlatformIDs(1, &platform, nullptr));
    cl_device_id device;
    CL_CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr));
    cl_int error;
    cl_context context =
        clCreateContext(nullptr, 1, &device, nullptr, nullptr, &error);
    CL_CHECK(error);
    const char* code = R"OpenCLC(
  kernel void vector_addition(global int* tile1, global int* tile2,
                              global int* res) {
    size_t index = get_global_id(0);
    res[index] = tile1[index] + tile2[index];
  }
  )OpenCLC";
    const size_t length = std::strlen(code);
    cl_program program =
        clCreateProgramWithSource(context, 1, &code, &length, &error);
    CL_CHECK(error);
    CL_CHECK(clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr));
    cl_kernel kernel = clCreateKernel(program, "vector_addition", &error);
    CL_CHECK(error);
    constexpr size_t frame_count = 60;
    constexpr size_t frame_elements = 1024;
    constexpr size_t frame_size = frame_elements * sizeof(cl_int);
    constexpr size_t tile_count = 16;
    constexpr size_t tile_elements = frame_elements / tile_count;
    constexpr size_t tile_size = tile_elements * sizeof(cl_int);
    cl_mem buffer_tile1 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_tile2 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_res =
        clCreateBuffer(context, CL_MEM_WRITE_ONLY, tile_size, nullptr, &error);
    CL_CHECK(error);
    CL_CHECK(clSetKernelArg(kernel, 0, sizeof(buffer_tile1), &buffer_tile1));
    CL_CHECK(clSetKernelArg(kernel, 1, sizeof(buffer_tile2), &buffer_tile2));
    CL_CHECK(clSetKernelArg(kernel, 2, sizeof(buffer_res), &buffer_res));
    cl_command_queue command_queue =
      clCreateCommandQueue(context, device,
                           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &error);
    CL_CHECK(error);
    cl_command_buffer_khr command_buffer =
        clCreateCommandBufferKHR(1, &command_queue, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_src1 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_src2 =
        clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, nullptr, &error);
    CL_CHECK(error);
    cl_mem buffer_dst =
        clCreateBuffer(context, CL_MEM_WRITE_ONLY, frame_size, nullptr, &error);
    CL_CHECK(error);
    cl_sync_point_khr tile_sync_point = 0;
    for (size_t tile_index = 0; tile_index < tile_count; tile_index++) {
      std::array<cl_sync_point_khr, 2> copy_sync_points;
      CL_CHECK(clCommandCopyBufferKHR(command_buffer,
          command_queue, buffer_src1, buffer_tile1, tile_index * tile_size, 0,
          tile_size, tile_sync_point ? 1 : 0,
          tile_sync_point ? &tile_sync_point : nullptr, ©_sync_points[0]),
          nullptr);
      CL_CHECK(clCommandCopyBufferKHR(command_buffer,
          command_queue, buffer_src2, buffer_tile2, tile_index * tile_size, 0,
          tile_size, tile_sync_point ? 1 : 0,
          tile_sync_point ? &tile_sync_point : nullptr, ©_sync_points[1]),
          nullptr);
      cl_sync_point_khr nd_sync_point;
      CL_CHECK(clCommandNDRangeKernelKHR(command_buffer,
          command_queue, nullptr, kernel, 1, nullptr, &tile_elements, nullptr,
          copy_sync_points.size(), copy_sync_points.data(), &nd_sync_point,
          nullptr));
      CL_CHECK(clCommandCopyBufferKHR(command_buffer,
          command_queue, buffer_res, buffer_dst, 0, tile_index * tile_size,
          tile_size, 1, &nd_sync_point, &tile_sync_point, nullptr));
    }
    CL_CHECK(clFinalizeCommandBufferKHR(command_buffer));
    std::random_device random_device;
    std::mt19937 random_engine{random_device()};
    std::uniform_int_distribution<cl_int> random_distribution{
        0, std::numeric_limits<cl_int>::max() / 2};
    auto random_generator = [&]() { return random_distribution(random_engine); };
    for (size_t frame_index = 0; frame_index < frame_count; frame_index++) {
      std::array<cl_event, 2> write_src_events;
      std::vector<cl_int> src1(frame_elements);
      std::generate(src1.begin(), src1.end(), random_generator);
      CL_CHECK(clEnqueueWriteBuffer(command_queue, buffer_src1, CL_FALSE, 0,
                                    frame_size, src1.data(), 0, nullptr,
                                    &write_src_events[0]));
      std::vector<cl_int> src2(frame_elements);
      std::generate(src2.begin(), src2.end(), random_generator);
      CL_CHECK(clEnqueueWriteBuffer(command_queue, buffer_src2, CL_FALSE, 0,
                                    frame_size, src2.data(), 0, nullptr,
                                    &write_src_events[1]));
      CL_CHECK(clEnqueueCommandBufferKHR(0, NULL, command_buffer, 2,
                                         write_src_events.data(), nullptr));
      CL_CHECK(clFinish(command_queue));
      CL_CHECK(clReleaseEvent(write_src_event[0]));
      CL_CHECK(clReleaseEvent(write_src_event[1]));
    }
    CL_CHECK(clReleaseCommandBufferKHR(command_buffer));
    CL_CHECK(clReleaseCommandQueue(command_queue));
    CL_CHECK(clReleaseMemObject(buffer_src1));
    CL_CHECK(clReleaseMemObject(buffer_src2));
    CL_CHECK(clReleaseMemObject(buffer_dst));
    CL_CHECK(clReleaseMemObject(buffer_tile1));
    CL_CHECK(clReleaseMemObject(buffer_tile2));
    CL_CHECK(clReleaseMemObject(buffer_res));
    CL_CHECK(clReleaseKernel(kernel));
    CL_CHECK(clReleaseProgram(program));
    CL_CHECK(clReleaseContext(context));
    return 0;
  }Issues
- 
Introduce a clCloneCommandBufferKHRentry-point for cloning a command-buffer.UNRESOLVED 
- 
Enable detached command-buffer execution, where command-buffers are executed on their own internal queue to prevent locking user created queues for the duration of their execution. UNRESOLVED 
- 
Give users more control over command-buffer command capacity via some or all of the following mechanisms. - 
Provide a way for a user to query a command-buffer for the maximum number of commands it can hold. 
- 
Guarantee a minimum command capacity that an implementation must support. 
- 
Provide a mechanism for users to reserve command-buffer capacity on command-buffer creation. 
 RESOLVED - Mechanisms to achieve this could be provided as a layered extension. 
- 
Version History
- 
Revision 0.9.0, 2021-11-10 - 
First assigned version (experimental). 
 
- 
- 
0.9.1, 2022-08-24 - 
Specify an error if a command-buffer is finalized multiple times (experimental). 
 
- 
- 
0.9.2, 2023-03-31 - 
Introduce context query CL_COMMAND_(experimental).BUFFER_ CONTEXT_ KHR 
 
- 
- 
0.9.3, 2023-04-04 - 
Remove Invalid command-buffer state (experimental). 
 
- 
- 
0.9.4, 2023-05-11 - 
Add clCommandSVMMemcpyKHR and clCommandSVMMemFillKHR command entries (experimental). 
 
- 
- 
0.9.5, 2024-07-24 - 
Add a properties parameter to all command recording entry-points (experimental). 
 
- 
- 
0.9.6, 2024-10-02 - 
Add device query for supported queue properties (experimental). 
 
- 
- 
0.9.7, 2024-12-13 - 
Refactor queue compatability between command-buffer creation and enqueue (experimental). 
 
- 
Document Notes
For more information, see the OpenCL Specification
This page is a generated document. Fixes and changes should be made to the generator scripts, not directly.