Ratification Status

Ratified

Extension and Version Dependencies

  • This is a experimental extension and must be used with caution. See the description of experimental header files for enablement and stability details.

Other Extension Metadata

Last Modified Date

2024-12-13

IP Status

No known IP claims.

Contributors
  • Ewan Crawford, Codeplay Software Ltd.

  • Gordon Brown, Codeplay Software Ltd.

  • Kenneth Benzie, Codeplay Software Ltd.

  • Alastair Murray, Codeplay Software Ltd.

  • Jack Frankland, Codeplay Software Ltd.

  • Balaji Calidas, Qualcomm Technologies Inc.

  • Joshua Kelly, Qualcomm Technologies, Inc.

  • Kevin Petit, Arm Ltd.

  • Aharon Abramson, Intel.

  • Ben Ashbaugh, Intel.

  • Boaz Ouriel, Intel.

  • Pekka Jääskeläinen, Tampere University and Intel.

  • Jan Solanti, Tampere University

  • Nikhil Joshi, NVIDIA

  • James Price, Google

Description

The cl_khr_command_buffer extension separates command construction from enqueue by providing a mechanism to record a set of commands which can then be repeatedly enqueued. However, the commands in a command-buffer can only be recorded to a single command-queue specified on command-buffer creation.

cl_khr_command_buffer_multi_device extends the scope of a command-buffer to allow commands to be recorded across multiple queues in the same command-buffer, providing execution of heterogeneous task graphs from command-queues associated with different devices.

The ability for a user to deep copy an existing command-buffer so that the commands target a different device is also made possible by cl_khr_command_buffer_multi_device. Depending on platform support the mapping of commands to the new target device can be done either explicitly by the user, or automatically by the OpenCL runtime.

New Commands

  • clRemapCommandBufferKHR

New Types

Bitfield for querying command-buffer capabilities of an OpenCL Platform with clGetPlatformInfo, see the platform queries table:

  • cl_platform_command_buffer_capabilities_khr

New Enums

Enums for querying device command-buffer capabilities with clGetDeviceInfo, see the device queries table:

  • cl_device_info

    • CL_DEVICE_COMMAND_BUFFER_NUM_SYNC_DEVICES_KHR

    • CL_DEVICE_COMMAND_BUFFER_SYNC_DEVICES_KHR

  • cl_device_command_buffer_capabilities_khr

    • CL_COMMAND_BUFFER_CAPABILITY_MULTIPLE_QUEUE_KHR

  • cl_command_buffer_flags_khr

    • CL_COMMAND_BUFFER_DEVICE_SIDE_SYNC_KHR

  • cl_platform_info

    • CL_PLATFORM_COMMAND_BUFFER_CAPABILITIES_KHR

  • cl_platform_command_buffer_capabilities_khr

    • CL_COMMAND_BUFFER_PLATFORM_UNIVERSAL_SYNC_KHR

    • CL_COMMAND_BUFFER_PLATFORM_REMAP_QUEUES_KHR

    • CL_COMMAND_BUFFER_PLATFORM_AUTOMATIC_REMAP_KHR

Sample Code

#define CL_CHECK(ERROR)                                                        \
  if (ERROR) {                                                                 \
    std::cerr << "OpenCL error: " << ERROR << "\n";                            \
    return ERROR;                                                              \
  }

int main() {
  cl_platform_id platform;
  CL_CHECK(clGetPlatformIDs(1, &platform, nullptr));
  cl_platform_command_buffer_capabilities_khr platform_caps;
  CL_CHECK(clGetPlatformInfo(platform,
                             CL_PLATFORM_COMMAND_BUFFER_CAPABILITIES_KHR,
                             sizeof(platform_caps), &platform_caps, NULL));
  if (!(platform_caps & CL_COMMAND_BUFFER_PLATFORM_AUTOMATIC_REMAP_KHR)) {
    std::cerr << "Command-buffer remapping not supported but used in example, "
                 "skipping\n";
    return 0;
  }

  cl_uint num_devices = 0;
  CL_CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices));
  std::vector<cl_device_id> devices(num_devices);
  CL_CHECK(
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, devices.data(), nullptr));

  // Checks omitted for brevity that either a) the platform supports
  // CL_COMMAND_BUFFER_PLATFORM_UNIVERSAL_SYNC_KHR or b) each device is listed
  // in the others CL_DEVICE_COMMAND_BUFFER_SYNC_DEVICES_KHR

  cl_int error;
  cl_context context =
      clCreateContext(NULL, num_devices, devices.data(), NULL, NULL, &error);
  CL_CHECK(error);

  std::vector<cl_command_queue> queues(num_devices);
  for (cl_uint i = 0; i < num_devices; i++) {
    queues[i] = clCreateCommandQueue(context, devices[i], 0, &error);
    CL_CHECK(error);
  }

  const char *code = R"OpenCLC(
  kernel void vector_addition(global int* tile1, global int* tile2,
                              global int* res) {
    size_t index = get_global_id(0);
    res[index] = tile1[index] + tile2[index];
  }
  )OpenCLC";
  const size_t length = std::strlen(code);

  cl_program program =
      clCreateProgramWithSource(context, 1, &code, &length, &error);
  CL_CHECK(error);

  CL_CHECK(
      clBuildProgram(program, num_devices, devices.data(), NULL, NULL, NULL));

  cl_kernel kernel = clCreateKernel(program, "vector_addition", &error);
  CL_CHECK(error);

  constexpr size_t frame_count = 60;
  constexpr size_t frame_elements = 1024;
  constexpr size_t frame_size = frame_elements * sizeof(cl_int);

  constexpr size_t tile_count = 16;
  constexpr size_t tile_elements = frame_elements / tile_count;
  constexpr size_t tile_size = tile_elements * sizeof(cl_int);

  cl_mem buffer_tile1 =
      clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, NULL, &error);
  CL_CHECK(error);

  cl_mem buffer_tile2 =
      clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, NULL, &error);
  CL_CHECK(error);

  cl_mem buffer_res =
      clCreateBuffer(context, CL_MEM_WRITE_ONLY, tile_size, NULL, &error);
  CL_CHECK(error);

  CL_CHECK(clSetKernelArg(kernel, 0, sizeof(buffer_tile1), &buffer_tile1));
  CL_CHECK(clSetKernelArg(kernel, 1, sizeof(buffer_tile2), &buffer_tile2));
  CL_CHECK(clSetKernelArg(kernel, 2, sizeof(buffer_res), &buffer_res));

  cl_command_buffer_khr original_cmdbuf =
      clCreateCommandBufferKHR(num_devices, queues.data(), nullptr, &error);
  CL_CHECK(error);

  cl_mem buffer_src1 =
      clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, NULL, &error);
  CL_CHECK(error);

  cl_mem buffer_src2 =
      clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, NULL, &error);
  CL_CHECK(error);

  cl_mem buffer_dst =
      clCreateBuffer(context, CL_MEM_READ_WRITE, frame_size, NULL, &error);
  CL_CHECK(error);

  cl_sync_point_khr tile_sync_point = 0;
  for (size_t tile_index = 0; tile_index < tile_count; tile_index++) {
    cl_sync_point_khr copy_sync_points[2];
    CL_CHECK(clCommandCopyBufferKHR(
        original_cmdbuf, queues[tile_index % num_devices], buffer_src1,
        buffer_tile1, tile_index * tile_size, 0, tile_size,
        tile_sync_point ? 1 : 0, tile_sync_point ? &tile_sync_point : NULL,
        &copy_sync_points[0], NULL));

    CL_CHECK(clCommandCopyBufferKHR(
                 original_cmdbuf, queues[tile_index % num_devices], buffer_src2,
                 buffer_tile2, tile_index * tile_size, 0, tile_size,
                 tile_sync_point ? 1 : 0,
                 tile_sync_point ? &tile_sync_point : nullptr,
                 &copy_sync_points[1], NULL));

    cl_sync_point_khr nd_sync_point;
    CL_CHECK(clCommandNDRangeKernelKHR(
        original_cmdbuf, queues[tile_index % num_devices], NULL, kernel, 1,
        NULL, &tile_elements, NULL, 2, copy_sync_points, &nd_sync_point, NULL));

    CL_CHECK(clCommandCopyBufferKHR(
        original_cmdbuf, queues[tile_index % num_devices], buffer_res,
        buffer_dst, 0, tile_index * tile_size, tile_size, 1, &nd_sync_point,
        &tile_sync_point, NULL));
  }

  CL_CHECK(clFinalizeCommandBufferKHR(original_cmdbuf));

  std::random_device random_device;
  std::mt19937 random_engine{random_device()};
  std::uniform_int_distribution<cl_int> random_distribution{
      0, std::numeric_limits<cl_int>::max() / 2};
  auto random_generator = [&]() { return random_distribution(random_engine); };

  auto enqueue_frame = [&](cl_command_buffer_khr command_buffer) {
    for (size_t frame_index = 0; frame_index < frame_count; frame_index++) {
      std::array<cl_event, 3> enqueue_events;
      std::vector<cl_int> src1(frame_elements);
      std::generate(src1.begin(), src1.end(), random_generator);
      CL_CHECK(clEnqueueWriteBuffer(queues[0], buffer_src1, CL_FALSE, 0,
                                    frame_size, src1.data(), 0, nullptr,
                                    &enqueue_events[0]));
      std::vector<cl_int> src2(frame_elements);
      std::generate(src2.begin(), src2.end(), random_generator);
      CL_CHECK(clEnqueueWriteBuffer(queues[0], buffer_src2, CL_FALSE, 0,
                                    frame_size, src2.data(), 0, nullptr,
                                    &enqueue_events[1]));

      CL_CHECK(clEnqueueCommandBufferKHR(0, NULL, command_buffer, 2,
                                         enqueue_events.data(),
                                         &enqueue_events[2]));

      CL_CHECK(clWaitForEvents(1, enqueue_events[2]));

      for (auto e : enqueue_events) {
        CL_CHECK(clReleaseEvent(e));
      }
    }
    return 0;
  };

  error = enqueue_frame(original_cmdbuf);
  CL_CHECK(error);

  // Remap from N queues to 1 queue and run again
  cl_command_buffer_khr remapped_cmdbuf = clRemapCommandBufferKHR(
      original_cmdbuf, CL_TRUE, 1, queues.data(), 0, NULL, NULL, &error);
  CL_CHECK(error);

  error = enqueue_frame(remapped_cmdbuf);
  CL_CHECK(error);

  for (unsigned i = 0; i < num_devices; ++i) {
    CL_CHECK(clReleaseCommandQueue(queues[i]));
  }
  CL_CHECK(clReleaseMemObject(buffer_src1));
  CL_CHECK(clReleaseMemObject(buffer_src2));
  CL_CHECK(clReleaseMemObject(buffer_dst));

  CL_CHECK(clReleaseMemObject(buffer_tile1));
  CL_CHECK(clReleaseMemObject(buffer_tile2));
  CL_CHECK(clReleaseMemObject(buffer_res));

  CL_CHECK(clReleaseCommandBufferKHR(original_cmdbuf));
  CL_CHECK(clReleaseCommandBufferKHR(remapped_cmdbuf));

  CL_CHECK(clReleaseKernel(kernel));
  CL_CHECK(clReleaseProgram(program));
  CL_CHECK(clReleaseContext(context));

  return 0;
}

Issues

  1. In cl_event profiling info for a command-buffer running across the queues for several devices, how do we know what the first & last commands executed are if there is concurrent execution across devices.

    RESOLVED: Allowed an implementation to fallback to CL_PROFILING_COMMAND_SUBMIT and CL_PROFILING_COMMAND_COMPLETE when reporting CL_PROFILING_COMMAND_START & CL_PROFILING_COMMAND_END.

  2. Is an atomic constraint required? This would forbid regular clEnqueue* commands, from interleaving execution on a queue which a command-buffer is being executed on.

    RESOLVED: This behavior can block parallelism, and constraint is expressible by the user through existing synchronization mechanisms if they require it.

  3. It is currently an error if a set of command-queues passed to clEnqueueCommandBufferKHR aren’t compatible with those set on recording. Should we relax this as an optional capability that allows an implementation to do a more expensive command-buffer enqueue for this case?

    RESOLVED: Added as an optional feature.

Version History

  • Revision 0.9.0, 2023-04-14

    • First assigned version (experimental).

  • Revision 0.9.1, 2023-04-30

    • Added clCommandSVMMemcpyKHR and clCommandSVMMemFillKHR as affected functions (experimental).

  • Revision 0.9.2, 2024-12-13

    • Update clRemapCommandBufferKHR behavior to match cl_khr_command_buffer version 0.9.7 (experimental).

See Also

No cross-references are available

Document Notes

For more information, see the OpenCL Specification

This page is a generated document. Fixes and changes should be made to the generator scripts, not directly.

Copyright 2014-2025 The Khronos Group Inc.

SPDX-License-Identifier: CC-BY-4.0