Description

If the cl_khr_extended_async_copies extension is supported, additional Built-in Extended Async Copy Functions are provided that interpret the source and destination as 2D or 3D data.

async_work_group_strided_copy is a special case of async_work_group_copy_2D2D: it copies a single column to a single line, or vice versa. For example,
async_work_group_strided_copy(dst, src, num_gentypes, src_stride, event)
is equivalent to
async_work_group_copy_2D2D(dst, 0, src, 0, sizeof(gentype), 1, num_gentypes, src_stride, 1, event)
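The equivalence above can be checked with a small host-side model. This is only a sketch in plain C, not the built-in itself: it runs synchronously, ignores events, work-groups and address spaces, and the names model_copy_2D2D and model_strided_copy_int are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Synchronous host-side model of the 2D2D addressing rules.
 * Hypothetical helper: no events, no work-group, no address spaces. */
static void model_copy_2D2D(void *dst, size_t dst_offset,
                            const void *src, size_t src_offset,
                            size_t num_bytes_per_element,
                            size_t num_elements_per_line,
                            size_t num_lines,
                            size_t src_total_line_length,
                            size_t dst_total_line_length)
{
    assert(dst != NULL && src != NULL);
    /* Line lengths shorter than a line would overlap lines (undefined). */
    assert(src_total_line_length >= num_elements_per_line);
    assert(dst_total_line_length >= num_elements_per_line);

    /* Offsets are in elements; all arithmetic is done on byte pointers. */
    const char *s = (const char *)src + src_offset * num_bytes_per_element;
    char *d = (char *)dst + dst_offset * num_bytes_per_element;

    for (size_t line = 0; line < num_lines; ++line) {
        memcpy(d + line * dst_total_line_length * num_bytes_per_element,
               s + line * src_total_line_length * num_bytes_per_element,
               num_elements_per_line * num_bytes_per_element);
    }
}

/* Strided gather expressed through the 2D model: each "line" holds one
 * element, source lines are src_stride elements apart, and destination
 * lines are packed back to back. */
static void model_strided_copy_int(int *dst, const int *src,
                                   size_t num_gentypes, size_t src_stride)
{
    model_copy_2D2D(dst, 0, src, 0, sizeof(int),
                    1, num_gentypes, src_stride, 1);
}
```

With a 12-element source and a stride of 4, the model gathers elements 0, 4 and 8 into three consecutive destination slots, which is the column-to-line behavior described above.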

The functions described in this section support arbitrary gentype-based buffers by casting pointers to void*.

These functions do not perform any implicit synchronization of source data such as using a barrier before performing the copy.

These functions are performed by all work-items in a work-group and must therefore be encountered by all work-items in a work-group executing the kernel with the same argument values; otherwise the results are undefined.

The src_offset, dst_offset, src_total_line_length, dst_total_line_length, src_total_plane_area and dst_total_plane_area function arguments are expressed in elements.

Both src_total_line_length and dst_total_line_length describe the number of elements between the beginning of the current line and the beginning of the next line.

Both src_total_plane_area and dst_total_plane_area describe the number of elements between the beginning of the current plane and the beginning of the next plane.
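Taken together, the line length and plane area definitions imply a simple offset formula. The sketch below is one way to write it in plain C; element_offset and byte_offset are hypothetical helper names, and the plane, line and column indices are illustrative parameters that do not appear in the built-in signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of (plane, line, column) relative to the first
 * element of the region (that is, relative to src_offset or dst_offset). */
static size_t element_offset(size_t plane, size_t line, size_t column,
                             size_t total_line_length,
                             size_t total_plane_area)
{
    /* A column at or past the line length would alias the next line. */
    assert(column < total_line_length);
    return plane * total_plane_area + line * total_line_length + column;
}

/* Element offsets become byte offsets by scaling with the element size. */
static size_t byte_offset(size_t element_off, size_t num_bytes_per_element)
{
    return element_off * num_bytes_per_element;
}
```

For example, with a line length of 8 elements and a plane area of 64 elements, element (plane 2, line 3, column 5) lies 2 * 64 + 3 * 8 + 5 = 157 elements from the start of the region.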

These functions return an event object that can be used by wait_group_events to wait for the async copy to finish. The event argument can also be used to associate the async copy with a previous async copy allowing an event to be shared by multiple async copies; otherwise event should be zero. If the event argument is non-zero, the event object supplied as the event argument will be returned.

Table 1. Built-in Extended Async Copy Functions
Function Description
event_t async_work_group_copy_2D2D(
  __local void *dst,
  size_t dst_offset,
  const __global void *src,
  size_t src_offset,
  size_t num_bytes_per_element,
  size_t num_elements_per_line,
  size_t num_lines,
  size_t src_total_line_length,
  size_t dst_total_line_length,
  event_t event)

event_t async_work_group_copy_2D2D(
  __global void *dst,
  size_t dst_offset,
  const __local void *src,
  size_t src_offset,
  size_t num_bytes_per_element,
  size_t num_elements_per_line,
  size_t num_lines,
  size_t src_total_line_length,
  size_t dst_total_line_length,
  event_t event)

Perform an async copy of (num_elements_per_line * num_lines) elements of size num_bytes_per_element from (src + (src_offset * num_bytes_per_element)) to (dst + (dst_offset * num_bytes_per_element)). All pointer arithmetic is performed with implicit casting to char* by the implementation. Each line contains num_elements_per_line elements of size num_bytes_per_element. After each line of transfer, the src address is incremented by src_total_line_length elements (i.e. src_total_line_length * num_bytes_per_element bytes), and the dst address is incremented by dst_total_line_length elements (i.e. dst_total_line_length * num_bytes_per_element bytes), for the next line of transfer.

The behavior of async_work_group_copy_2D2D is undefined if the source or destination addresses exceed the upper bounds of the address space during the copy.

The behavior of async_work_group_copy_2D2D is also undefined if the src_total_line_length or dst_total_line_length values are smaller than num_elements_per_line, i.e. overlapping of lines is undefined.

event_t async_work_group_copy_3D3D(
  __local void *dst,
  size_t dst_offset,
  const __global void *src,
  size_t src_offset,
  size_t num_bytes_per_element,
  size_t num_elements_per_line,
  size_t num_lines,
  size_t num_planes,
  size_t src_total_line_length,
  size_t src_total_plane_area,
  size_t dst_total_line_length,
  size_t dst_total_plane_area,
  event_t event)

event_t async_work_group_copy_3D3D(
  __global void *dst,
  size_t dst_offset,
  const __local void *src,
  size_t src_offset,
  size_t num_bytes_per_element,
  size_t num_elements_per_line,
  size_t num_lines,
  size_t num_planes,
  size_t src_total_line_length,
  size_t src_total_plane_area,
  size_t dst_total_line_length,
  size_t dst_total_plane_area,
  event_t event)

Perform an async copy of ((num_elements_per_line * num_lines) * num_planes) elements of size num_bytes_per_element from (src + (src_offset * num_bytes_per_element)) to (dst + (dst_offset * num_bytes_per_element)), arranged in num_planes planes. All pointer arithmetic is performed with implicit casting to char* by the implementation. Each plane contains num_lines lines. Each line contains num_elements_per_line elements. After each line of transfer, the src address is incremented by src_total_line_length elements (i.e. src_total_line_length * num_bytes_per_element bytes), and the dst address is incremented by dst_total_line_length elements (i.e. dst_total_line_length * num_bytes_per_element bytes), for the next line of transfer.

The behavior of async_work_group_copy_3D3D is undefined if the source or destination addresses exceed the upper bounds of the address space during the copy.

The behavior of async_work_group_copy_3D3D is also undefined if the src_total_line_length or dst_total_line_length values are smaller than num_elements_per_line, i.e. overlapping of lines is undefined.

The behavior of async_work_group_copy_3D3D is also undefined if src_total_plane_area is smaller than (num_lines * src_total_line_length), or dst_total_plane_area is smaller than (num_lines * dst_total_line_length), i.e. overlapping of planes is undefined.
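The 3D addressing and the two non-overlap rules can be illustrated with a host-side reference model. This is a sketch under stated assumptions, not the built-in: it is synchronous plain C with no events or address spaces, model_copy_3D3D and pitches_are_valid are hypothetical names, and the step from one plane to the next is assumed to be the total plane area, as its definition suggests. Where the real function has undefined behavior, the model simply returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 when lines and planes cannot overlap under the stated rules. */
static int pitches_are_valid(size_t num_elements_per_line, size_t num_lines,
                             size_t total_line_length,
                             size_t total_plane_area)
{
    return total_line_length >= num_elements_per_line &&
           total_plane_area >= num_lines * total_line_length;
}

/* Synchronous host-side model of the 3D3D addressing rules.
 * Returns 0 on success, -1 when the real call would be undefined. */
static int model_copy_3D3D(void *dst, size_t dst_offset,
                           const void *src, size_t src_offset,
                           size_t num_bytes_per_element,
                           size_t num_elements_per_line,
                           size_t num_lines, size_t num_planes,
                           size_t src_total_line_length,
                           size_t src_total_plane_area,
                           size_t dst_total_line_length,
                           size_t dst_total_plane_area)
{
    assert(dst != NULL && src != NULL);
    if (!pitches_are_valid(num_elements_per_line, num_lines,
                           src_total_line_length, src_total_plane_area) ||
        !pitches_are_valid(num_elements_per_line, num_lines,
                           dst_total_line_length, dst_total_plane_area))
        return -1;

    /* Offsets are in elements; all arithmetic is done on byte pointers. */
    const char *s = (const char *)src + src_offset * num_bytes_per_element;
    char *d = (char *)dst + dst_offset * num_bytes_per_element;

    for (size_t plane = 0; plane < num_planes; ++plane) {
        for (size_t line = 0; line < num_lines; ++line) {
            /* Element offsets of this line within source and destination. */
            size_t s_el = plane * src_total_plane_area +
                          line * src_total_line_length;
            size_t d_el = plane * dst_total_plane_area +
                          line * dst_total_line_length;
            memcpy(d + d_el * num_bytes_per_element,
                   s + s_el * num_bytes_per_element,
                   num_elements_per_line * num_bytes_per_element);
        }
    }
    return 0;
}
```

As a usage example, copying a 2 x 2 x 2 box that starts at element 5 of a 4 x 4 x 3 source volume (line length 4, plane area 16) into a tightly packed destination (line length 2, plane area 4) yields source elements 5, 6, 9, 10, 21, 22, 25 and 26. Setting num_planes to 1 reduces the model to the 2D2D case.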

See Also

No cross-references are available

Document Notes

For more information, see the OpenCL C Specification

This page is extracted from the OpenCL C Specification. Fixes and changes should be made to the Specification, not directly.

Copyright 2014-2025 The Khronos Group Inc.

SPDX-License-Identifier: CC-BY-4.0