Copyright 2008-2023 The Khronos Group Inc.
This Specification is protected by copyright laws and contains material proprietary to Khronos. Except as described by these terms, it or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast or otherwise exploited in any manner without the express prior written permission of Khronos.
This Specification has been created under the Khronos Intellectual Property Rights Policy, which is Attachment A of the Khronos Group Membership Agreement available at www.khronos.org/files/member_agreement.pdf and defines the terms 'Scope', 'Compliant Portion', and 'Necessary Patent Claims'.
Khronos grants a conditional copyright license to use and reproduce the unmodified Specification for any purpose, without fee or royalty, EXCEPT no licenses to any patent, trademark or other intellectual property rights are granted under these terms. Parties desiring to implement the Specification and make use of Khronos trademarks in relation to that implementation, and receive reciprocal patent license protection under the Khronos Intellectual Property Rights Policy must become Adopters and confirm the implementation as conformant under the process defined by Khronos for this Specification; see https://www.khronos.org/adopters.
Khronos makes no, and expressly disclaims any, representations or warranties, express or implied, regarding this Specification, including, without limitation: merchantability, fitness for a particular purpose, non-infringement of any intellectual property, correctness, accuracy, completeness, timeliness, and reliability. Under no circumstances will Khronos, or any of its Promoters, Contributors or Members, or their respective partners, officers, directors, employees, agents or representatives be liable for any damages, whether direct, indirect, special or consequential damages for lost revenues, lost profits, or otherwise, arising from or in connection with these materials.
Where this Specification identifies specific sections of external references, only those specifically identified sections define normative functionality. The Khronos Intellectual Property Rights Policy excludes external references to materials and associated enabling technology not created by Khronos from the Scope of this specification, and any licenses that may be required to implement such referenced materials and associated technologies must be obtained separately and may involve royalty payments.
Khronos® and Vulkan® are registered trademarks, and SPIR™, SPIR-V™, and SYCL™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used under license by Khronos. OpenGL® is a registered trademark and the OpenGL ES™ and OpenGL SC™ logos are trademarks of Hewlett Packard Enterprise used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.
1. Introduction
Modern processor architectures have embraced parallelism as an important pathway to increased performance. Facing technical challenges with higher clock speeds in a fixed power envelope, Central Processing Units (CPUs) now improve performance by adding multiple cores. Graphics Processing Units (GPUs) have also evolved from fixed function rendering devices into programmable parallel processors. As todays computer systems often include highly parallel CPUs, GPUs and other types of processors, it is important to enable software developers to take full advantage of these heterogeneous processing platforms.
Creating applications for heterogeneous parallel processing platforms is challenging as traditional programming approaches for multi-core CPUs and GPUs are very different. CPU-based parallel programming models are typically based on standards but usually assume a shared address space and do not encompass vector operations. General purpose GPU programming models address complex memory hierarchies and vector operations but are traditionally platform-, vendor- or hardware-specific. These limitations make it difficult for a developer to access the compute power of heterogeneous CPUs, GPUs and other types of processors from a single, multi-platform source code base. More than ever, there is a need to enable software developers to effectively take full advantage of heterogeneous processing platforms from high performance compute servers, through desktop computer systems to handheld devices - that include a diverse mix of parallel CPUs, GPUs and other processors such as DSPs and the Cell/B.E. processor.
OpenCL (Open Computing Language) is an open royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors, giving software developers portable and efficient access to the power of these heterogeneous processing platforms.
OpenCL supports a wide range of applications, ranging from embedded and consumer software to HPC solutions, through a low-level, high-performance, portable abstraction. By creating an efficient, close-to-the-metal programming interface, OpenCL will form the foundation layer of a parallel computing ecosystem of platform-independent tools, middleware and applications. OpenCL is particularly suited to play an increasingly significant role in emerging interactive graphics applications that combine general parallel compute algorithms with graphics rendering pipelines.
OpenCL consists of an API for coordinating parallel computation across heterogeneous processors, a cross-platform programming language, and a cross-platform intermediate language with a well-specified computation environment. The OpenCL standard:
-
Supports both data- and task-based parallel programming models
-
Supports kernels written using a subset of ISO C99 with extensions for parallel execution
-
Supports kernels represented by a portable and self-contained intermediate language (e.g. SPIR-V) with support for parallel execution
-
Defines consistent numerical requirements based on IEEE 754
-
Defines a configuration profile for handheld and embedded devices
-
Supports efficient interop with OpenGL, OpenGL ES and other APIs
This document begins with an overview of basic concepts and the architecture of OpenCL, followed by a detailed description of its execution model, memory model and synchronization support. It then discusses the OpenCL platform and runtime API. Some examples are given that describe sample compute use-cases and how they would be written in OpenCL. The specification is divided into a core specification that any OpenCL compliant implementation must support; a handheld/embedded profile which relaxes the OpenCL compliance requirements for handheld and embedded devices; and a set of optional extensions that are likely to move into the core specification in later revisions of the OpenCL specification.
1.1. Normative References
Normative references are references to external documents or resources to which implementers of OpenCL must comply with all, or specified portions of, as described in this specification.
ISO/IEC 9899:2011 - Information technology - Programming languages - C, https://www.iso.org/standard/57853.html (final specification), http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1570.pdf (last public draft).
1.2. Version Numbers
The OpenCL version number follows a major.minor-revision scheme. When this version number is used within the API it generally only includes the major.minor components of the version number.
A difference in the major or minor version number indicates that some amount of new functionality has been added to the specification, and may also include behavior changes and bug fixes. Functionality may also be deprecated or removed when the major or minor version changes.
A difference in the revision number indicates small changes to the specification, typically to fix a bug or to clarify language. When the revision number changes there may be an impact on the behavior of existing functionality, but this should not affect backwards compatibility. Functionality should not be added or removed when the revision number changes.
1.3. Unified Specification
This document specifies all versions of the OpenCL API.
There are three ways that an OpenCL feature may be described in terms of what versions of OpenCL support that feature.
-
Missing before major.minor: Features that were introduced in version major.minor. Implementations of an earlier version of OpenCL will not provide these features.
-
Deprecated by major.minor: Features that were deprecated in version major.minor, see the definition of deprecation in the glossary.
-
Universal: Features that have no mention of what version they are missing before or deprecated by are available in all versions of OpenCL.
2. Glossary
- Application
-
The combination of the program running on the host and OpenCL devices.
- Acquire semantics
-
One of the memory order semantics defined for synchronization operations. Acquire semantics apply to atomic operations that load from memory. Given two units of execution, A and B, acting on a shared atomic object M, if A uses an atomic load of M with acquire semantics to synchronize-with an atomic store to M by B that used release semantics, then A's atomic load will occur before any subsequent operations by A. Note that the memory orders release, sequentially consistent, and acquire_release all include release semantics and effectively pair with a load using acquire semantics.
- Acquire release semantics
-
A memory order semantics for synchronization operations (such as atomic operations) that has the properties of both acquire and release memory orders. It is used with read-modify-write operations.
- Atomic operations
-
Operations that at any point, and from any perspective, have either occurred completely, or not at all. Memory orders associated with atomic operations may constrain the visibility of loads and stores with respect to the atomic operations (see relaxed semantics, acquire semantics, release semantics or acquire release semantics).
- Blocking and Non-Blocking Enqueue API calls
-
A non-blocking enqueue API call places a command on a command-queue and returns immediately to the host. The blocking-mode enqueue API calls do not return to the host until the command has completed.
- Barrier
-
There are three types of barriers a command-queue barrier, a work-group barrier and a sub-group barrier.
-
The OpenCL API provides a function to enqueue a command-queue barrier command. This barrier command ensures that all previously enqueued commands to a command-queue have finished execution before any following commands enqueued in the command-queue can begin execution.
-
The OpenCL kernel execution model provides built-in work-group barrier functionality. This barrier built-in function can be used by a kernel executing on a device to perform synchronization between work-items in a work-group executing the kernel. All the work-items of a work-group must execute the barrier construct before any are allowed to continue execution beyond the barrier.
-
The OpenCL kernel execution model provides built-in sub-group barrier functionality. This barrier built-in function can be used by a kernel executing on a device to perform synchronization between work-items in a sub-group executing the kernel. All the work-items of a sub-group must execute the barrier construct before any are allowed to continue execution beyond the barrier.
-
- Buffer Object
-
A memory object that stores a linear collection of bytes. Buffer objects are accessible using a pointer in a kernel executing on a device. Buffer objects can be manipulated by the host using OpenCL API calls. A buffer object encapsulates the following information:
-
Size in bytes.
-
Properties that describe usage information and which region to allocate from.
-
Buffer data.
-
- Built-in Kernel
-
A built-in kernel is a kernel that is executed on an OpenCL device or custom device by fixed-function hardware or in firmware. Applications can query the built-in kernels supported by a device or custom device. A program object can only contain kernels written in OpenCL C or built-in kernels but not both. See also Kernel and Program.
- Child kernel
-
See Device-side enqueue.
- Command
-
The OpenCL operations that are submitted to a command-queue for execution. For example, OpenCL commands issue kernels for execution on a compute device, manipulate memory objects, etc.
- Command-queue
-
An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. Refer to In-order Execution_and_Out-of-order Execution.
- Command-queue Barrier
-
See Barrier.
- Command synchronization
-
Constraints on the order that commands are launched for execution on a device defined in terms of the synchronization points that occur between commands in host command-queues and between commands in device-side command-queues. See synchronization points.
- Complete
-
The final state in the six state model for the execution of a command. The transition into this state occurs is signaled through event objects or callback functions associated with a command.
- Compute Device Memory
-
This refers to one or more memories attached to the compute device.
- Compute Unit
-
An OpenCL device has one or more compute units. A work-group executes on a single compute unit. A compute unit is composed of one or more processing elements and local memory. A compute unit may also include dedicated texture filter units that can be accessed by its processing elements.
- Concurrency
-
A property of a system in which a set of tasks in a system can remain active and make progress at the same time. To utilize concurrent execution when running a program, a programmer must identify the concurrency in their problem, expose it within the source code, and then exploit it using a notation that supports concurrency.
- Constant Memory
-
A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.
- Context
-
The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects.
- Control flow
-
The flow of instructions executed by a work-item. Multiple logically related work-items may or may not execute the same control flow. The control flow is said to be converged if all the work-items in the set execution the same stream of instructions. In a diverged control flow, the work-items in the set execute different instructions. At a later point, if a diverged control flow becomes converged, it is said to be a re-converged control flow.
- Converged control flow
-
See Control flow.
- Custom Device
-
An OpenCL device that fully implements the OpenCL Runtime but does not support programs written in OpenCL C. A custom device may be specialized non-programmable hardware that is very power efficient and performant for directed tasks or hardware with limited programmable capabilities such as specialized DSPs. Custom devices are not OpenCL conformant. Custom devices may support an online compiler. Programs for custom devices can be created using the OpenCL runtime APIs that allow OpenCL programs to be created from source (if an online compiler is supported) and/or binary, or from built-in kernels supported by the device. See also Device.
- Data Parallel Programming Model
-
Traditionally, this term refers to a programming model where concurrency is expressed as instructions from a single program applied to multiple elements within a set of data structures. The term has been generalized in OpenCL to refer to a model wherein a set of instructions from a single program are applied concurrently to each point within an abstract domain of indices.
- Data race
-
The execution of a program contains a data race if it contains two actions in different work-items or host threads where (1) one action modifies a memory location and the other action reads or modifies the same memory location, and (2) at least one of these actions is not atomic, or the corresponding memory scopes are not inclusive, and (3) the actions are global actions unordered by the global-happens-before relation or are local actions unordered by the local-happens before relation.
- Deprecation
-
Existing features are marked as deprecated if their usage is not recommended as that feature is being de-emphasized, superseded and may be removed from a future version of the specification.
- Device
-
A device is a collection of compute units. A command-queue is used to queue commands to a device. Examples of commands include executing kernels, or reading and writing memory objects. OpenCL devices typically correspond to a GPU, a multi-core CPU, and other processors such as DSPs and the Cell/B.E. processor.
- Device-side enqueue
-
A mechanism whereby a kernel-instance is enqueued by a kernel-instance running on a device without direct involvement by the host program. This produces nested parallelism; i.e. additional levels of concurrency are nested inside a running kernel-instance. The kernel-instance executing on a device (the parent kernel) enqueues a kernel-instance (the child kernel) to a device-side command queue. Child and parent kernels execute asynchronously though a parent kernel does not complete until all of its child-kernels have completed.
- Diverged control flow
-
See Control flow.
- Ended
-
The fifth state in the six state model for the execution of a command. The transition into this state occurs when execution of a command has ended. When a Kernel-enqueue command ends, all of the work-groups associated with that command have finished their execution.
- Event Object
-
An event object encapsulates the status of an operation such as a command. It can be used to synchronize operations in a context.
- Event Wait List
-
An event wait list is a list of event objects that can be used to control when a particular command begins execution.
- Fence
-
A memory ordering operation without an associated atomic object. A fence can use the acquire semantics, release semantics, or acquire release semantics.
- Framework
-
A software system that contains the set of components to support software development and execution. A framework typically includes libraries, APIs, runtime systems, compilers, etc.
- Generic address space
-
An address space that include the private, local, and global address spaces available to a device. The generic address space supports conversion of pointers to and from private, local and global address spaces, and hence lets a programmer write a single function that at compile time can take arguments from any of the three named address spaces.
- Global Happens before
-
See Happens before.
- Global ID
-
A global ID is used to uniquely identify a work-item and is derived from the number of global work-items specified when executing a kernel. The global ID is a N-dimensional value that starts at (0, 0, … 0). See also Local ID.
- Global Memory
-
A memory region accessible to all work-items executing in a context. It is accessible to the host using commands such as read, write and map. Global memory is included within the generic address space that includes the private and local address spaces.
- GL share group
-
A GL share group object manages shared OpenGL or OpenGL ES resources such as textures, buffers, framebuffers, and renderbuffers and is associated with one or more GL context objects. The GL share group is typically an opaque object and not directly accessible.
- Handle
-
An opaque type that references an object allocated by OpenCL. Any operation on an object occurs by reference to that object’s handle. Each object must have a unique handle value during the course of its lifetime. Handle values may be, but are not required to be, re-used by an implementation.
- Happens before
-
An ordering relationship between operations that execute on multiple units of execution. If an operation A happens-before operation B then A must occur before B; in particular, any value written by A will be visible to B. We define two separate happens before relations: global-happens-before and local-happens-before. These are defined in Memory Ordering Rules.
- Host
-
The host interacts with the context using the OpenCL API.
- Host-thread
-
The unit of execution that executes the statements in the host program.
- Host pointer
-
A pointer to memory that is in the virtual address space on the host.
- Illegal
-
Behavior of a system that is explicitly not allowed and will be reported as an error when encountered by OpenCL.
- Image Object
-
A memory object that stores a two- or three-dimensional structured array. Image data can only be accessed with read and write functions. The read functions use a sampler.
The image object encapsulates the following information:
-
Dimensions of the image.
-
Description of each element in the image.
-
Properties that describe usage information and which region to allocate from.
-
Image data.
The elements of an image are selected from a list of predefined image formats.
-
- Implementation Defined
-
Behavior that is explicitly allowed to vary between conforming implementations of OpenCL. An OpenCL implementor is required to document the implementation-defined behavior.
- Independent Forward Progress
-
If an entity supports independent forward progress, then if it is otherwise not dependent on any actions due to be performed by any other entity (for example it does not wait on a lock held by, and thus that must be released by, any other entity), then its execution cannot be blocked by the execution of any other entity in the system (it will not be starved). Work-items in a subgroup, for example, typically do not support independent forward progress, so one work-item in a subgroup may be completely blocked (starved) if a different work-item in the same subgroup enters a spin loop.
- In-order Execution
-
A model of execution in OpenCL where the commands in a command-queue are executed in order of submission with each command running to completion before the next one begins. See Out-of-order Execution.
- Intermediate Language
-
A lower-level language that may be used to create programs. SPIR-V is a required intermediate language (IL) for OpenCL 2.1 and 2.2 devices. Other OpenCL devices may optionally support SPIR-V or other ILs.
- Kernel
-
A kernel is a function declared in a program and executed on an OpenCL device. A kernel is identified by the
__kernel
orkernel
qualifier applied to any function defined in a program. - Kernel-instance
-
The work carried out by an OpenCL program occurs through the execution of kernel-instances on devices. The kernel instance is the kernel object, the values associated with the arguments to the kernel, and the parameters that define the NDRange index space.
- Kernel Object
-
A kernel object encapsulates a specific kernel function declared in a program and the argument values to be used when executing this kernel function.
- Kernel Language
-
A language that is used to represent source code for kernel. Kernels may be directly created from OpenCL C kernel language source strings. Other kernel languages may be supported by compiling to SPIR-V, another supported Intermediate Language, or to a device-specific program binary format.
- Launch
-
The transition of a command from the submitted state to the ready state. See Ready.
- Local ID
-
A local ID specifies a unique work-item ID within a given work-group that is executing a kernel. The local ID is a N-dimensional value that starts at (0, 0, … 0). See also Global ID.
- Local Memory
-
A memory region associated with a work-group and accessible only by work-items in that work-group. Local memory is included within the generic address space that includes the private and global address spaces.
- Marker
-
A command queued in a command-queue that can be used to tag all commands queued before the marker in the command-queue. The marker command returns an event which can be used by the application to queue a wait on the marker event i.e. wait for all commands queued before the marker command to complete.
- Memory Consistency Model
-
Rules that define which values are observed when multiple units of execution load data from any shared memory plus the synchronization operations that constrain the order of memory operations and define synchronization relationships. The memory consistency model in OpenCL is based on the memory model from the ISO C11 programming language.
- Memory Objects
-
A memory object is a handle to a reference counted region of Global Memory. Also see Buffer Object and Image Object.
- Memory Regions (or Pools)
-
A distinct address space in OpenCL. Memory regions may overlap in physical memory though OpenCL will treat them as logically distinct. The memory regions are denoted as private, local, constant, and global.
- Memory Scopes
-
These memory scopes define a hierarchy of visibilities when analyzing the ordering constraints of memory operations. They are defined by the values of the memory_scope enumeration constant. Current values are memory_scope_work_item (memory constraints only apply to a single work-item and in practice apply only to image operations), memory_scope_sub_group (memory-ordering constraints only apply to work-items executing in a sub-group), memory_scope_work_group (memory-ordering constraints only apply to work-items executing in a work-group), memory_scope_device (memory-ordering constraints only apply to work-items executing on a single device) and memory_scope_all_svm_devices or equivalently memory_scope_all_devices (memory-ordering constraints only apply to work-items executing across multiple devices and when using shared virtual memory).
- Modification Order
-
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens-before B, then A shall precede B in the modification order of M. Note that the modification order of an atomic object M is independent of whether M is in local or global memory.
- Nested Parallelism
-
See device-side enqueue.
- Object
-
Objects are abstract representation of the resources that can be manipulated by the OpenCL API. Examples include program objects, kernel objects, and memory objects.
- Out-of-Order Execution
-
A model of execution in which commands placed in the work queue may begin and complete execution in any order consistent with constraints imposed by event wait lists_and_command-queue barrier. See In-order Execution.
- Parent device
-
The OpenCL device which is partitioned to create sub-devices. Not all parent devices are root devices. A root device might be partitioned and the sub-devices partitioned again. In this case, the first set of sub-devices would be parent devices of the second set, but not the root devices. Also see Device, parent device and root device.
- Parent kernel
-
see Device-side enqueue.
- Pipe
-
The pipe memory object conceptually is an ordered sequence of data items. A pipe has two endpoints: a write endpoint into which data items are inserted, and a read endpoint from which data items are removed. At any one time, only one kernel instance may write into a pipe, and only one kernel instance may read from a pipe. To support the producer consumer design pattern, one kernel instance connects to the write endpoint (the producer) while another kernel instance connects to the reading endpoint (the consumer).
- Platform
-
The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform.
- Private Memory
-
A region of memory private to a work-item. Variables defined in one work-items private memory are not visible to another work-item.
- Processing Element
-
A virtual scalar processor. A work-item may execute on one or more processing elements.
- Program
-
An OpenCL program consists of a set of kernels. Programs may also contain auxiliary functions called by the kernel functions and constant data.
- Program Object
-
A program object encapsulates the following information:
-
A reference to an associated context.
-
A program source or binary.
-
The latest successfully built program executable, the list of devices for which the program executable is built, the build options used and a build log.
-
The number of kernel objects currently attached.
-
- Queued
-
The first state in the six state model for the execution of a command. The transition into this state occurs when the command is enqueued into a command-queue.
- Ready
-
The third state in the six state model for the execution of a command. The transition into this state occurs when pre-requisites constraining execution of a command have been met; i.e. the command has been launched. When a kernel-enqueue command is launched, work-groups associated with the command are placed in a devices work-pool from which they are scheduled for execution.
- Re-converged Control Flow
-
see Control flow.
- Reference Count
-
The life span of an OpenCL object is determined by its reference count, an internal count of the number of references to the object. When you create an object in OpenCL, its reference count is set to one. Subsequent calls to the appropriate retain API (such as clRetainContext, clRetainCommandQueue) increment the reference count. Calls to the appropriate release API (such as clReleaseContext, clReleaseCommandQueue) decrement the reference count. Implementations may also modify the reference count, e.g. to track attached objects or to ensure correct operation of in-progress or scheduled activities. The object becomes inaccessible to host code when the number of release operations performed matches the number of retain operations plus the allocation of the object. At this point the reference count may be zero but this is not guaranteed.
- Relaxed Consistency
-
A memory consistency model in which the contents of memory visible to different work-items or commands may be different except at a barrier or other explicit synchronization points.
- Relaxed Semantics
-
A memory order semantics for atomic operations that implies no order constraints. The operation is atomic but it has no impact on the order of memory operations.
- Release Semantics
-
One of the memory order semantics defined for synchronization operations. Release semantics apply to atomic operations that store to memory. Given two units of execution, A and B, acting on a shared atomic object M, if A uses an atomic store of M with release semantics to synchronize-with an atomic load to M by B that used acquire semantics, then A's atomic store will occur after any prior operations by A. Note that the memory orders acquire, sequentially consistent, and acquire_release all include acquire semantics and effectively pair with a store using release semantics.
- Remainder work-groups
-
When the work-groups associated with a kernel-instance are defined, the sizes of a work-group in each dimension may not evenly divide the size of the NDRange in the corresponding dimensions. The result is a collection of work-groups on the boundaries of the NDRange that are smaller than the base work-group size. These are known as remainder work-groups.
- Running
-
The fourth state in the six state model for the execution of a command. The transition into this state occurs when the execution of the command starts. When a Kernel-enqueue command starts, one or more work-groups associated with the command start to execute.
- Root device
-
A root device is an OpenCL device that has not been partitioned. Also see Device, Parent device and Root device.
- Resource
-
A class of objects defined by OpenCL. An instance of a resource is an object. The most common resources are the context, command-queue, program objects, kernel objects, and memory objects. Computational resources are hardware elements that participate in the action of advancing a program counter. Examples include the host, devices, compute units and processing elements.
- Retain, Release
-
The action of incrementing (retain) and decrementing (release) the reference count using an OpenCL object. This is a book keeping functionality to make sure the system doesn’t remove an object before all instances that use this object have finished. Refer to Reference Count.
- Sampler
-
An object that describes how to sample an image when the image is read in the kernel. The image read functions take a sampler as an argument. The sampler specifies the image addressing-mode i.e. how out-of-range image coordinates are handled, the filter mode, and whether the input image coordinate is a normalized or unnormalized value.
- Scope inclusion
-
Two actions A and B are defined to have an inclusive scope if they have the same scope P such that: (1) if P is memory_scope_sub_group, and A and B are executed by work-items within the same sub-group, or (2) if P is memory_scope_work_group, and A and B are executed by work-items within the same work-group, or (3) if P is memory_scope_device, and A and B are executed by work-items on the same device, or (4) if P is memory_scope_all_svm_devices or memory_scope_all_devices, if A and B are executed by host threads or by work-items on one or more devices that can share SVM memory with each other and the host process.
- Sequenced before
-
A relation between evaluations executed by a single unit of execution. Sequenced-before is an asymmetric, transitive, pair-wise relation that induces a partial order between evaluations. Given any two evaluations A and B, if A is sequenced-before B, then the execution of A shall precede the execution of B.
- Sequential consistency
-
Sequential consistency interleaves the steps executed by each unit of execution. Each access to a memory location sees the last assignment to that location in that interleaving.
- Sequentially consistent semantics
-
One of the memory order semantics defined for synchronization operations. When using sequentially-consistent synchronization operations, the loads and stores within one unit of execution appear to execute in program order (i.e., the sequenced-before order), and loads and stores from different units of execution appear to be simply interleaved.
- Shared Virtual Memory (SVM)
-
An address space exposed to both the host and the devices within a context. SVM causes addresses to be meaningful between the host and all of the devices within a context and therefore supports the use of pointer based data structures in OpenCL kernels. It logically extends a portion of the global memory into the host address space therefore giving work-items access to the host address space. There are three types of SVM in OpenCL:
- Coarse-Grained buffer SVM
-
Sharing occurs at the granularity of regions of OpenCL buffer memory objects.
- Fine-Grained buffer SVM
-
Sharing occurs at the granularity of individual loads/stores into bytes within OpenCL buffer memory objects.
- Fine-Grained system SVM
-
Sharing occurs at the granularity of individual loads/stores into bytes occurring anywhere within the host memory.
- SIMD
-
Single Instruction Multiple Data. A programming model where a kernel is executed concurrently on multiple processing elements each with its own data and a shared program counter. All processing elements execute a strictly identical set of instructions.
- Specialization constants
-
Specialization constants are special constant objects that do not have known constant values in an intermediate language (e.g. SPIR-V). Applications may provide updated values for the specialization constants before a program is built. Specialization constants that do not receive a value from an application shall use the default specialization constant value.
- SPMD
-
Single Program Multiple Data. A programming model where a kernel is executed concurrently on multiple processing elements each with its own data and its own program counter. Hence, while all computational resources run the same kernel they maintain their own instruction counter and due to branches in a kernel, the actual sequence of instructions can be quite different across the set of processing elements.
- Sub-device
-
An OpenCL device can be partitioned into multiple sub-devices. The new sub-devices alias specific collections of compute units within the parent device, according to a partition scheme. The sub-devices may be used in any situation that their parent device may be used. Partitioning a device does not destroy the parent device, which may continue to be used along side and intermingled with its child sub-devices. Also see Device, Parent device and Root device.
- Sub-group
-
Sub-groups are an implementation-dependent grouping of work-items within a work-group. The size and number of sub-groups is implementation-defined.
- Sub-group Barrier
-
See Barrier.
- Submitted
-
The second state in the six state model for the execution of a command. The transition into this state occurs when the command is flushed from the command-queue and submitted for execution on the device. Once submitted, a programmer can assume a command will execute once its prerequisites have been met.
- SVM Buffer
-
A memory allocation enabled to work with Shared Virtual Memory (SVM). Depending on how the SVM buffer is created, it can be a coarse-grained or fine-grained SVM buffer. Optionally it may be wrapped by a Buffer Object. See Shared Virtual Memory (SVM).
- Synchronization
-
Synchronization refers to mechanisms that constrain the order of execution and the visibility of memory operations between two or more units of execution.
- Synchronization operations
-
Operations that define memory order constraints in a program. They play a special role in controlling how memory operations in one unit of execution (such as work-items or, when using SVM a host thread) are made visible to another. Synchronization operations in OpenCL include atomic operations and fences.
- Synchronization point
-
A synchronization point between a pair of commands (A and B) assures that results of command A happens-before command B is launched (i.e. enters the ready state) .
- Synchronizes with
-
A relation between operations in two different units of execution that defines a memory order constraint in global memory (global-synchronizes-with) or local memory (local-synchronizes-with).
- Task Parallel Programming Model
-
A programming model in which computations are expressed in terms of multiple concurrent tasks executing in one or more command-queues. The concurrent tasks can be running different kernels.
- Thread-safe
-
An OpenCL API call is considered to be thread-safe if the internal state as managed by OpenCL remains consistent when called simultaneously by multiple host threads. OpenCL API calls that are thread-safe allow an application to call these functions in multiple host threads without having to implement mutual exclusion across these host threads i.e. they are also re-entrant-safe.
- Undefined
-
The behavior of an OpenCL API call, built-in function used inside a kernel or execution of a kernel that is explicitly not defined by OpenCL. A conforming implementation is not required to specify what occurs when an undefined construct is encountered in OpenCL.
- Unit of execution
-
A generic term for a process, OS managed thread running on the host (a host-thread), kernel-instance, host program, work-item or any other executable agent that advances the work associated with a program.
- Valid Object
-
An OpenCL object is considered valid if it meets all of the following criteria:
-
The object was created by a successful call to an OpenCL API function.
-
The object has a strictly positive application-owned reference count.
-
The object has not had its backing memory changed outside of normal usage by the OpenCL implementation (e.g. corrupted by the application, a library it uses, the implementation itself, or any other agent that can access the object’s backing memory).
An object is only valid in the platform where it was created.
An OpenCL implementation must check for a
NULL
object to determine if an object is valid. The behavior for all other invalid objects is implementation-defined. -
- Work-group
-
A collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel-instance and share local memory and work-group functions.
- Work-group Barrier
-
See Barrier.
- Work-group Function
-
A function that carries out collective operations across all the work-items in a work-group. Available collective operations are a barrier, reduction, broadcast, prefix sum, and evaluation of a predicate. A work-group function must occur within a converged control flow; i.e. all work-items in the work-group must encounter precisely the same work-group function.
- Work-group Synchronization
-
Constraints on the order of execution for work-items in a single work-group.
- Work-pool
-
A logical pool associated with a device that holds commands and work-groups from kernel-instances that are ready to execute. OpenCL does not constrain the order that commands and work-groups are scheduled for execution from the work-pool; i.e. a programmer must assume that they could be interleaved. There is one work-pool per device used by all command-queues associated with that device. The work-pool may be implemented in any manner as long as it assures that work-groups placed in the pool will eventually execute.
- Work-item
-
One of a collection of parallel executions of a kernel invoked on a device by a command. A work-item is executed by one or more processing elements as part of a work-group executing on a compute unit. A work-item is distinguished from other work-items by its global ID or the combination of its work-group ID and its local ID within a work-group.
3. The OpenCL Architecture
OpenCL is an open industry standard for programming a heterogeneous collection of CPUs, GPUs and other discrete computing devices organized into a single platform. It is more than a language. OpenCL is a framework for parallel programming and includes a language, API, libraries and a runtime system to support software development. Using OpenCL, for example, a programmer can write general purpose programs that execute on GPUs without the need to map their algorithms onto a 3D graphics API such as OpenGL or DirectX.
The target of OpenCL is expert programmers wanting to write portable yet efficient code. This includes library writers, middleware vendors, and performance oriented application programmers. Therefore OpenCL provides a low-level hardware abstraction plus a framework to support programming and many details of the underlying hardware are exposed.
To describe the core ideas behind OpenCL, we will use a hierarchy of models:
-
Platform Model
-
Memory Model
-
Execution Model
-
Programming Model
3.1. Platform Model
The Platform model for OpenCL is defined below. The model consists of a host connected to one or more OpenCL devices. An OpenCL device is divided into one or more compute units (CUs) which are further divided into one or more processing elements (PEs). Computations on a device occur within the processing elements.
An OpenCL application is implemented as both host code and device kernel code. The host code portion of an OpenCL application runs on a host processor according to the models native to the host platform. The OpenCL application host code submits the kernel code as commands from the host to OpenCL devices. An OpenCL device executes the commands computation on the processing elements within the device.
An OpenCL device has considerable latitude on how computations are mapped onto the devices processing elements. When processing elements within a compute unit execute the same sequence of statements across the processing elements, the control flow is said to be converged. Hardware optimized for executing a single stream of instructions over multiple processing elements is well suited to converged control flows. When the control flow varies from one processing element to another, it is said to be diverged. While a kernel always begins execution with a converged control flow, due to branching statements within a kernel, converged and diverged control flows may occur within a single kernel. This provides a great deal of flexibility in the algorithms that can be implemented with OpenCL.
Programmers may provide programs in the form of OpenCL C source strings, the SPIR-V intermediate language, or as implementation-defined binary objects. An OpenCL platform provides a compiler to translate programs of these forms into executable program objects. The device code compiler may be online or offline. An online compiler is available during host program execution using standard APIs. An offline compiler is invoked outside of host program control, using platform-specific methods. The OpenCL runtime allows developers to get a previously compiled device program executable and be able to load and execute a previously compiled device program executable.
OpenCL defines two kinds of platform profiles: a full profile and a reduced-functionality embedded profile. A full profile platform must provide an online compiler for all its devices. An embedded platform may provide an online compiler, but is not required to do so.
A device may expose special purpose functionality as a built-in kernel. The platform provides APIs for enumerating and invoking the built-in kernels offered by a device, but otherwise does not define their construction or semantics. A custom device supports only built-in kernels, and cannot be programmed via a kernel language.
Built-in kernels and custom devices are missing before version 1.2. |
All device types support the OpenCL execution model, the OpenCL memory model, and the APIs used in OpenCL to manage devices.
The platform model is an abstraction describing how OpenCL views the hardware. The relationship between the elements of the platform model and the hardware in a system may be a fixed property of a device or it may be a dynamic feature of a program dependent on how a compiler optimizes code to best utilize physical hardware.
3.2. Execution Model
The OpenCL execution model is defined in terms of two distinct units of execution: kernels that execute on one or more OpenCL devices and a host program that executes on the host. With regard to OpenCL, the kernels are where the "work" associated with a computation occurs. This work occurs through work-items that execute in groups (work-groups).
A kernel executes within a well-defined context managed by the host. The context defines the environment within which kernels execute. It includes the following resources:
-
Devices: One or more devices exposed by the OpenCL platform.
-
Kernel Objects: The OpenCL functions with their associated argument values that run on OpenCL devices.
-
Program Objects: The program source and executable that implement the kernels.
-
Memory Objects: Variables visible to the host and the OpenCL devices. Instances of kernels operate on these objects as they execute.
The host program uses the OpenCL API to create and manage the context. Functions from the OpenCL API enable the host to interact with a device through a command-queue. Each command-queue is associated with a single device. The commands placed into the command-queue fall into one of three types:
-
Kernel-enqueue commands: Enqueue a kernel for execution on a device.
-
Memory commands: Transfer data between the host and device memory, between memory objects, or map and unmap memory objects from the host address space.
-
Synchronization commands: Explicit synchronization points that define order constraints between commands.
In addition to commands submitted from the host command-queue, a kernel running on a device can enqueue commands to a device-side command queue. This results in child kernels enqueued by a kernel executing on a device (the parent kernel). Regardless of whether the command-queue resides on the host or a device, each command passes through six states.
-
Queued: The command is enqueued to a command-queue. A command may reside in the queue until it is flushed either explicitly (a call to clFlush) or implicitly by some other command.
-
Submitted: The command is flushed from the command-queue and submitted for execution on the device. Once flushed from the command-queue, a command will execute after any prerequisites for execution are met.
-
Ready: All prerequisites constraining execution of a command have been met. The command, or for a kernel-enqueue command the collection of work groups associated with a command, is placed in a device work-pool from which it is scheduled for execution.
-
Running: Execution of the command starts. For the case of a kernel-enqueue command, one or more work-groups associated with the command start to execute.
-
Ended: Execution of a command ends. When a Kernel-enqueue command ends, all of the work-groups associated with that command have finished their execution. Immediate side effects, i.e. those associated with the kernel but not necessarily with its child kernels, are visible to other units of execution. These side effects include updates to values in global memory.
-
Complete: The command and its child commands have finished execution and the status of the event object, if any, associated with the command is set to
CL_COMPLETE
.
The execution states and the transitions between them are summarized below. These states and the concept of a device work-pool are conceptual elements of the execution model. An implementation of OpenCL has considerable freedom in how these are exposed to a program. Five of the transitions, however, are directly observable through a profiling interface. These profiled states are shown below.
Commands communicate their status through Event objects.
Successful completion is indicated by setting the event status associated
with a command to CL_COMPLETE
.
Unsuccessful completion results in abnormal termination of the command which
is indicated by setting the event status to a negative value.
In this case, the command-queue associated with the abnormally terminated
command and all other command-queues in the same context may no longer be
available and their behavior is implementation defined.
A command submitted to a device will not launch until prerequisites that constrain the order of commands have been resolved. These prerequisites have three sources:
-
They may arise from commands submitted to a command-queue that constrain the order in which commands are launched. For example, commands that follow a command queue barrier will not launch until all commands prior to the barrier are complete.
-
The second source of prerequisites is dependencies between commands expressed through events. A command may include an optional list of events. The command will wait and not launch until all the events in the list are in the state CL COMPLETE. By this mechanism, event objects define order constraints between commands and coordinate execution between the host and one or more devices.
-
The third source of prerequisites can be the presence of non-trivial C initializers or C++ constructors for program scope global variables. In this case, OpenCL C/C++ compiler shall generate program initialization kernels that perform C initialization or C++ construction. These kernels must be executed by OpenCL runtime on a device before any kernel from the same program can be executed on the same device. The ND-range for any program initialization kernel is (1,1,1). When multiple programs are linked together, the order of execution of program initialization kernels that belong to different programs is undefined.
Program clean up may result in the execution of one or more program clean up kernels by the OpenCL runtime. This is due to the presence of non-trivial C++ destructors for program scope variables. The ND-range for executing any program clean up kernel is (1,1,1). The order of execution of clean up kernels from different programs (that are linked together) is undefined.
Program initialization and clean-up kernels are missing before version 2.2. |
Note that C initializers, C++ constructors, or C++ destructors for program scope variables cannot use pointers to coarse grain and fine grain SVM allocations.
A command may be submitted to a device and yet have no visible side effects outside of waiting on and satisfying event dependences. Examples include markers, kernels executed over ranges of no work-items or copy operations with zero sizes. Such commands may pass directly from the ready state to the ended state.
Command execution can be blocking or non-blocking. Consider a sequence of OpenCL commands. For blocking commands, the OpenCL API functions that enqueue commands don’t return until the command has completed. Alternatively, OpenCL functions that enqueue non-blocking commands return immediately and require that a programmer defines dependencies between enqueued commands to ensure that enqueued commands are not launched before needed resources are available. In both cases, the actual execution of the command may occur asynchronously with execution of the host program.
Commands within a single command-queue execute relative to each other in one of two modes:
-
In-order Execution: Commands and any side effects associated with commands appear to the OpenCL application as if they execute in the same order they are enqueued to a command-queue.
-
Out-of-order Execution: Commands execute in any order constrained only by explicit synchronization points (e.g. through command queue barriers) or explicit dependencies on events.
Multiple command-queues can be present within a single context. Multiple command-queues execute commands independently. Event objects visible to the host program can be used to define synchronization points between commands in multiple command queues. If such synchronization points are established between commands in multiple command-queues, an implementation must assure that the command-queues progress concurrently and correctly account for the dependencies established by the synchronization points. For a detailed explanation of synchronization points, see the execution model Synchronization section.
The core of the OpenCL execution model is defined by how the kernels execute. When a kernel-enqueue command submits a kernel for execution, an index space is defined. The kernel, the argument values associated with the arguments to the kernel, and the parameters that define the index space define a kernel-instance. When a kernel-instance executes on a device, the kernel function executes for each point in the defined index space. Each of these executing kernel functions is called a work-item. The work-items associated with a given kernel-instance are managed by the device in groups called work-groups. These work-groups define a coarse grained decomposition of the Index space. Work-groups are further divided into sub-groups, which provide an additional level of control over execution.
Sub-groups are missing before version 2.1. |
Work-items have a global ID based on their coordinates within the Index space. They can also be defined in terms of their work-group and the local ID within a work-group. The details of this mapping are described in the following section.
3.2.1. Mapping work-items onto an NDRange
The index space supported by OpenCL is called an NDRange. An NDRange is an N-dimensional index space, where N is one, two or three. The NDRange is decomposed into work-groups forming blocks that cover the Index space. An NDRange is defined by three integer arrays of length N:
-
The extent of the index space (or global size) in each dimension.
-
An offset index F indicating the initial value of the indices in each dimension (zero by default).
-
The size of a work-group (local size) in each dimension.
Each work-items global ID is an N-dimensional tuple. The global ID components are values in the range from F, to F plus the number of elements in that dimension minus one.
Unless a kernel comes from a source that disallows it, e.g. OpenCL C 1.x or
using -cl-uniform-work-group-size
, the size of work-groups in
an NDRange (the local size) need not be the same for all work-groups.
In this case, any single dimension for which the global size is not
divisible by the local size will be partitioned into two regions.
One region will have work-groups that have the same number of work-items as
was specified for that dimension by the programmer (the local size).
The other region will have work-groups with less than the number of work
items specified by the local size parameter in that dimension (the
remainder work-groups).
Work-group sizes could be non-uniform in multiple dimensions, potentially
producing work-groups of up to 4 different sizes in a 2D range and 8
different sizes in a 3D range.
Non-uniform work-group sizes are missing before version 2.0. |
Each work-item is assigned to a work-group and given a local ID to represent its position within the work-group. A work-item’s local ID is an N-dimensional tuple with components in the range from zero to the size of the work-group in that dimension minus one.
Work-groups are assigned IDs similarly. The number of work-groups in each dimension is not directly defined but is inferred from the local and global NDRanges provided when a kernel-instance is enqueued. A work-group’s ID is an N-dimensional tuple with components in the range 0 to the ceiling of the global size in that dimension divided by the local size in the same dimension. As a result, the combination of a work-group ID and the local-ID within a work-group uniquely defines a work-item. Each work-item is identifiable in two ways; in terms of a global index, and in terms of a work-group index plus a local index within a work-group.
For example, consider the 2-dimensional index space shown below. We input the index space for the work-items (Gx, Gy), the size of each work-group (Sx, Sy) and the global ID offset (Fx, Fy). The global indices define an Gxby Gy index space where the total number of work-items is the product of Gx and Gy. The local indices define an Sx by Sy index space where the number of work-items in a single work-group is the product of Sx and Sy. Given the size of each work-group and the total number of work-items we can compute the number of work-groups. A 2-dimensional index space is used to uniquely identify a work-group. Each work-item is identified by its global ID (gx, gy) or by the combination of the work-group ID (wx, wy), the size of each work-group (Sx,Sy) and the local ID (sx, sy) inside the work-group such that
-
(gx, gy) = (wx × Sx + sx + Fx, wy × Sy + sy + Fy)
The number of work-groups can be computed as:
-
(Wx, Wy) = (ceil(Gx / Sx), ceil(Gy / Sy))
Given a global ID and the work-group size, the work-group ID for a work-item is computed as:
-
(wx, wy) = ( (gx - sx - Fx) / Sx, (gy - sy - Fy) / Sy )
Within a work-group work-items may be divided into sub-groups. The mapping of work-items to sub-groups is implementation-defined and may be queried at runtime. While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of.
Sub-groups are missing before version 2.1. |
Work-items are mapped into sub-groups through a combination of compile-time decisions and the parameters of the dispatch. The mapping to sub-groups is invariant for the duration of a kernels execution, across dispatches of a given kernel with the same work-group dimensions, between dispatches and query operations consistent with the dispatch parameterization, and from one work-group to another within the dispatch (excluding the trailing edge work-groups in the presence of non-uniform work-group sizes). In addition, all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index which may be smaller if the size of the work-group is not evenly divisible by the size of the sub-groups.
In the degenerate case, a single sub-group must be supported for each work-group. In this situation all sub-group scope functions are equivalent to their work-group level equivalents.
3.2.2. Execution of kernel-instances
The work carried out by an OpenCL program occurs through the execution of kernel-instances on compute devices. To understand the details of OpenCL’s execution model, we need to consider how a kernel object moves from the kernel-enqueue command, into a command-queue, executes on a device, and completes.
A kernel object is defined as a function within the program object and a collection of arguments connecting the kernel to a set of argument values. The host program enqueues a kernel object to the command queue along with the NDRange and the work-group decomposition. These define a kernel-instance. In addition, an optional set of events may be defined when the kernel is enqueued. The events associated with a particular kernel-instance are used to constrain when the kernel-instance is launched with respect to other commands in the queue or to commands in other queues within the same context.
A kernel-instance is submitted to a device. For an in-order command queue, the kernel instances appear to launch and then execute in that same order; where we use the term appear to emphasize that when there are no dependencies between commands and hence differences in the order that commands execute cannot be observed in a program, an implementation can reorder commands even in an in-order command queue. For an out of order command-queue, kernel-instances wait to be launched until:
-
Synchronization commands enqueued prior to the kernel-instance are satisfied.
-
Each of the events in an optional event list defined when the kernel-instance was enqueued are set to
CL_COMPLETE
.
Once these conditions are met, the kernel-instance is launched and the
work-groups associated with the kernel-instance are placed into a pool of
ready to execute work-groups.
This pool is called a work-pool.
The work-pool may be implemented in any manner as long as it assures that
work-groups placed in the pool will eventually execute.
The device schedules work-groups from the work-pool for execution on the
compute units of the device.
The kernel-enqueue command is complete when all work-groups associated with
the kernel-instance end their execution, updates to global memory associated
with a command are visible globally, and the device signals successful
completion by setting the event associated with the kernel-enqueue command
to CL_COMPLETE
.
While a command-queue is associated with only one device, a single device may be associated with multiple command-queues all feeding into the single work-pool. A device may also be associated with command queues associated with different contexts within the same platform, again all feeding into the single work-pool. The device will pull work-groups from the work-pool and execute them on one or several compute units in any order; possibly interleaving execution of work-groups from multiple commands. A conforming implementation may choose to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since once in the work-pool, they can execute in any order.
The work-items within a single sub-group execute concurrently but not necessarily in parallel (i.e. they are not guaranteed to make independent forward progress). Therefore, only high-level synchronization constructs (e.g. sub-group functions such as barriers) that apply to all the work-items in a sub-group are well defined and included in OpenCL.
Sub-groups are missing before version 2.1. |
Sub-groups execute concurrently within a given work-group and with appropriate device support (see Querying Devices), may make independent forward progress with respect to each other, with respect to host threads and with respect to any entities external to the OpenCL system but running on an OpenCL device, even in the absence of work-group barrier operations. In this situation, sub-groups are able to internally synchronize using barrier operations without synchronizing with each other and may perform operations that rely on runtime dependencies on operations other sub-groups perform.
The work-items within a single work-group execute concurrently but are only guaranteed to make independent progress in the presence of sub-groups and device support. In the absence of this capability, only high-level synchronization constructs (e.g. work-group functions such as barriers) that apply to all the work-items in a work-group are well defined and included in OpenCL for synchronization within the work-group.
In the absence of synchronization functions (e.g. a barrier), work-items within a sub-group may be serialized. In the presence of sub -group functions, work-items within a sub -group may be serialized before any given sub -group function, between dynamically encountered pairs of sub-group functions and between a work-group function and the end of the kernel.
In the absence of independent forward progress of constituent sub-groups, work-items within a work-group may be serialized before, after or between work-group synchronization functions.
3.2.3. Device-side enqueue
Device-side enqueue is missing before version 2.0. |
Algorithms may need to generate additional work as they execute. In many cases, this additional work cannot be determined statically; so the work associated with a kernel only emerges at runtime as the kernel-instance executes. This capability could be implemented in logic running within the host program, but involvement of the host may add significant overhead and/or complexity to the application control flow. A more efficient approach would be to nest kernel-enqueue commands from inside other kernels. This nested parallelism can be realized by supporting the enqueuing of kernels on a device without direct involvement by the host program; so-called device-side enqueue.
Device-side kernel-enqueue commands are similar to host-side kernel-enqueue
commands.
The kernel executing on a device (the parent kernel) enqueues a
kernel-instance (the child kernel) to a device-side command queue.
This is an out-of-order command-queue and follows the same behavior as the
out-of-order command-queues exposed to the host program.
Commands enqueued to a device side command-queue generate and use events to
enforce order constraints just as for the command-queue on the host.
These events, however, are only visible to the parent kernel running on the
device.
When these prerequisite events take on the value CL_COMPLETE
, the
work-groups associated with the child kernel are launched into the devices
work pool.
The device then schedules them for execution on the compute units of the
device.
Child and parent kernels execute asynchronously.
However, a parent will not indicate that it is complete by setting its event
to CL_COMPLETE
until all child kernels have ended execution and have
signaled completion by setting any associated events to the value
CL_COMPLETE
.
Should any child kernel complete with an event status set to a negative
value (i.e. abnormally terminate), the parent kernel will abnormally
terminate and propagate the childs negative event value as the value of the
parents event.
If there are multiple children that have an event status set to a negative
value, the selection of which childs negative event value is propagated is
implementation-defined.
3.2.4. Synchronization
Synchronization refers to mechanisms that constrain the order of execution between two or more units of execution. Consider the following three domains of synchronization in OpenCL:
-
Work-group synchronization: Constraints on the order of execution for work-items in a single work-group
-
Sub-group synchronization: Constraints on the order of execution for work-items in a single sub-group. Note: Sub-groups are missing before version 2.1
-
Command synchronization: Constraints on the order of commands launched for execution
Synchronization across all work-items within a single work-group is carried out using a work-group function. These functions carry out collective operations across all the work-items in a work-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and evaluation of a predicate. A work-group function must occur within a converged control flow; i.e. all work-items in the work-group must encounter precisely the same work-group function. For example, if a work-group function occurs within a loop, the work-items must encounter the same work-group function in the same loop iterations. All the work-items of a work-group must execute the work-group function and complete reads and writes to memory before any are allowed to continue execution beyond the work-group function. Work-group functions that apply between work-groups are not provided in OpenCL since OpenCL does not define forward-progress or ordering relations between work-groups, hence collective synchronization operations are not well defined.
Synchronization across all work-items within a single sub-group is carried out using a sub-group function. These functions carry out collective operations across all the work-items in a sub-group. Available collective operations are: barrier, reduction, broadcast, prefix sum, and evaluation of a predicate. A sub-group function must occur within a converged control flow; i.e. all work-items in the sub-group must encounter precisely the same sub-group function. For example, if a work-group function occurs within a loop, the work-items must encounter the same sub-group function in the same loop iterations. All the work-items of a sub-group must execute the sub-group function and complete reads and writes to memory before any are allowed to continue execution beyond the sub-group function. Synchronization between sub-groups must either be performed using work-group functions, or through memory operations. Using memory operations for sub-group synchronization should be used carefully as forward progress of sub-groups relative to each other is only supported optionally by OpenCL implementations.
Command synchronization is defined in terms of distinct synchronization points. The synchronization points occur between commands in host command-queues and between commands in device-side command-queues. The synchronization points defined in OpenCL include:
-
Launching a command: A kernel-instance is launched onto a device after all events that kernel is waiting-on have been set to
CL_COMPLETE
. -
Ending a command: Child kernels may be enqueued such that they wait for the parent kernel to reach the end state before they can be launched. In this case, the ending of the parent command defines a synchronization point.
-
Completion of a command: A kernel-instance is complete after all of the work-groups in the kernel and all of its child kernels have completed. This is signaled to the host, a parent kernel or other kernels within command queues by setting the value of the event associated with a kernel to
CL_COMPLETE
. -
Blocking Commands: A blocking command defines a synchronization point between the unit of execution that calls the blocking API function and the enqueued command reaching the complete state.
-
Command-queue barrier: The command-queue barrier ensures that all previously enqueued commands have completed before subsequently enqueued commands can be launched.
-
clFinish: This function blocks until all previously enqueued commands in the command queue have completed after which clFinish defines a synchronization point and the clFinish function returns.
A synchronization point between a pair of commands (A and B) assures that results of command A happens-before command B is launched. This requires that any updates to memory from command A complete and are made available to other commands before the synchronization point completes. Likewise, this requires that command B waits until after the synchronization point before loading values from global memory. The concept of a synchronization point works in a similar fashion for commands such as a barrier that apply to two sets of commands. All the commands prior to the barrier must complete and make their results available to following commands. Furthermore, any commands following the barrier must wait for the commands prior to the barrier before loading values and continuing their execution.
These happens-before relationships are a fundamental part of the OpenCL 2.x memory model. When applied at the level of commands, they are straightforward to define at a language level in terms of ordering relationships between different commands. Ordering memory operations inside different commands, however, requires rules more complex than can be captured by the high level concept of a synchronization point. These rules are described in detail in Memory Ordering Rules.
3.2.5. Categories of Kernels
The OpenCL execution model supports three types of kernels:
-
OpenCL kernels are managed by the OpenCL API as kernel objects associated with kernel functions within program objects. OpenCL program objects are created and built using OpenCL APIs. The OpenCL API includes functions to query the kernel languages and and intermediate languages that may be used to create OpenCL program objects for a device.
-
Native kernels are accessed through a host function pointer. Native kernels are queued for execution along with OpenCL kernels on a device and share memory objects with OpenCL kernels. For example, these native kernels could be functions defined in application code or exported from a library. The ability to execute native kernels is optional within OpenCL and the semantics of native kernels are implementation-defined. The OpenCL API includes functions to query capabilities of a device to determine if this capability is supported.
-
Built-in kernels are tied to particular device and are not built at runtime from source code in a program object. The common use of built in kernels is to expose fixed-function hardware or firmware associated with a particular OpenCL device or custom device. The semantics of a built-in kernel may be defined outside of OpenCL and hence are implementation defined. Note: Built-in kernels are missing before version 1.2.
All three types of kernels are manipulated through the OpenCL command queues and must conform to the synchronization points defined in the OpenCL execution model.
3.3. Memory Model
The OpenCL memory model describes the structure, contents, and behavior of the memory exposed by an OpenCL platform as an OpenCL program runs. The model allows a programmer to reason about values in memory as the host program and multiple kernel-instances execute.
An OpenCL program defines a context that includes a host, one or more devices, command-queues, and memory exposed within the context. Consider the units of execution involved with such a program. The host program runs as one or more host threads managed by the operating system running on the host (the details of which are defined outside of OpenCL). There may be multiple devices in a single context which all have access to memory objects defined by OpenCL. On a single device, multiple work-groups may execute in parallel with potentially overlapping updates to memory. Finally, within a single work-group, multiple work-items concurrently execute, once again with potentially overlapping updates to memory.
The memory model must precisely define how the values in memory as seen from each of these units of execution interact so a programmer can reason about the correctness of OpenCL programs. We define the memory model in four parts.
-
Memory regions: The distinct memories visible to the host and the devices that share a context.
-
Memory objects: The objects defined by the OpenCL API and their management by the host and devices.
-
Shared Virtual Memory: A virtual address space exposed to both the host and the devices within a context. Note: SVM is missing before version 2.0.
-
Consistency Model: Rules that define which values are observed when multiple units of execution load data from memory plus the atomic/fence operations that constrain the order of memory operations and define synchronization relationships.
3.3.1. Fundamental Memory Regions
Memory in OpenCL is divided into two parts.
-
Host Memory: The memory directly available to the host. The detailed behavior of host memory is defined outside of OpenCL. Memory objects move between the Host and the devices through functions within the OpenCL API or through a shared virtual memory interface.
-
Device Memory: Memory directly available to kernels executing on OpenCL devices.
Device memory consists of four named address spaces or memory regions:
-
Global Memory: This memory region permits read/write access to all work-items in all work-groups running on any device within a context. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.
-
Constant Memory: A region of global memory that remains constant during the execution of a kernel-instance. The host allocates and initializes memory objects placed into constant memory.
-
Local Memory: A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group.
-
Private Memory: A region of memory private to a work-item. Variables defined in one work-items private memory are not visible to another work-item.
The memory regions and their relationship to the OpenCL Platform model are summarized below. Local and private memories are always associated with a particular device. The global and constant memories, however, are shared between all devices within a given context. An OpenCL device may include a cache to support efficient access to these shared memories.
To understand memory in OpenCL, it is important to appreciate the relationships between these named address spaces. The four named address spaces available to a device are disjoint meaning they do not overlap. This is a logical relationship, however, and an implementation may choose to let these disjoint named address spaces share physical memory.
Programmers often need functions callable from kernels where the pointers manipulated by those functions can point to multiple named address spaces. This saves a programmer from the error-prone and wasteful practice of creating multiple copies of functions; one for each named address space. Therefore the global, local and private address spaces belong to a single generic address space. This is closely modeled after the concept of a generic address space used in the embedded C standard (ISO/IEC 9899:1999). Since they all belong to a single generic address space, the following properties are supported for pointers to named address spaces in device memory:
-
A pointer to the generic address space can be cast to a pointer to a global, local or private address space
-
A pointer to a global, local or private address space can be cast to a pointer to the generic address space.
-
A pointer to a global, local or private address space can be implicitly converted to a pointer to the generic address space, but the converse is not allowed.
The constant address space is disjoint from the generic address space.
The generic address space is missing before version 2.0. |
The addresses of memory associated with memory objects in Global memory are not preserved between kernel instances, between a device and the host, and between devices. In this regard global memory acts as a global pool of memory objects rather than an address space. This restriction is relaxed when shared virtual memory (SVM) is used.
Shared virtual memory is missing before version 2.0. |
SVM causes addresses to be meaningful between the host and all of the devices within a context hence supporting the use of pointer based data structures in OpenCL kernels. It logically extends a portion of the global memory into the host address space giving work-items access to the host address space. On platforms with hardware support for a shared address space between the host and one or more devices, SVM may also provide a more efficient way to share data between devices and the host. Details about SVM are presented in Shared Virtual Memory.
A programmer may use the features of the memory consistency model to manage safe access to global memory from multiple work-items potentially running on one or more devices. In addition, when using shared virtual memory (SVM), the memory consistency model may also be used to ensure that host threads safely access memory locations in the shared memory region.
3.3.2. Memory Objects
The contents of global memory are memory objects. A memory object is a handle to a reference counted region of global memory. Memory objects use the OpenCL type cl_mem and fall into three distinct classes.
-
Buffer: A memory object stored as a block of contiguous memory and used as a general purpose object to hold data used in an OpenCL program. The types of the values within a buffer may be any of the built in types (such as int, float), vector types, or user-defined structures. The buffer can be manipulated through pointers much as one would with any block of memory in C.
-
Image: An image memory object holds one, two or three dimensional images. The formats are based on the standard image formats used in graphics applications. An image is an opaque data structure managed by functions defined in the OpenCL API. To optimize the manipulation of images stored in the texture memories found in many GPUs, OpenCL kernels have traditionally been disallowed from both reading and writing a single image. In OpenCL 2.0, however, we have relaxed this restriction by providing synchronization and fence operations that let programmers properly synchronize their code to safely allow a kernel to read and write a single image.
-
Pipe: The pipe memory object conceptually is an ordered sequence of data items. A pipe has two endpoints: a write endpoint into which data items are inserted, and a read endpoint from which data items are removed. At any one time, only one kernel instance may write into a pipe, and only one kernel instance may read from a pipe. To support the producer consumer design pattern, one kernel instance connects to the write endpoint (the producer) while another kernel instance connects to the reading endpoint (the consumer). Note: The pipe memory object is missing before version 2.0.
Memory objects are allocated by host APIs.
The host program can provide the runtime with a pointer to a block of
continuous memory to hold the memory object when the object is created
(CL_MEM_
).
Alternatively, the physical memory can be managed by the OpenCL runtime and
not be directly accessible to the host program.
Allocation and access to memory objects within the different memory regions varies between the host and work-items running on a device. This is summarized in the Memory Regions table, which describes whether the kernel or the host can allocate from a memory region, the type of allocation (static at compile time vs. dynamic at runtime) and the type of access allowed (i.e. whether the kernel or the host can read and/or write to a memory region).
Global | Constant | Local | Private | ||
---|---|---|---|---|---|
Host |
Allocation |
Dynamic |
Dynamic |
Dynamic |
None |
Access |
Read/Write to Buffers and Images, but not Pipes |
Read/Write |
None |
None |
|
Kernel |
Allocation |
Static (program scope variables) |
Static (program scope variables) |
Static for parent kernel, Dynamic for child kernels |
Static |
Access |
Read/Write |
Read-only |
Read/Write, No access to child kernel memory |
Read/Write |
The Memory Regions table shows the different memory regions in OpenCL and how memory objects are allocated and accessed by the host and by an executing instance of a kernel. For kernels, we distinguish between the behavior of local memory for a parent kernel and its child kernels.
Once allocated, a memory object is made available to kernel-instances running on one or more devices. In addition to Shared Virtual Memory, there are three basic ways to manage the contents of buffers between the host and devices.
-
Read/Write/Fill commands: The data associated with a memory object is explicitly read and written between the host and global memory regions using commands enqueued to an OpenCL command queue. Note: Fill commands are missing before version 1.2.
-
Map/Unmap commands: Data from the memory object is mapped into a contiguous block of memory accessed through a host accessible pointer. The host program enqueues a map command on block of a memory object before it can be safely manipulated by the host program. When the host program is finished working with the block of memory, the host program enqueues an unmap command to allow a kernel-instance to safely read and/or write the buffer.
-
Copy commands: The data associated with a memory object is copied between two buffers, each of which may reside either on the host or on the device.
With Read/Write/Map, the commands can be blocking or non-blocking operations. The OpenCL function call for a blocking memory transfer returns once the command (memory transfer) has completed. At this point the associated memory resources on the host can be safely reused, and following operations on the host are guaranteed that the transfer has already completed. For a non-blocking memory transfer, the OpenCL function call returns as soon as the command is enqueued.
Memory objects are bound to a context and hence can appear in multiple kernel-instances running on more than one physical device. The OpenCL platform must support a large range of hardware platforms including systems that do not support a single shared address space in hardware; hence the ways memory objects can be shared between kernel-instances is restricted. The basic principle is that multiple read operations on memory objects from multiple kernel-instances that overlap in time are allowed, but mixing overlapping reads and writes into the same memory objects from different kernel instances is only allowed when fine grained synchronization is used with Shared Virtual Memory.
When global memory is manipulated by multiple kernel-instances running on multiple devices, the OpenCL runtime system must manage the association of memory objects with a given device. In most cases the OpenCL runtime will implicitly associate a memory object with a device. A kernel instance is naturally associated with the command queue to which the kernel was submitted. Since a command-queue can only access a single device, the queue uniquely defines which device is involved with any given kernel-instance; hence defining a clear association between memory objects, kernel-instances and devices. Programmers may anticipate these associations in their programs and explicitly manage association of memory objects with devices in order to improve performance.
3.3.3. Shared Virtual Memory
Shared virtual memory is missing before version 2.0. |
OpenCL extends the global memory region into the host memory region through a shared virtual memory (SVM) mechanism. There are three types of SVM in OpenCL
-
Coarse-Grained buffer SVM: Sharing occurs at the granularity of regions of OpenCL buffer memory objects. Consistency is enforced at synchronization points and with map/unmap commands to drive updates between the host and the device. This form of SVM is similar to non-SVM use of memory; however, it lets kernel-instances share pointer-based data structures (such as linked-lists) with the host program. Program scope global variables are treated as per-device coarse-grained SVM for addressing and sharing purposes.
-
Fine-Grained buffer SVM: Sharing occurs at the granularity of individual loads/stores into bytes within OpenCL buffer memory objects. Loads and stores may be cached. This means consistency is guaranteed at synchronization points. If the optional OpenCL atomics are supported, they can be used to provide fine-grained control of memory consistency.
-
Fine-Grained system SVM: Sharing occurs at the granularity of individual loads/stores into bytes occurring anywhere within the host memory. Loads and stores may be cached so consistency is guaranteed at synchronization points. If the optional OpenCL atomics are supported, they can be used to provide fine-grained control of memory consistency.
Granularity of sharing | Memory Allocation | Mechanisms to enforce Consistency | Explicit updates between host and device | |
---|---|---|---|---|
Non-SVM buffers |
OpenCL Memory objects(buffer) |
Host synchronization points on the same or between devices. |
yes, through Map and Unmap commands. |
|
Coarse-Grained buffer SVM |
OpenCL Memory objects (buffer) |
Host synchronization points between devices |
yes, through Map and Unmap commands. |
|
Fine-Grained buffer SVM |
Bytes within OpenCL Memory objects (buffer) |
Synchronization points plus atomics (if supported) |
No |
|
Fine-Grained system SVM |
Bytes within Host memory (system) |
Host memory allocation mechanisms (e.g. malloc) |
Synchronization points plus atomics (if supported) |
No |
Coarse-Grained buffer SVM is required in the core OpenCL specification. The two finer grained approaches are optional features in OpenCL. The various SVM mechanisms to access host memory from the work-items associated with a kernel instance are summarized above.
3.3.4. Memory Consistency Model for OpenCL 1.x
This memory consistency model is deprecated by version 2.0. |
OpenCL 1.x uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times.
Within a work-item memory has load / store consistency. Local memory is consistent across work-items in a single work-group at a work-group barrier. Global memory is consistent across work-items in a single work-group at a work-group barrier, but there are no guarantees of memory consistency between different work-groups executing a kernel.
Memory consistency for memory objects shared between enqueued commands is enforced at a synchronization point.
3.3.5. Memory Consistency Model for OpenCL 2.x
This memory consistency model is missing before version 2.0. |
The OpenCL 2.x memory model tells programmers what they can expect from an OpenCL 2.x implementation; which memory operations are guaranteed to happen in which order and which memory values each read operation will return. The memory model tells compiler writers which restrictions they must follow when implementing compiler optimizations; which variables they can cache in registers and when they can move reads or writes around a barrier or atomic operation. The memory model also tells hardware designers about limitations on hardware optimizations; for example, when they must flush or invalidate hardware caches.
The memory consistency model in OpenCL 2.x is based on the memory model from the ISO C11 programming language. To help make the presentation more precise and self-contained, we include modified paragraphs taken verbatim from the ISO C11 international standard. When a paragraph is taken or modified from the C11 standard, it is identified as such along with its original location in the C11 standard.
For programmers, the most intuitive model is the sequential consistency memory model. Sequential consistency interleaves the steps executed by each of the units of execution. Each access to a memory location sees the last assignment to that location in that interleaving. While sequential consistency is relatively straightforward for a programmer to reason about, implementing sequential consistency is expensive. Therefore, OpenCL 2.x implements a relaxed memory consistency model; i.e. it is possible to write programs where the loads from memory violate sequential consistency. Fortunately, if a program does not contain any races and if the program only uses atomic operations that utilize the sequentially consistent memory order (the default memory ordering for OpenCL 2.x), OpenCL programs appear to execute with sequential consistency.
Programmers can to some degree control how the memory model is relaxed by choosing the memory order for synchronization operations. The precise semantics of synchronization and the memory orders are formally defined in Memory Ordering Rules. Here, we give a high level description of how these memory orders apply to atomic operations on atomic objects shared between units of execution. OpenCL 2.x memory_order choices are based on those from the ISO C11 standard memory model. They are specified in certain OpenCL functions through the following enumeration constants:
-
memory_order_relaxed: implies no order constraints. This memory order can be used safely to increment counters that are concurrently incremented, but it doesn’t guarantee anything about the ordering with respect to operations to other memory locations. It can also be used, for example, to do ticket allocation and by expert programmers implementing lock-free algorithms.
-
memory_order_acquire: A synchronization operation (fence or atomic) that has acquire semantics "acquires" side-effects from a release operation that synchronises with it: if an acquire synchronises with a release, the acquiring unit of execution will see all side-effects preceding that release (and possibly subsequent side-effects.) As part of carefully-designed protocols, programmers can use an "acquire" to safely observe the work of another unit of execution.
-
memory_order_release: A synchronization operation (fence or atomic operation) that has release semantics "releases" side effects to an acquire operation that synchronises with it. All side effects that precede the release are included in the release. As part of carefully-designed protocols, programmers can use a "release" to make changes made in one unit of execution visible to other units of execution.
In general, no acquire must always synchronise with any particular release. However, synchronisation can be forced by certain executions. See the description of Fence Operations for detailed rules for when synchronisation must occur. |
-
memory_order_acq_rel: A synchronization operation with acquire-release semantics has the properties of both the acquire and release memory orders. It is typically used to order read-modify-write operations.
-
memory_order_seq_cst: The loads and stores of each unit of execution appear to execute in program (i.e., sequenced-before) order, and the loads and stores from different units of execution appear to be simply interleaved.
Regardless of which memory_order is specified, resolving constraints on memory operations across a heterogeneous platform adds considerable overhead to the execution of a program. An OpenCL platform may be able to optimize certain operations that depend on the features of the memory consistency model by restricting the scope of the memory operations. Distinct memory scopes are defined by the values of the memory_scope enumeration constant:
-
memory_scope_work_item: memory-ordering constraints only apply within the work-item [1].
-
memory_scope_sub_group: memory-ordering constraints only apply within the sub-group.
-
memory_scope_work_group: memory-ordering constraints only apply to work-items executing within a single work-group.
-
memory_scope_device: memory-ordering constraints only apply to work-items executing on a single device
-
memory_scope_all_svm_devices: memory-ordering constraints apply to work-items executing across multiple devices and (when using SVM) the host. A release performed with memory_scope_all_svm_devices to a buffer that does not have the
CL_MEM_
flag set will commit to at least memory_scope_device visibility, with full synchronization of the buffer at a queue synchronization point (e.g. an OpenCL event).SVM_ ATOMICS -
memory_scope_all_devices: an alias for memory_scope_all_svm_devices.
These memory scopes define a hierarchy of visibilities when analyzing the ordering constraints of memory operations. For example if a programmer knows that a sequence of memory operations will only be associated with a collection of work-items from a single work-group (and hence will run on a single device), the implementation is spared the overhead of managing the memory orders across other devices within the same context. This can substantially reduce overhead in a program. All memory scopes are valid when used on global memory or local memory. For local memory, all visibility is constrained to within a given work-group and scopes wider than memory_scope_work_group carry no additional meaning.
In the following subsections (leading up to OpenCL Framework), we will explain the synchronization constructs and detailed rules needed to use OpenCL’s 2.x relaxed memory models. It is important to appreciate, however, that many programs do not benefit from relaxed memory models. Even expert programmers have a difficult time using atomics and fences to write correct programs with relaxed memory models. A large number of OpenCL programs can be written using a simplified memory model. This is accomplished by following these guidelines.
-
Write programs that manage safe sharing of global memory objects through the synchronization points defined by the command queues.
-
Restrict low level synchronization inside work-groups to the work-group functions such as barrier.
-
If you want sequential consistency behavior with system allocations or fine-grain SVM buffers with atomics support, use only memory_order_seq_cst operations with the scope memory_scope_all_svm_devices.
-
If you want sequential consistency behavior when not using system allocations or fine-grain SVM buffers with atomics support, use only memory_order_seq_cst operations with the scope memory_scope_device or memory_scope_all_svm_devices.
-
Ensure your program has no races.
If these guidelines are followed in your OpenCL programs, you can skip the detailed rules behind the relaxed memory models and go directly to OpenCL Framework.
3.3.6. Overview of atomic and fence operations
OpenCL 2.x has a number of synchronization operations that are used to define memory order constraints in a program. They play a special role in controlling how memory operations in one unit of execution (such as work-items or, when using SVM a host thread) are made visible to another. There are two types of synchronization operations in OpenCL; atomic operations and fences.
Atomic operations are indivisible. They either occur completely or not at all. These operations are used to order memory operations between units of execution and hence they are parameterized with the memory_order and memory_scope parameters defined by the OpenCL memory consistency model. The atomic operations for OpenCL kernel languages are similar to the corresponding operations defined by the C11 standard.
The OpenCL 2.x atomic operations apply to variables of an atomic type (a subset of those in the C11 standard) including atomic versions of the int, uint, long, ulong, float, double, half, intptr_t, uintptr_t, size_t, and ptrdiff_t types. However, support for some of these atomic types depends on support for the corresponding regular types.
An atomic operation on one or more memory locations is either an acquire operation, a release operation, or both an acquire and release operation. An atomic operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which do not have synchronization properties, and atomic read-modify-write operations, which have special characteristics. [C11 standard, Section 5.1.2.4, paragraph 5, modified.]
The orders memory_order_acquire (used for reads), memory_order_release (used for writes), and memory_order_acq_rel (used for read-modify-write operations) are used for simple communication between units of execution using shared variables. Informally, executing a memory_order_release on an atomic object A makes all previous side effects visible to any unit of execution that later executes a memory_order_acquire on A. The orders memory_order_acquire, memory_order_release, and memory_order_acq_rel do not provide sequential consistency for race-free programs because they will not ensure that atomic stores followed by atomic loads become visible to other threads in that order.
The fence operation is atomic_work_item_fence, which includes a memory_order argument as well as the memory_scope and cl_mem_fence_flags arguments. Depending on the memory_order argument, this operation:
-
has no effects, if memory_order_relaxed;
-
is an acquire fence, if memory_order_acquire;
-
is a release fence, if memory_order_release;
-
is both an acquire fence and a release fence, if memory_order_acq_rel;
-
is a sequentially-consistent fence with both acquire and release semantics, if memory_order_seq_cst.
If specified, the cl_mem_fence_flags argument must be CLK_IMAGE_MEM_FENCE
,
CLK_GLOBAL_MEM_FENCE
, CLK_LOCAL_MEM_FENCE
, or CLK_GLOBAL_MEM_FENCE |
CLK_LOCAL_MEM_FENCE
.
The atomic_work_item_fence(CLK_IMAGE_MEM_FENCE, …)
built-in function must be
used to make sure that sampler-less writes are visible to later reads by the
same work-item.
Without use of the atomic_work_item_fence function, write-read coherence on
image objects is not guaranteed: if a work-item reads from an image to which
it has previously written without an intervening atomic_work_item_fence, it
is not guaranteed that those previous writes are visible to the work-item.
The synchronization operations in OpenCL 2.x can be parameterized by a memory_scope. Memory scopes control the extent that an atomic operation or fence is visible with respect to the memory model. These memory scopes may be used when performing atomic operations and fences on global memory and local memory. When used on global memory visibility is bounded by the capabilities of that memory. When used on a fine-grained non-atomic SVM buffer, a coarse-grained SVM buffer, or a non-SVM buffer, operations parameterized with memory_scope_all_svm_devices will behave as if they were parameterized with memory_scope_device. When used on local memory, visibility is bounded by the work-group and, as a result, memory_scope with wider visibility than memory_scope_work_group will be reduced to memory_scope_work_group.
Two actions A and B are defined to have an inclusive scope if they have the same scope P such that:
-
P is memory_scope_sub_group and A and B are executed by work-items within the same sub-group.
-
P is memory_scope_work_group and A and B are executed by work-items within the same work-group.
-
P is memory_scope_device and A and B are executed by work-items on the same device when A and B apply to an SVM allocation or A and B are executed by work-items in the same kernel or one of its children when A and B apply to a
cl_mem
buffer. -
P is memory_scope_all_svm_devices if A and B are executed by host threads or by work-items on one or more devices that can share SVM memory with each other and the host process.
3.3.7. Memory Ordering Rules
Fundamentally, the issue in a memory model is to understand the orderings in time of modifications to objects in memory. Modifying an object or calling a function that modifies an object are side effects, i.e. changes in the state of the execution environment. Evaluation of an expression in general includes both value computations and initiation of side effects. Value computation for an lvalue expression includes determining the identity of the designated object. [C11 standard, Section 5.1.2.3, paragraph 2, modified.]
We assume that the OpenCL kernel language and host programming languages have a sequenced-before relation between the evaluations executed by a single unit of execution. This sequenced-before relation is an asymmetric, transitive, pair-wise relation between those evaluations, which induces a partial order among them. Given any two evaluations A and B, if A is sequenced-before B, then the execution of A shall precede the execution of B. (Conversely, if A is sequenced-before B, then B is sequenced-after A.) If A is not sequenced-before or sequenced-after B, then A and B are unsequenced. Evaluations A and B are indeterminately sequenced when A is either sequenced-before or sequenced-after B, but it is unspecified which. [C11 standard, Section 5.1.2.3, paragraph 3, modified.]
Sequenced-before is a partial order of the operations executed by a single unit of execution (e.g. a host thread or work-item). It generally corresponds to the source program order of those operations, and is partial because of the undefined argument evaluation order of the OpenCL C kernel language. |
In an OpenCL kernel language, the value of an object visible to a work-item W at a particular point is the initial value of the object, a value stored in the object by W, or a value stored in the object by another work-item or host thread, according to the rules below. Depending on details of the host programming language, the value of an object visible to a host thread may also be the value stored in that object by another work-item or host thread. [C11 standard, Section 5.1.2.4, paragraph 2, modified.]
Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location. [C11 standard, Section 5.1.2.4, paragraph 4.]
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens-before B, then A shall precede B in the modification order of M, which is defined below. Note that the modification order of an atomic object M is independent of whether M is in local or global memory. [C11 standard, Section 5.1.2.4, paragraph 7, modified.]
A release sequence begins with a release operation A on an atomic object M and is the maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is A and every subsequent operation either is performed by the same work-item or host thread that performed the release or is an atomic read-modify-write operation. [C11 standard, Section 5.1.2.4, paragraph 10, modified.]
OpenCL’s local and global memories are disjoint. Kernels may access both kinds of memory while host threads may only access global memory. Furthermore, the flags argument of OpenCL’s work_group_barrier function specifies which memory operations the function will make visible: these memory operations can be, for example, just the ones to local memory, or the ones to global memory, or both. Since the visibility of memory operations can be specified for local memory separately from global memory, we define two related but independent relations, global-synchronizes-with and local-synchronizes-with. Certain operations on global memory may global-synchronize-with other operations performed by another work-item or host thread. An example is a release atomic operation in one work- item that global-synchronizes-with an acquire atomic operation in a second work-item. Similarly, certain atomic operations on local objects in kernels can local-synchronize- with other atomic operations on those local objects. [C11 standard, Section 5.1.2.4, paragraph 11, modified.]
We define two separate happens-before relations: global-happens-before and local-happens-before.
A global memory action A global-happens-before a global memory action B if
-
A is sequenced before B, or
-
A global-synchronizes-with B, or
-
For some global memory action C, A global-happens-before C and C global-happens-before B.
A local memory action A local-happens-before a local memory action B if
-
A is sequenced before B, or
-
A local-synchronizes-with B, or
-
For some local memory action C, A local-happens-before C and C local-happens-before B.
An OpenCL 2.x implementation shall ensure that no program execution demonstrates a cycle in either the local-happens-before relation or the global-happens-before relation.
The global- and local-happens-before relations are critical to defining what values are read and when data races occur. The global-happens-before relation, for example, defines what global memory operations definitely happen before what other global memory operations. If an operation A global-happens-before operation B then A must occur before B; in particular, any write done by A will be visible to B. The local-happens-before relation has similar properties for local memory. Programmers can use the local- and global-happens-before relations to reason about the order of program actions. |
A visible side effect A on a global object M with respect to a value computation B of M satisfies the conditions:
-
A global-happens-before B, and
-
there is no other side effect X to M such that A global-happens-before X and X global-happens-before B.
We define visible side effects for local objects M similarly. The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A. [C11 standard, Section 5.1.2.4, paragraph 19, modified.]
The execution of a program contains a data race if it contains two conflicting actions A and B in different units of execution, and
-
(1) at least one of A or B is not atomic, or A and B do not have inclusive memory scope, and
-
(2) the actions are global actions unordered by the global-happens-before relation or are local actions unordered by the local-happens-before relation.
Any such data race results in undefined behavior. [C11 standard, Section 5.1.2.4, paragraph 25, modified.]
We also define the visible sequence of side effects on local and global atomic objects. The remaining paragraphs of this subsection define this sequence for a global atomic object M; the visible sequence of side effects for a local atomic object is defined similarly by using the local-happens-before relation.
The visible sequence of side effects on a global atomic object M, with respect to a value computation B of M, is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first side effect is visible with respect to B, and for every side effect, it is not the case that B global-happens-before it. The value of M, as determined by evaluation B, shall be the value stored by some operation in the visible sequence of M with respect to B. [C11 standard, Section 5.1.2.4, paragraph 22, modified.]
If an operation A that modifies an atomic object M global-happens before an operation B that modifies M, then A shall be earlier than B in the modification order of M. This requirement is known as write-write coherence.
If a value computation A of an atomic object M global-happens-before a value computation B of M, and A takes its value from a side effect X on M, then the value computed by B shall either equal the value stored by X, or be the value stored by a side effect Y on M, where Y follows X in the modification order of M. This requirement is known as read-read coherence. [C11 standard, Section 5.1.2.4, paragraph 22, modified.]
If a value computation A of an atomic object M global-happens-before an operation B on M, then A shall take its value from a side effect X on M, where X precedes B in the modification order of M. This requirement is known as read-write coherence.
If a side effect X on an atomic object M global-happens-before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M. This requirement is known as write-read coherence.
3.3.7.1. Atomic Operations
This and following sections describe how different program actions in kernel C code and the host program contribute to the local- and global-happens-before relations. This section discusses ordering rules for OpenCL 2.x atomic operations.
Device-side enqueue defines the enumerated type memory_order.
-
For memory_order_relaxed, no operation orders memory.
-
For memory_order_release, memory_order_acq_rel, and memory_order_seq_cst, a store operation performs a release operation on the affected memory location.
-
For memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst, a load operation performs an acquire operation on the affected memory location. [C11 standard, Section 7.17.3, paragraphs 2-4, modified.]
Certain built-in functions synchronize with other built-in functions performed by another unit of execution. This is true for pairs of release and acquire operations under specific circumstances. An atomic operation A that performs a release operation on a global object M global-synchronizes-with an atomic operation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A. A similar rule holds for atomic operations on objects in local memory: an atomic operation A that performs a release operation on a local object M local-synchronizes-with an atomic operation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A. [C11 standard, Section 5.1.2.4, paragraph 11, modified.]
Atomic operations specifying memory_order_relaxed are relaxed only with respect to memory ordering. Implementations must still guarantee that any given atomic access to a particular atomic object be indivisible with respect to all other atomic accesses to that object. |
There shall exist a single total order S for all memory_order_seq_cst operations that is consistent with the modification orders for all affected locations, as well as the appropriate global-happens-before and local-happens-before orders for those locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M in global or local memory observes one of the following values:
-
the result of the last modification A of M that precedes B in S, if it exists, or
-
if A exists, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst and that does not happen before A, or
-
if A does not exist, the result of some modification of M in the visible sequence of side effects with respect to B that is not memory_order_seq_cst. [C11 standard, Section 7.17.3, paragraph 6, modified.]
Let X and Y be two memory_order_seq_cst operations. If X local-synchronizes-with or global-synchronizes-with Y then X both local-synchronizes-with Y and global-synchronizes-with Y.
If the total order S exists, the following rules hold:
-
For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced-before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order. [C11 standard, Section 7.17.3, paragraph 9.]
-
For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced-before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order. [C11 standard, Section 7.17.3, paragraph 10.]
-
For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order. [C11 standard, Section 7.17.3, paragraph 11.]
-
For atomic operations A and B on an atomic object M, if there are memory_order_seq_cst fences X and Y such that A is sequenced-before X, Y is sequenced-before B, and X precedes Y in S, then B occurs later than A in the modification order of M.
memory_order_seq_cst ensures sequential consistency only for a program that is (1) free of data races, and (2) exclusively uses memory_order_seq_cst synchronization operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. |
Atomic read-modify-write operations should always read the last value (in the modification order) stored before the write associated with the read-modify-write operation. [C11 standard, Section 7.17.3, paragraph 12.]
Implementations should ensure that no "out-of-thin-air" values are computed that circularly depend on their own computation.
Note: Under the rules described above, and independent to the previously footnoted C++ issue, it is known that x == y == 42 is a valid final state in the following problematic example:
global atomic_int x = ATOMIC_VAR_INIT(0);
local atomic_int y = ATOMIC_VAR_INIT(0);
unit_of_execution_1:
... [execution not reading or writing x or y, leading up to:]
int t = atomic_load_explicit(&y, memory_order_acquire);
atomic_store_explicit(&x, t, memory_order_release);
unit_of_execution_2:
... [execution not reading or writing x or y, leading up to:]
int t = atomic_load_explicit(&x, memory_order_acquire);
atomic_store_explicit(&y, t, memory_order_release);
This is not useful behavior and implementations should not exploit this phenomenon. It should be expected that in the future this may be disallowed by appropriate updates to the memory model description by the OpenCL committee.
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time. [C11 standard, Section 7.17.3, paragraph 16.]
As long as the following conditions are met, a host program sharing SVM memory with a kernel executing on one or more OpenCL 2.x devices may use atomic and synchronization operations to ensure that its assignments, and those of the kernel, are visible to each other:
-
Either fine-grained buffer or fine-grained system SVM must be used to share memory. While coarse-grained buffer SVM allocations may support atomic operations, visibility on these allocations is not guaranteed except at map and unmap operations.
-
The optional OpenCL 2.x SVM atomic-controlled visibility specified by provision of the
CL_MEM_
flag must be supported by the device and the flag provided to the SVM buffer on allocation.SVM_ ATOMICS -
The host atomic and synchronization operations must be compatible with those of an OpenCL kernel language. This requires that the size and representation of the data types that the host atomic operations act on be consistent with the OpenCL kernel language atomic types.
If these conditions are met, the host operations will apply at all_svm_devices scope.
3.3.7.2. Fence Operations
This section describes how the OpenCL 2.x fence operations contribute to the local- and global-happens-before relations.
Earlier, we introduced synchronization primitives called fences. Fences can utilize the acquire memory_order, release memory_order, or both. A fence with acquire semantics is called an acquire fence; a fence with release semantics is called a release fence. The overview of atomic and fence operations section describes the memory orders that result in acquire and release fences.
A global release fence A global-synchronizes-with a global acquire fence B if there exist atomic operations X and Y, both operating on some global atomic object M, such that A is sequenced-before X, X modifies M, Y is sequenced-before B, Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation, and that the scopes of A, B are inclusive. [C11 standard, Section 7.17.4, paragraph 2, modified.]
A global release fence A global-synchronizes-with an atomic operation B that performs an acquire operation on a global atomic object M if there exists an atomic operation X such that A is sequenced-before X, X modifies M, B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation, and the scopes of A and B are inclusive. [C11 standard, Section 7.17.4, paragraph 3, modified.]
An atomic operation A that is a release operation on a global atomic object M global-synchronizes-with a global acquire fence B if there exists some atomic operation X on M such that X is sequenced-before B and reads the value written by A or a value written by any side effect in the release sequence headed by A, and the scopes of A and B are inclusive. [C11 standard, Section 7.17.4, paragraph 4, modified.]
A local release fence A local-synchronizes-with a local acquire fence B if there exist atomic operations X and Y, both operating on some local atomic object M, such that A is sequenced-before X, X modifies M, Y is sequenced-before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation, and the scopes of A and B are inclusive. [C11 standard, Section 7.17.4, paragraph 2, modified.]
A local release fence A local-synchronizes-with an atomic operation B that performs an acquire operation on a local atomic object M if there exists an atomic operation X such that A is sequenced-before X, X modifies M, and B reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation, and the scopes of A and B are inclusive. [C11 standard, Section 7.17.4, paragraph 3, modified.]
An atomic operation A that is a release operation on a local atomic object M local-synchronizes-with a local acquire fence B if there exists some atomic operation X on M such that X is sequenced-before B and reads the value written by A or a value written by any side effect in the release sequence headed by A, and the scopes of A and B are inclusive. [C11 standard, Section 7.17.4, paragraph 4, modified.]
Let X and Y be two work-item fences that each have both the
CLK_GLOBAL_MEM_FENCE
and CLK_LOCAL_MEM_FENCE
flags set.
X global-synchronizes-with Y and X local synchronizes with Y if the
conditions required for X to global-synchronize with Y are met, the
conditions required for X to local-synchronize-with Y are met, or both
sets of conditions are met.
3.3.7.3. Work-group Functions
The OpenCL kernel execution model includes collective operations across the work-items within a single work-group. These are called work-group functions, and include functions such as barriers, scans, reductions, and broadcasts. We will first discuss the work-group barrier function. Other work-group functions are discussed afterwards.
The barrier function provides a mechanism for a kernel to synchronize the work-items within a single work-group: informally, each work-item of the work-group must execute the barrier before any are allowed to proceed. It also orders memory operations to a specified combination of one or more address spaces such as local memory or global memory, in a similar manner to a fence.
To precisely specify the memory ordering semantics for barrier, we need to distinguish between a dynamic and a static instance of the call to a barrier. A call to a barrier can appear in a loop, for example, and each execution of the same static barrier call results in a new dynamic instance of the barrier that will independently synchronize a work-groups work-items.
A work-item executing a dynamic instance of a barrier results in two operations, both fences, that are called the entry and exit fences. These fences obey all the rules for fences specified elsewhere in this chapter as well as the following:
-
The entry fence is a release fence with the same flags and scope as requested for the barrier.
-
The exit fence is an acquire fence with the same flags and scope as requested for the barrier.
-
For each work-item the entry fence is sequenced before the exit fence.
-
If the flags have
CLK_GLOBAL_MEM_FENCE
set then for each work-item the entry fence global-synchronizes-with the exit fence of all other work-items in the same work-group. -
If the flags have
CLK_LOCAL_MEM_FENCE
set then for each work-item the entry fence local-synchronizes-with the exit fence of all other work-items in the same work-group.
Other work-group functions include such functions as scans, reductions, and broadcasts, and are described in the kernel language and IL specifications. The use of these work-group functions implies sequenced-before relationships between statements within the execution of a single work-item in order to satisfy data dependencies. For example, a work-item that provides a value to a work-group function must behave as if it generates that value before beginning execution of that work-group function. Furthermore, the programmer must ensure that all work-items in a work-group must execute the same work-group function call site, or dynamic work-group function instance.
3.3.7.4. Sub-group Functions
Sub-group functions are missing before version 2.1. Also see extension cl_khr_subgroups. |
The OpenCL kernel execution model includes collective operations across the work-items within a single sub-group. These are called sub-group functions. We will first discuss the sub-group barrier. Other sub-group functions are discussed afterwards.
The barrier function provides a mechanism for a kernel to synchronize the work-items within a single sub-group: informally, each work-item of the sub-group must execute the barrier before any are allowed to proceed. It also orders memory operations to a specified combination of one or more address spaces such as local memory or global memory, in a similar manner to a fence.
To precisely specify the memory ordering semantics for barrier, we need to distinguish between a dynamic and a static instance of the call to a barrier. A call to a barrier can appear in a loop, for example, and each execution of the same static barrier call results in a new dynamic instance of the barrier that will independently synchronize a sub-groups work-items.
A work-item executing a dynamic instance of a barrier results in two operations, both fences, that are called the entry and exit fences. These fences obey all the rules for fences specified elsewhere in this chapter as well as the following:
-
The entry fence is a release fence with the same flags and scope as requested for the barrier.
-
The exit fence is an acquire fence with the same flags and scope as requested for the barrier.
-
For each work-item the entry fence is sequenced before the exit fence.
-
If the flags have
CLK_GLOBAL_MEM_FENCE
set then for each work-item the entry fence global-synchronizes-with the exit fence of all other work-items in the same sub-group. -
If the flags have
CLK_LOCAL_MEM_FENCE
set then for each work-item the entry fence local-synchronizes-with the exit fence of all other work-items in the same sub-group.
Other sub-group functions include such functions as scans, reductions, and broadcasts, and are described in the kernel languages and IL specifications. The use of these sub-group functions implies sequenced-before relationships between statements within the execution of a single work-item in order to satisfy data dependencies. For example, a work-item that provides a value to a sub-group function must behave as if it generates that value before beginning execution of that sub-group function. Furthermore, the programmer must ensure that all work-items in a sub-group must execute the same sub-group function call site, or dynamic sub-group function instance.
3.3.7.5. Host-side and Device-side Commands
This section describes how the OpenCL API functions associated with command-queues contribute to happens-before relations. There are two types of command queues and associated API functions in OpenCL 2.x; host command-queues and device command-queues. The interaction of these command queues with the memory model are for the most part equivalent. In a few cases, the rules only applies to the host command-queue. We will indicate these special cases by specifically denoting the host command-queue in the memory ordering rule. SVM memory consistency in such instances is implied only with respect to synchronizing host commands.
Memory ordering rules in this section apply to all memory objects (buffers, images and pipes) as well as to SVM allocations where no earlier, and more fine-grained, rules apply.
In the remainder of this section, we assume that each command C enqueued onto a command-queue has an associated event object E that signals its execution status, regardless of whether E was returned to the unit of execution that enqueued C. We also distinguish between the API function call that enqueues a command C and creates an event E, the execution of C, and the completion of C(which marks the event E as complete).
The ordering and synchronization rules for API commands are defined as following:
-
If an API function call X enqueues a command C, then X global-synchronizes-with C. For example, a host API function to enqueue a kernel global-synchronizes-with the start of that kernel-instances execution, so that memory updates sequenced-before the enqueue kernel function call will global-happen-before any kernel reads or writes to those same memory locations. For a device-side enqueue, global memory updates sequenced before X happens-before C reads or writes to those memory locations only in the case of fine-grained SVM.
-
If E is an event upon which a command C waits, then E global-synchronizes-with C. In particular, if C waits on an event E that is tracking the execution status of the command C1, then memory operations done by C1 will global-happen-before memory operations done by C. As an example, assume we have an OpenCL program using coarse-grain SVM sharing that enqueues a kernel to a host command-queue to manipulate the contents of a region of a buffer that the host thread then accesses after the kernel completes. To do this, the host thread can call clEnqueueMapBuffer to enqueue a blocking-mode map command to map that buffer region, specifying that the map command must wait on an event signaling the kernels completion. When clEnqueueMapBuffer returns, any memory operations performed by the kernel to that buffer region will global- happen-before subsequent memory operations made by the host thread.
-
If a command C has an event E that signals its completion, then C global- synchronizes-with E.
-
For a command C enqueued to a host-side command queue, if C has an event E that signals its completion, then E global-synchronizes-with an API call X that waits on E. For example, if a host thread or kernel-instance calls the wait-for-events function on E (e.g. the clWaitForEvents function called from a host thread), then E global-synchronizes-with that wait-for-events function call.
-
If commands C and C1 are enqueued in that sequence onto an in-order command-queue, then the event (including the event implied between C and C1 due to the in-order queue) signaling C's completion global-synchronizes-with C1. Note that in OpenCL 2.x, only a host command-queue can be configured as an in-order queue.
-
If an API call enqueues a marker command C with an empty list of events upon which C should wait, then the events of all commands enqueued prior to C in the command-queue global-synchronize-with C.
-
If a host API call enqueues a command-queue barrier command C with an empty list of events on which C should wait, then the events of all commands enqueued prior to C in the command-queue global-synchronize-with C. In addition, the event signaling the completion of C global-synchronizes-with all commands enqueued after C in the command-queue.
-
If a host thread executes a clFinish call X, then the events of all commands enqueued prior to X in the command-queue global-synchronizes-with X.
-
The start of a kernel-instance K global-synchronizes-with all operations in the work-items of K. Note that this includes the execution of any atomic operations by the work-items in a program using fine-grain SVM.
-
All operations of all work-items of a kernel-instance K global-synchronizes-with the event signaling the completion of K. Note that this also includes the execution of any atomic operations by the work-items in a program using fine-grain SVM.
-
If a callback procedure P is registered on an event E, then E global-synchronizes-with all operations of P. Note that callback procedures are only defined for commands within host command-queues.
-
If C is a command that waits for an event E's completion, and API function call X sets the status of a user event E's status to
CL_COMPLETE
(for example, from a host thread using a clSetUserEventStatus function), then X global-synchronizes-with C. -
If a device enqueues a command C with the
CLK_ENQUEUE_FLAGS_WAIT_KERNEL
flag, then the end state of the parent kernel instance global-synchronizes with C. -
If a work-group enqueues a command C with the
CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP
flag, then the end state of the work-group global-synchronizes with C.
When using an out-of-order command queue, a wait on an event or a marker or command-queue barrier command can be used to ensure the correct ordering of dependent commands. In those cases, the wait for the event or the marker or barrier command will provide the necessary global-synchronizes-with relation.
In this situation:
-
access to shared locations or disjoint locations in a single
cl_mem
object when using atomic operations from different kernel instances enqueued from the host such that one or more of the atomic operations is a write is implementation-defined and correct behavior is not guaranteed except at synchronization points. -
access to shared locations or disjoint locations in a single
cl_mem
object when using atomic operations from different kernel instances consisting of a parent kernel and any number of child kernels enqueued by that kernel is guaranteed under the memory ordering rules described earlier in this section. -
access to shared locations or disjoint locations in a single program scope global variable, coarse-grained SVM allocation or fine-grained SVM allocation when using atomic operations from different kernel instances enqueued from the host to a single device is guaranteed under the memory ordering rules described earlier in this section.
If fine-grain SVM is used but without support for the OpenCL 2.x atomic operations, then the host and devices can concurrently read the same memory locations and can concurrently update non-overlapping memory regions, but attempts to update the same memory locations are undefined. Memory consistency is guaranteed at the OpenCL synchronization points without the need for calls to clEnqueueMapBuffer and clEnqueueUnmapMemObject. For fine-grained SVM buffers it is guaranteed that at synchronization points only values written by the kernel will be updated. No writes to fine-grained SVM buffers can be introduced that were not in the original program.
In the remainder of this section, we discuss a few points regarding the ordering rules for commands with a host command queue.
In an OpenCL 1.x implementation a synchronization point is a kernel-instance or host program location where the contents of memory visible to different work-items or command-queue commands are the same. It also says that waiting on an event and a command-queue barrier are synchronization points between commands in command-queues. Four of the rules listed above (2, 4, 7, and 8) cover these OpenCL synchronization points. |
A map operation (clEnqueueMapBuffer or clEnqueueMapImage) performed on a non-SVM buffer or a coarse-grained SVM buffer is allowed to overwrite the entire target region with the latest runtime view of the data as seen by the command with which the map operation synchronizes, whether the values were written by the executing kernels or not. Any values that were changed within this region by another kernel or host thread while the kernel synchronizing with the map operation was executing may be overwritten by the map operation.
Access to non-SVM cl_mem
buffers and coarse-grained SVM allocations is
ordered at synchronization points between host commands.
In the presence of an out-of-order command queue or a set of command queues
mapped to the same device, multiple kernel instances may execute
concurrently on the same device.
3.4. The OpenCL Framework
The OpenCL framework allows applications to use a host and one or more OpenCL devices as a single heterogeneous parallel computer system. The framework contains the following components:
-
OpenCL Platform layer: The platform layer allows the host program to discover OpenCL devices and their capabilities and to create contexts.
-
OpenCL Runtime: The runtime allows the host program to manipulate contexts once they have been created.
-
OpenCL Compiler: The OpenCL compiler creates program executables that contain OpenCL kernels. The OpenCL compiler may build program executables from OpenCL C source strings, the SPIR-V intermediate language, or device-specific program binary objects, depending on the capabilities of a device. Other kernel languages or intermediate languages may be supported by some implementations.
3.4.1. Mixed Version Support
Mixed version support missing before version 1.1. |
OpenCL supports devices with different capabilities under a single platform. This includes devices which conform to different versions of the OpenCL specification. There are three version identifiers to consider for an OpenCL system: the platform version, the version of a device, and the version(s) of the kernel language or IL supported on a device.
The platform version indicates the version of the OpenCL runtime that is supported. This includes all of the APIs that the host can use to interact with resources exposed by the OpenCL runtime; including contexts, memory objects, devices, and command queues.
The device version is an indication of the device’s capabilities separate from the runtime and compiler as represented by the device info returned by clGetDeviceInfo. Examples of attributes associated with the device version are resource limits (e.g., minimum size of local memory per compute unit) and extended functionality (e.g., list of supported KHR extensions). The version returned corresponds to the highest version of the OpenCL specification for which the device is conformant, but is not higher than the platform version.
The language version for a device represents the OpenCL programming language features a developer can assume are supported on a given device. The version reported is the highest version of the language supported.
3.4.2. Backwards Compatibility
Backwards compatibility is an important goal for the OpenCL standard. Backwards compatibility is expected such that a device will consume earlier versions of the OpenCL C programming languages and the SPIR-V intermediate language with the following minimum requirements:
-
An OpenCL 1.x device must support at least one 1.x version of the OpenCL C programming language.
-
An OpenCL 2.0 device must support all the requirements of an OpenCL 1.2 device in addition to the OpenCL C 2.0 programming language. If multiple language versions are supported, the compiler defaults to using the OpenCL C 1.2 language version. To utilize the OpenCL 2.0 Kernel programming language, a programmer must specifically pass the appropriate compiler build option (
-cl-std=CL2.0
). The language version must not be higher than the platform version, but may exceed the device version. -
An OpenCL 2.1 device must support all the requirements of an OpenCL 2.0 device in addition to the SPIR-V intermediate language at version 1.0 or above. Intermediate language versioning is encoded as part of the binary object and no flags are required to be passed to the compiler.
-
An OpenCL 2.2 device must support all the requirements of an OpenCL 2.0 device in addition to the SPIR-V intermediate language at version 1.2 or above. Intermediate language versioning is encoded as a part of the binary object and no flags are required to be passed to the compiler.
-
OpenCL 3.0 is designed to enable any OpenCL implementation supporting OpenCL 1.2 or newer to easily support and transition to OpenCL 3.0, by making many features in OpenCL 2.0, 2.1, or 2.2 optional. This means that OpenCL 3.0 is backwards compatible with OpenCL 1.2, but is not necessarily backwards compatible with OpenCL 2.0, 2.1, or 2.2.
An OpenCL 3.0 platform must implement all OpenCL 3.0 APIs, but some APIs may return an error code unconditionally when a feature is not supported by any devices in the platform. Whenever a feature is optional, it will be paired with a query to determine whether the feature is supported. The queries will enable correctly written applications to selectively use all optional features without generating any OpenCL errors, if desired.
OpenCL 3.0 also adds a new version of the OpenCL C programming language, which makes many features in OpenCL C 2.0 optional. The new version of OpenCL C is backwards compatible with OpenCL C 1.2, but is not backwards compatible with OpenCL C 2.0. The new version of OpenCL C must be explicitly requested via the
-cl-std=
build option, otherwise a program will continue to be compiled using the highest OpenCL C 1.x language version supported for the device.Whenever an OpenCL C feature is optional in the new version of the OpenCL C programming language, it will be paired with a feature macro, such as
__opencl_c_feature_name
, and a corresponding API query. If a feature macro is defined then the feature is supported by the OpenCL C compiler, otherwise the optional feature is not supported.
In order to allow future versions of OpenCL to support new types of devices, minor releases of OpenCL may add new profiles where some features that are currently required for all OpenCL devices become optional. All features that are required for an OpenCL profile will also be required for that profile in subsequent minor releases of OpenCL, thereby guaranteeing backwards compatibility for applications targeting specific profiles. It is therefore strongly recommended that applications query the profile supported by the OpenCL device they are running on in order to remain robust to future changes.
3.4.3. Versioning
The OpenCL specification is regularly updated with bug fixes and clarifications. Occasionally new functionality is added to the core and extensions. In order to indicate to developers how and when these changes are made to the specification, and to provide a way to identify each set of changes, the OpenCL API, C language, intermediate languages and extensions maintain a version number. Built-in kernels are also versioned.
3.4.3.1. Versions
A version number comprises three logical fields:
-
The major version indicates a significant change. Backwards compatibility may break across major versions.
-
The minor version indicates the addition of new functionality with backwards compatibility for any existing profiles.
-
The patch version indicates bug fixes, clarifications and general improvements.
Version numbers are represented using the cl_version
type that is an alias for
a 32-bit integer. The fields are packed as follows:
-
The major version is a 10-bit integer packed into bits 31-22.
-
The minor version is a 10-bit integer packed into bits 21-12.
-
The patch version is a 12-bit integer packed into bits 11-0.
This enables versions to be ordered using standard C/C++ operators.
A number of convenience macros are provided by the OpenCL Headers to make working with version numbers easier.
CL_VERSION_MAJOR
extracts the major version from a packed cl_version
.
CL_VERSION_MINOR
extracts the minor version from a packed cl_version
.
CL_VERSION_PATCH
extracts the patch version from a packed cl_version
.
CL_MAKE_VERSION
returns a packed cl_version
from a major, minor and
patch version.
These are defined as follows:
typedef cl_uint cl_version;
#define CL_VERSION_MAJOR_BITS (10)
#define CL_VERSION_MINOR_BITS (10)
#define CL_VERSION_PATCH_BITS (12)
#define CL_VERSION_MAJOR_MASK ((1 << CL_VERSION_MAJOR_BITS) - 1)
#define CL_VERSION_MINOR_MASK ((1 << CL_VERSION_MINOR_BITS) - 1)
#define CL_VERSION_PATCH_MASK ((1 << CL_VERSION_PATCH_BITS) - 1)
#define CL_VERSION_MAJOR(version) \
((version) >> (CL_VERSION_MINOR_BITS + CL_VERSION_PATCH_BITS))
#define CL_VERSION_MINOR(version) \
(((version) >> CL_VERSION_PATCH_BITS) & CL_VERSION_MINOR_MASK)
#define CL_VERSION_PATCH(version) ((version) & CL_VERSION_PATCH_MASK)
#define CL_MAKE_VERSION(major, minor, patch) \
((((major) & CL_VERSION_MAJOR_MASK) << \
(CL_VERSION_MINOR_BITS + CL_VERSION_PATCH_BITS)) | \
(((minor) & CL_VERSION_MINOR_MASK) << \
CL_VERSION_PATCH_BITS) | \
((patch) & CL_VERSION_PATCH_MASK))
3.4.3.2. Version name pairing
It is sometimes necessary to associate a version to an entity it applies to
(e.g. extension or built-in kernel). This is done using a dedicated
cl_name_
structure, defined as follows:
typedef struct cl_name_version {
cl_version version;
char name[CL_NAME_VERSION_MAX_NAME_SIZE];
} cl_name_version;
The name
field is an array of CL_NAME_VERSION_MAX_NAME_SIZE
bytes used as
storage for a NUL-terminated string whose maximum length is therefore
CL_NAME_VERSION_MAX_NAME_SIZE - 1
.
4. The OpenCL Platform Layer
This section describes the OpenCL platform layer which implements platform-specific features that allow applications to query OpenCL devices, device configuration information, and to create OpenCL contexts using one or more devices.
4.1. Querying Platform Info
The list of platforms available can be obtained with the function:
cl_int clGetPlatformIDs(
cl_uint num_entries,
cl_platform_id* platforms,
cl_uint* num_platforms);
-
num_entries is the number of
cl_platform_
entries that can be added to platforms. If platforms is notid NULL
, the num_entries must be greater than zero. -
platforms returns a list of OpenCL platforms found. The
cl_platform_
values returned in platforms can be used to identify a specific OpenCL platform. If platforms isid NULL
, this argument is ignored. The number of OpenCL platforms returned is the minimum of the value specified by num_entries or the number of OpenCL platforms available. -
num_platforms returns the number of OpenCL platforms available. If num_platforms is
NULL
, this argument is ignored.
clGetPlatformIDs returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if num_entries is equal to zero and platforms is notVALUE NULL
or if both num_platforms and platforms areNULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Specific information about an OpenCL platform can be obtained with the function:
cl_int clGetPlatformInfo(
cl_platform_id platform,
cl_platform_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
-
platform refers to the platform ID returned by clGetPlatformIDs or can be
NULL
. If platform isNULL
, the behavior is implementation-defined. -
param_name is an enumeration constant that identifies the platform information being queried. It can be one of the following values as specified in the Platform Queries table.
-
param_value is a pointer to memory location where appropriate values for a given param_name, as specified in the Platform Queries table, will be returned. If param_value is
NULL
, it is ignored. -
param_value_size specifies the size in bytes of memory pointed to by param_value. This size in bytes must be ≥ size of return type specified in the Platform Queries table.
-
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
The information that can be queried using clGetPlatformInfo is specified in the Platform Queries table.
Platform Info | Return Type | Description |
---|---|---|
|
OpenCL profile string. Returns the profile name supported by the implementation. The profile name returned can be one of the following strings: FULL_PROFILE - if the implementation supports the OpenCL specification (functionality defined as part of the core specification and does not require any extensions to be supported). EMBEDDED_PROFILE - if the implementation supports the OpenCL embedded profile. The embedded profile is defined to be a subset for each version of OpenCL. The embedded profile for OpenCL is described in OpenCL Embedded Profile. |
|
|
OpenCL version string. Returns the OpenCL version supported by the implementation. This version string has the following format: OpenCL<space><major_version.minor_version><space><platform-specific information> The major_version.minor_version value returned will be one of 1.0, 1.1, 1.2, 2.0, 2.1, 2.2 or 3.0. |
|
Missing before version 3.0. |
|
Returns the detailed (major, minor, patch) version supported by the
platform. The major and minor version numbers returned must match
those returned via |
|
Platform name string. |
|
|
Platform vendor string. |
|
|
Returns a space separated list of extension names (the extension names themselves do not contain any spaces) supported by the platform. Each extension that is supported by all devices associated with this platform must be reported here. |
|
Missing before version 3.0. |
Returns an array of description (name and version) structures that lists
all the extensions supported by the platform. The same extension name
must not be reported more than once. The list of extensions reported
must match the list reported via |
|
Missing before version 2.1. |
|
Returns the resolution of the host timer in nanoseconds as used by clGetDeviceAndHostTimer. Support for device and host timer synchronization is required for platforms supporting OpenCL 2.1 or 2.2. This value must be 0 for devices that do not support device and host timer synchronization. |
clGetPlatformInfo returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following
errors [4].
-
CL_INVALID_
if platform is not a valid platform.PLATFORM -
CL_INVALID_
if param_name is not one of the supported values or if size in bytes specified by param_value_size is < size of return type as specified in the OpenCL Platform Queries table, and param_value is not aVALUE NULL
value. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
4.2. Querying Devices
The list of devices available on a platform can be obtained using the function [5]:
cl_int clGetDeviceIDs(
cl_platform_id platform,
cl_device_type device_type,
cl_uint num_entries,
cl_device_id* devices,
cl_uint* num_devices);
-
platform refers to the platform ID returned by clGetPlatformIDs or can be
NULL
. If platform isNULL
, the behavior is implementation-defined. -
device_type is a bitfield that identifies the type of OpenCL device. The device_type can be used to query specific OpenCL devices or all OpenCL devices available. The valid values for device_type are specified in the Device Types table.
-
num_entries is the number of
cl_device_
entries that can be added to devices. If devices is notid NULL
, the num_entries must be greater than zero. -
devices returns a list of OpenCL devices found. The
cl_device_
values returned in devices can be used to identify a specific OpenCL device. If devices isid NULL
, this argument is ignored. The number of OpenCL devices returned is the minimum of the value specified by num_entries or the number of OpenCL devices whose type matches device_type. -
num_devices returns the number of OpenCL devices available that match device_type. If num_devices is
NULL
, this argument is ignored.
Device Type | Description |
---|---|
An OpenCL device similar to a traditional CPU (Central Processing Unit). The host processor that executes OpenCL host code may also be considered a CPU OpenCL device. |
|
An OpenCL device similar to a GPU (Graphics Processing Unit). Many systems include a dedicated processor for graphics or rendering that may be considered a GPU OpenCL device. |
|
Dedicated devices that may accelerate OpenCL programs, such as FPGAs (Field Programmable Gate Arrays), DSPs (Digital Signal Processors), or AI (Artificial Intelligence) processors. |
|
Missing before version 1.2. |
Specialized devices that implement some of the OpenCL runtime APIs but do not support all required OpenCL functionality. |
The default OpenCL device in the platform.
The default OpenCL device must not be a |
|
All OpenCL devices available in the platform, except for
|
The device type is purely informational and has no semantic meaning.
Some devices may be more than one type.
For example, a CL_DEVICE_
device may also be a
CL_DEVICE_
device, or a CL_DEVICE_
device
may also be some other, more descriptive device type.
CL_DEVICE_
devices must not be combined with any other
device types.
One device in the platform should be a CL_DEVICE_
device.
The default device should also be a more specific device type, such
as CL_DEVICE_
or CL_DEVICE_
.
clGetDeviceIDs returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if platform is not a valid platform.PLATFORM -
CL_INVALID_
if device_type is not a valid value.DEVICE_ TYPE -
CL_INVALID_
if num_entries is equal to zero and devices is notVALUE NULL
or if both num_devices and devices areNULL
. -
CL_DEVICE_
if no OpenCL devices that matched device_type were found.NOT_ FOUND -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
The application can query specific capabilities of the OpenCL device(s) returned by clGetDeviceIDs. This can be used by the application to determine which device(s) to use.
To get specific information about an OpenCL device, call the function:
cl_int clGetDeviceInfo(
cl_device_id device,
cl_device_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
-
device may be a device returned by clGetDeviceIDs or a sub-device created by clCreateSubDevices. If device is a sub-device, the specific information for the sub-device will be returned. The information that can be queried using clGetDeviceInfo is specified in the Device Queries table.
-
param_name is an enumeration constant that identifies the device information being queried. It can be one of the following values as specified in the Device Queries table.
-
param_value is a pointer to memory location where appropriate values for a given param_name, as specified in the Device Queries table, will be returned. If param_value is
NULL
, it is ignored. -
param_value_size specifies the size in bytes of memory pointed to by param_value. This size in bytes must be ≥ size of return type specified in the Device Queries table.
-
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
The device queries described in the Device Queries table should return the same information for a root-level device i.e. a device returned by clGetDeviceIDs and any sub-devices created from this device except for the following queries:
Device Info | Return Type | Description |
---|---|---|
|
The type or types of the OpenCL device. Please see the Device Types table for supported device types and device type combinations. |
|
|
A unique device vendor identifier. If the vendor has a PCI vendor ID, the low 16 bits must contain that PCI
vendor ID, and the remaining bits must be set to zero. Otherwise, the
value returned must be a valid Khronos vendor ID represented by type
|
|
|
The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1. |
|
|
Maximum dimensions that specify the global and local work-item IDs
used by the data parallel execution model. (Refer to
clEnqueueNDRangeKernel).
The minimum value is 3 for devices that are not of type
|
|
|
Maximum number of work-items that can be specified in each dimension of the work-group to clEnqueueNDRangeKernel. Returns n The minimum value is (1, 1, 1) for devices that are not of type
|
|
|
Maximum number of work-items in a work-group that a device is
capable of executing on a single compute unit, for any given
kernel-instance running on the device. (Refer also to
clEnqueueNDRangeKernel and |
|
|
|
Preferred native vector width size for built-in scalar types that can be put into vectors. The vector width is defined as the number of scalar elements that can be stored in the vector. If double precision is not supported,
If the cl_khr_fp16 extension is not supported,
|
Missing before version 1.1. |
|
Returns the native ISA vector width. The vector width is defined as the number of scalar elements that can be stored in the vector. If double precision is not supported,
If the cl_khr_fp16 extension is not supported,
|
|
Clock frequency of the device in MHz. The meaning of this value is implementation-defined. For devices with multiple clock domains, the clock frequency for any of the clock domains may be returned. For devices that dynamically change frequency for power or thermal reasons, the returned clock frequency may be any valid frequency. Note: This definition is missing before version 2.2. Maximum configured clock frequency of the device in MHz. Note: This definition is deprecated by version 2.2. |
|
|
The default compute device address space size of the global address space specified as an unsigned integer value in bits. Currently supported values are 32 or 64 bits. |
|
|
Max size of memory object allocation in bytes.
The minimum value is max(min(1024 × 1024 × 1024, 1/4th
of |
|
|
Is |
|
|
Max number of image objects arguments of a kernel declared with the
read_only qualifier.
The minimum value is 128 if |
|
|
Max number of image objects arguments of a kernel declared with the
write_only qualifier.
The minimum value is 64 if |
|
Missing before version 2.0. |
|
Max number of image objects arguments of a kernel declared with the write_only or read_write qualifier. Support for read-write image arguments is required for an OpenCL 2.0, 2.1,
or 2.2 device if The minimum value is 64 if the device supports read-write images arguments, and must be 0 for devices that do not support read-write images. |
Missing before version 2.1. Also see extension cl_khr_il_program. |
|
The intermediate languages that can be supported by clCreateProgramWithIL for this device. Returns a space-separated list of IL version strings of the form <IL_Prefix>_<Major_Version>.<Minor_Version>. For an OpenCL 2.1 or 2.2 device, SPIR-V is a required IL prefix. If the device does not support intermediate language programs, the
value must be |
Missing before version 3.0. Also see extension cl_khr_il_program. |
Returns an array of descriptions (name and version) for all supported
intermediate languages. Intermediate languages with the same name may be
reported more than once but each name and major/minor version
combination may only be reported once. The list of intermediate
languages reported must match the list reported via
For an OpenCL 2.1 or 2.2 device, at least one version of SPIR-V must be reported. |
|
|
Max width of 2D image or 1D image not created from a buffer object in pixels. The minimum value is 16384 if |
|
|
Max height of 2D image in pixels. The minimum value is 16384 if |
|
|
Max width of 3D image in pixels. The minimum value is 2048 if |
|
|
Max height of 3D image in pixels. The minimum value is 2048 if |
|
|
Max depth of 3D image in pixels. The minimum value is 2048 if |
|
Missing before version 1.2. |
|
Max number of pixels for a 1D image created from a buffer object. The minimum value is 65536 if |
Missing before version 1.2. |
|
Max number of images in a 1D or 2D image array. The minimum value is 2048 if |
|
Maximum number of samplers that can be used in a kernel. The minimum value is 16 if |
|
Missing before version 2.0. |
|
The row pitch alignment size in pixels for 2D images created from a buffer. The value returned must be a power of 2. Support for 2D images created from a buffer is required for an OpenCL 2.0, 2.1,
or 2.2 device if This value must be 0 for devices that do not support 2D images created from a buffer. |
Missing before version 2.0. |
|
This query specifies the minimum alignment in pixels of the host_ptr
specified to clCreateBuffer or clCreateBufferWithProperties when a 2D image
is created from a buffer which was created using Support for 2D images created from a buffer is required for an OpenCL 2.0, 2.1,
or 2.2 device if This value must be 0 for devices that do not support 2D images created from a buffer. |
Missing before version 2.0. |
|
The maximum number of pipe objects that can be passed as arguments to a kernel. The minimum value is 16 for devices supporting pipes, and must be 0 for devices that do not support pipes. |
Missing before version 2.0. |
|
The maximum number of reservations that can be active for a pipe per work-item in a kernel. A work-group reservation is counted as one reservation per work-item. The minimum value is 1 for devices supporting pipes, and must be 0 for devices that do not support pipes. |
Missing before version 2.0. |
|
The maximum size of pipe packet in bytes. Support for pipes is required for an OpenCL 2.0, 2.1, or 2.2 device. The minimum value is 1024 bytes if the device supports pipes, and must be 0 for devices that do not support pipes. |
|
Max size in bytes of all arguments that can be passed to a kernel. The minimum value is 1024 for devices that are not of type
|
|
|
Alignment requirement (in bits) for sub-buffer offsets.
The minimum value is the size (in bits) of the largest OpenCL
built-in data type supported by the device (long16 in FULL profile,
long16 or int16 in EMBEDDED profile) for devices that are not of
type |
|
Deprecated by version 1.2. |
|
The minimum value is the size (in bytes) of the largest OpenCL data
type supported by the device ( |
|
Describes single precision floating-point capability of the device. This is a bit-field that describes one or more of the following values: For the full profile, the mandated minimum floating-point capability
for devices that are not of type For the embedded profile, see the dedicated table. |
|
Missing before version 1.2. Also see extension cl_khr_fp64. |
|
Describes double precision floating-point capability of the OpenCL device. This is a bit-field that describes one or more of the following values: Double precision is an optional feature so the mandated minimum double precision floating-point capability is 0. If double precision is supported by the device, then the minimum double precision floating-point capability for OpenCL 2.0 or newer devices is: or for OpenCL 1.0, OpenCL 1.1 or OpenCL 1.2 devices: |
|
Type of global memory cache supported.
Valid values are: |
|
|
Size of global memory cache line in bytes. |
|
|
Size of global memory cache in bytes. |
|
|
Size of global device memory in bytes. |
|
|
Max size in bytes of a constant buffer allocation.
The minimum value is 64 KB for devices that are not of type
|
|
|
Max number of arguments declared with the |
|
Missing before version 2.0. |
|
The maximum number of bytes of storage that may be allocated for any single variable in program scope or inside a function in an OpenCL kernel language declared in the global address space. Support for program scope global variables is required for an OpenCL 2.0, 2.1, or 2.2 device. The minimum value is 64 KB if the device supports program scope global variables, and must be 0 for devices that do not support program scope global variables. |
Missing before version 2.0. |
|
Maximum preferred total size, in bytes, of all program variables in the global address space. This is a performance hint. An implementation may place such variables in storage with optimized device access. This query returns the capacity of such storage. The minimum value is 0. |
|
Type of local memory supported.
This can be set to For custom devices, |
|
|
Size of local memory region in bytes.
The minimum value is 32 KB for devices that are not of type
|
|
|
Is |
|
Missing before version 1.1 and deprecated by version 2.0. |
|
Is |
|
Describes the resolution of device timer. This is measured in nanoseconds. Refer to Profiling Operations for details. |
|
|
Is |
|
|
Is |
|
|
Is Is |
|
Missing before version 1.2. |
|
Is This can be This must be |
|
Describes the execution capabilities of the device. This is a bit-field that describes one or more of the following values: The mandated minimum capability is: |
|
Deprecated by version 2.0. |
|
See description of |
Missing before version 2.0. |
|
Describes the on host command-queue properties supported by the device. This is a bit-field that describes one or more of the following values: These properties are described in the Queue Properties table. The mandated minimum capability is: |
Missing before version 2.0. |
|
Describes the on device command-queue properties supported by the device. This is a bit-field that describes one or more of the following values: These properties are described in the Queue Properties table. Support for on-device queues is required for an OpenCL 2.0, 2.1, or 2.2 device. When on-device queues are supported, the mandated minimum capability is: Must be 0 for devices that do not support on-device queues. |
Missing before version 2.0. |
|
The preferred size of the device queue, in bytes. Applications should use this size for the device queue to ensure good performance. The minimum value is 16 KB for devices supporting on-device queues, and must be 0 for devices that do not support on-device queues. |
Missing before version 2.0. |
|
The maximum size of the device queue in bytes. The minimum value is 256 KB for the full profile and 64 KB for the embedded profile for devices supporting on-device queues, and must be 0 for devices that do not support on-device queues. |
Missing before version 2.0. |
|
The maximum number of device queues that can be created for this device in a single context. The minimum value is 1 for devices supporting on-device queues, and must be 0 for devices that do not support on-device queues. |
Missing before version 2.0. |
|
The maximum number of events in use by a device queue.
These refer to events returned by the The minimum value is 1024 for devices supporting on-device queues, and must be 0 for devices that do not support on-device queues. |
Missing before version 1.2. |
|
A semi-colon separated list of built-in kernels supported by the device. An empty string is returned if no built-in kernels are supported by the device. |
Missing before version 3.0. |
Returns an array of descriptions for the built-in kernels supported by
the device. Each built-in kernel may only be reported once. The list of
reported kernels must match the list returned via
|
|
|
The platform associated with this device. |
|
|
Device name string. |
|
|
Vendor name string. |
|
|
OpenCL software driver version string. Follows a vendor-specific format. |
|
|
OpenCL profile string. Returns the profile name supported by the device. The profile name returned can be one of the following strings: FULL_PROFILE - if the device supports the OpenCL specification (functionality defined as part of the core specification and does not require any extensions to be supported). EMBEDDED_PROFILE - if the device supports the OpenCL embedded profile. |
|
|
OpenCL version string. Returns the OpenCL version supported by the device. This version string has the following format: OpenCL<space><major_version.minor_version><space><vendor-specific information> The major_version.minor_version value returned will be one of 1.0, 1.1, 1.2, 2.0, 2.1, 2.2, or 3.0. |
|
Missing before version 3.0. |
|
Returns the detailed (major, minor, patch) version supported by the
device. The major and minor version numbers returned must match
those returned via |
Missing before version 1.1 and deprecated by version 3.0. |
|
Returns the highest fully backwards compatible OpenCL C version supported by the compiler for the device. For devices supporting compilation from OpenCL C source, this will return a version string with the following format: OpenCL<space>C<space><major_version.minor_version><space><vendor-specific information> For devices that support compilation from OpenCL C source: Because OpenCL 3.0 is backwards compatible with OpenCL C 1.2, an OpenCL 3.0 device must support at least OpenCL C 1.2. An OpenCL 3.0 device may return an OpenCL C version newer than OpenCL C 1.2 if and only if all optional OpenCL C features are supported by the device for the newer version. Support for OpenCL C 2.0 is required for an OpenCL 2.0, OpenCL 2.1, or OpenCL 2.2 device. Support for OpenCL C 1.2 is required for an OpenCL 1.2 device. Support for OpenCL C 1.1 is required for an OpenCL 1.1 device. Support for either OpenCL C 1.0 or OpenCL C 1.1 is required for an OpenCL 1.0 device. For devices that do not support compilation from OpenCL C source,
such as when This query has been superseded by the |
Missing before version 3.0. |
Returns an array of name, version descriptions listing all the versions
of OpenCL C supported by the compiler for the device.
In each returned description structure, the name field is required to be
"OpenCL C". The list may include both newer non-backwards compatible
OpenCL C versions, such as OpenCL C 3.0, and older OpenCL C versions
with mandatory backwards compatibility.
The version returned by For devices that support compilation from OpenCL C source: Because OpenCL 3.0 is backwards compatible with OpenCL C 1.2, and OpenCL C 1.2 is backwards compatible with OpenCL C 1.1 and OpenCL C 1.0, support for at least OpenCL C 3.0, OpenCL C 1.2, OpenCL C 1.1, and OpenCL C 1.0 is required for an OpenCL 3.0 device. Support for OpenCL C 2.0, OpenCL C 1.2, OpenCL C 1.1, and OpenCL C 1.0 is required for an OpenCL 2.0, OpenCL 2.1, or OpenCL 2.2 device. Support for OpenCL C 1.2, OpenCL C 1.1, and OpenCL C 1.0 is required for an OpenCL 1.2 device. Support for OpenCL C 1.1 and OpenCL C 1.0 is required for an OpenCL 1.1 device. Support for at least OpenCL C 1.0 is required for an OpenCL 1.0 device. For devices that do not support compilation from OpenCL C source, this query may return an empty array. |
|
Missing before version 3.0. |
Returns an array of optional OpenCL C features supported by the compiler for the device alongside the OpenCL C version that introduced the feature macro. For example, if a compiler supports an OpenCL C 3.0 feature, the returned name will be the full name of the OpenCL C feature macro, and the returned version will be 3.0.0. For devices that do not support compilation from OpenCL C source, this query may return an empty array. |
|
|
Returns a space separated list of extension names (the extension names themselves do not contain any spaces) supported by the device. The list of extension names may include Khronos approved extension names and vendor specified extension names. The following Khronos extension names must be returned by all devices that support OpenCL 1.1: cl_khr_byte_addressable_store Additionally, the following Khronos extension names must be returned by all devices that support OpenCL 1.2 when and only when the optional feature is supported: cl_khr_fp64 Additionally, the following Khronos extension names must be returned by all devices that support OpenCL 2.0, OpenCL 2.1, or OpenCL 2.2. For devices that support OpenCL 3.0, these extension names must be returned when and only when the optional feature is supported: cl_khr_3d_image_writes Please refer to the OpenCL Extension Specification or vendor provided documentation for a detailed description of these extensions. |
|
Missing before version 3.0. |
Returns an array of description (name and version) structures. The same
extension name must not be reported more than once. The list of
extensions reported must match the list reported via
See |
|
Missing before version 1.2. |
|
Maximum size in bytes of the internal buffer that holds the output of printf calls from a kernel. The minimum value for the FULL profile is 1 MB. |
Missing before version 1.2. |
|
Is |
Missing before version 1.2. |
|
Returns the |
Missing before version 1.2. |
|
Returns the maximum number of sub-devices that can be created when a device is partitioned. The value returned cannot exceed |
Missing before version 1.2. |
|
Returns the list of partition types supported by device.
This is an array of If the device cannot be partitioned (i.e. there is no partitioning scheme supported by the device that will return at least two subdevices), a value of 0 will be returned. |
Missing before version 1.2. |
|
Returns the list of supported affinity domains for partitioning the
device using If the device does not support any affinity domains, a value of 0 will be returned. |
Missing before version 1.2. |
|
Returns the properties argument specified in clCreateSubDevices if
device is a sub-device.
In the case where the properties argument to clCreateSubDevices is
Otherwise the implementation may either return a param_value_size_ret of 0 i.e. there is no partition type associated with device or can return a property value of 0 (where 0 is used to terminate the partition property list) in the memory that param_value points to. |
Missing before version 1.2. |
|
Returns the device reference count. If the device is a root-level device, a reference count of one is returned. |
Missing before version 2.0. |
|
Describes the various shared virtual memory (SVM) memory allocation types the device supports. This is a bit-field that describes a combination of the following values: The mandated minimum capability for an OpenCL 2.0, 2.1, or 2.2 device is
For other device versions there is no mandated minimum capability. |
Missing before version 2.0. |
|
Returns the value representing the preferred alignment in bytes for OpenCL 2.0 fine-grained SVM atomic types. This query can return 0 which indicates that the preferred alignment is aligned to the natural size of the type. |
Missing before version 2.0. |
|
Returns the value representing the preferred alignment in bytes for OpenCL 2.0 atomic types to global memory. This query can return 0 which indicates that the preferred alignment is aligned to the natural size of the type. |
Missing before version 2.0. |
|
Returns the value representing the preferred alignment in bytes for OpenCL 2.0 atomic types to local memory. This query can return 0 which indicates that the preferred alignment is aligned to the natural size of the type. |
Missing before version 2.1. |
|
Maximum number of sub-groups in a work-group that a device is capable of executing on a single compute unit, for any given kernel-instance running on the device. The minimum value is 1 if the device supports subgroups, and must be 0 for devices that do not support subgroups. Support for subgroups is required for an OpenCL 2.1 or 2.2 device. (Refer also to clGetKernelSubGroupInfo.) |
Missing before version 2.1. |
|
Is This query must return |
Missing before version 3.0. |
|
Describes the various memory orders and scopes that the device supports for atomic memory operations. This is a bit-field that describes a combination of the following values: Because atomic memory orders are hierarchical, a device that supports a strong memory order must also support all weaker memory orders. Because atomic scopes are hierarchical, a device that supports a wide scope must also support all narrower scopes, except for the work-item scope, which is a special case. The mandated minimum capability is: |
Missing before version 3.0. |
|
Describes the various memory orders and scopes that the device supports for atomic fence operations.
This is a bit-field that has the same set of possible values as described for The mandated minimum capability is: |
Missing before version 3.0. |
|
Is |
Missing before version 3.0. |
|
Is |
Missing before version 3.0. |
|
Is |
Missing before version 3.0. |
|
Describes device-side enqueue capabilities of the device. This is a bit-field that describes one or more of the following values: If Devices that set |
Missing before version 3.0. |
|
Is Devices that return |
Missing before version 3.0. |
|
Returns the preferred multiple of work-group size for the given device. This is a performance hint intended as a guide when specifying the local work size argument to clEnqueueNDRangeKernel. (Refer also to clGetKernelWorkGroupInfo where |
Missing before version 3.0. |
|
Returns the latest version of the conformance test suite that this device has fully passed in accordance with the official conformance process. |
clGetDeviceInfo returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if device is not a valid device.DEVICE -
CL_INVALID_
if param_name is not one of the supported values or if size in bytes specified by param_value_size is < size of return type as specified in the Device Queries table and param_value is not aVALUE NULL
value or if param_name is a value that is available as an extension and the corresponding extension is not supported by the device. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To query device and host timestamps, call the function:
cl_int clGetDeviceAndHostTimer(
cl_device_id device,
cl_ulong* device_timestamp,
cl_ulong* host_timestamp);
clGetDeviceAndHostTimer is missing before version 2.1. |
-
device is a device returned by clGetDeviceIDs.
-
device_timestamp will be updated with the value of the device timer in nanoseconds. The resolution of the timer is the same as the device profiling timer returned by clGetDeviceInfo and the
CL_DEVICE_
query.PROFILING_ TIMER_ RESOLUTION -
host_timestamp will be updated with the value of the host timer in nanoseconds at the closest possible point in time to that at which device_timer was returned. The resolution of the timer may be queried via clGetPlatformInfo and the flag
CL_PLATFORM_
.HOST_ TIMER_ RESOLUTION
clGetDeviceAndHostTimer returns a reasonably synchronized pair of timestamps from the device timer and the host timer as seen by device. Implementations may need to execute this query with a high latency in order to provide reasonable synchronization of the timestamps. The host timestamp and device timestamp returned by this function and clGetHostTimer each have an implementation defined timebase. The timestamps will always be in their respective timebases regardless of which query function is used. The timestamp returned from clGetEventProfilingInfo for an event on a device and a device timestamp queried from the same device will always be in the same timebase.
clGetDeviceAndHostTimer will return CL_SUCCESS
with a time value in
host_timestamp if provided.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if device is not a valid device.DEVICE -
CL_INVALID_
if the platform associated with device does not support device and host timer synchronization.OPERATION -
CL_INVALID_
if host_timestamp or device_timestamp isVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To query the host clock, call the function:
cl_int clGetHostTimer(
cl_device_id device,
cl_ulong* host_timestamp);
clGetHostTimer is missing before version 2.1. |
-
device is a device returned by clGetDeviceIDs.
-
host_timestamp will be updated with the value of the current timer in nanoseconds. The resolution of the timer may be queried via clGetPlatformInfo and the flag
CL_PLATFORM_
.HOST_ TIMER_ RESOLUTION
clGetHostTimer returns the current value of the host clock as seen by device. This value is in the same timebase as the host_timestamp returned from clGetDeviceAndHostTimer. The implementation will return with as low a latency as possible to allow a correlation with a subsequent application sampled time. The host timestamp and device timestamp returned by this function and clGetDeviceAndHostTimer each have an implementation defined timebase. The timestamps will always be in their respective timebases regardless of which query function is used. The timestamp returned from clGetEventProfilingInfo for an event on a device and a device timestamp queried from the same device will always be in the same timebase.
clGetHostTimer will return CL_SUCCESS
with a time value in
host_timestamp if provided.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if device is not a valid device.DEVICE -
CL_INVALID_
if the platform associated with device does not support device and host timer synchronization.OPERATION -
CL_INVALID_
if host_timestamp isVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
4.3. Partitioning a Device
Partitioning devices is missing before version 1.2. |
To create sub-devices partitioning an OpenCL device, call the function:
cl_int clCreateSubDevices(
cl_device_id in_device,
const cl_device_partition_property* properties,
cl_uint num_devices,
cl_device_id* out_devices,
cl_uint* num_devices_ret);
clCreateSubDevices is missing before version 1.2. |
-
in_device is the device to be partitioned.
-
properties specifies how in_device is to be partitioned, described by a partition name and its corresponding value. Each partition name is immediately followed by the corresponding desired value. The list is terminated with 0. The list of supported partitioning schemes is described in the Subdevice Partition table. Only one of the listed partitioning schemes can be specified in properties.
-
num_devices is the size of memory pointed to by out_devices specified as the number of
cl_device_
entries.id -
out_devices is the buffer where the OpenCL sub-devices will be returned. If out_devices is
NULL
, this argument is ignored. If out_devices is notNULL
, num_devices must be greater than or equal to the number of sub-devices that device may be partitioned into according to the partitioning scheme specified in properties. -
num_devices_ret returns the number of sub-devices that device may be partitioned into according to the partitioning scheme specified in properties. If num_devices_ret is
NULL
, it is ignored.
clCreateSubDevices creates an array of sub-devices that each reference a non-intersecting set of compute units within in_device, according to the partition scheme given by properties. The output sub-devices may be used in every way that the root (or parent) device can be used, including creating contexts, building programs, further calls to clCreateSubDevices and creating command-queues. When a command-queue is created against a sub-device, the commands enqueued on the queue are executed only on the sub-device.
Partition Property | Partition Value | Description |
---|---|---|
Missing before version 1.2. |
|
Split the aggregate device into as many smaller aggregate devices as
can be created, each containing n compute units.
The value n is passed as the value accompanying this property.
If n does not divide evenly into
|
Missing before version 1.2. |
|
This property is followed by a list of compute unit counts
terminated with 0 or The number of non-zero count entries in the list may not exceed
The total number of compute units specified may not exceed
|
Missing before version 1.2. |
|
Split the device into smaller aggregate devices containing one or more compute units that all share part of a cache hierarchy. The value accompanying this property may be drawn from the following list: The user may determine what happened by calling
clGetDeviceInfo( |
clCreateSubDevices returns CL_SUCCESS
if the partition is created
successfully.
Otherwise, it returns a NULL
value with the following error values
returned in errcode_ret:
-
CL_INVALID_
if in_device is not a valid device.DEVICE -
CL_INVALID_
if values specified in properties are not valid or if values specified in properties are valid but not supported by the device.VALUE -
CL_INVALID_
if out_devices is notVALUE NULL
and num_devices is less than the number of sub-devices created by the partition scheme. -
CL_DEVICE_
if the partition name is supported by the implementation but in_device could not be further partitioned.PARTITION_ FAILED -
CL_INVALID_
if the partition name specified in properties isDEVICE_ PARTITION_ COUNT CL_DEVICE_
and the number of sub-devices requested exceedsPARTITION_ BY_ COUNTS CL_DEVICE_
or the total number of compute units requested exceedsPARTITION_ MAX_ SUB_ DEVICES CL_DEVICE_
for in_device, or the number of compute units requested for one or more sub-devices is less than zero or the number of sub-devices requested exceedsMAX_ COMPUTE_ UNITS CL_DEVICE_
for in_device.MAX_ COMPUTE_ UNITS -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
A few examples that describe how to specify partition properties in properties argument to clCreateSubDevices are given below:
To partition a device containing 16 compute units into two sub-devices, each containing 8 compute units, pass the following in properties:
{ CL_DEVICE_PARTITION_EQUALLY, 8,
0 } // 0 terminates the property list
To partition a device with four compute units into two sub-devices with one sub-device containing 3 compute units and the other sub-device 1 compute unit, pass the following in properties argument:
{ CL_DEVICE_PARTITION_BY_COUNTS,
3, 1, CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
0 } // 0 terminates the property list
To split a device along the outermost cache line (if any), pass the following in properties argument:
{ CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
CL_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE,
0 } // 0 terminates the property list
To retain a device, call the function:
cl_int clRetainDevice(
cl_device_id device);
clRetainDevice is missing before version 1.2. |
-
device is the OpenCL device to retain.
clRetainDevice increments the device reference count if device is a
valid sub-device created by a call to clCreateSubDevices.
If device is a root level device i.e. a cl_device_
returned by
clGetDeviceIDs, the device reference count remains unchanged.
clRetainDevice returns CL_SUCCESS
if the function is executed successfully
or the device is a root-level device.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if device is not a valid device.DEVICE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To release a device, call the function:
cl_int clReleaseDevice(
cl_device_id device);
clReleaseDevice is missing before version 1.2. |
-
device is the OpenCL device to release.
clReleaseDevice decrements the device reference count if device is a
valid sub-device created by a call to clCreateSubDevices.
If device is a root level device i.e. a cl_device_
returned by
clGetDeviceIDs, the device reference count remains unchanged.
clReleaseDevice returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if device is not a valid device.DEVICE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
After the device reference count becomes zero and all the objects attached to device (such as command-queues) are released, the device object is deleted. Using this function to release a reference that was not obtained by creating the object or by calling clRetainDevice causes undefined behavior.
4.4. Contexts
To create an OpenCL context, call the function:
cl_context clCreateContext(
const cl_context_properties* properties,
cl_uint num_devices,
const cl_device_id* devices,
void (CL_CALLBACK* pfn_notify)(const char* errinfo, const void* private_info, size_t cb, void* user_data),
void* user_data,
cl_int* errcode_ret);
-
properties specifies a list of context property names and their corresponding values. Each property name is immediately followed by the corresponding desired value. The list is terminated with 0. The list of supported properties is described in the Context Properties table. properties can be
NULL
in which case the platform that is selected is implementation-defined. -
num_devices is the number of devices specified in the devices argument.
-
devices is a pointer to a list of unique devices returned by clGetDeviceIDs or sub-devices created by clCreateSubDevices for a platform. [11]
-
pfn_notify is a callback function that can be registered by the application. This callback function will be used by the OpenCL implementation to report information on errors during context creation as well as errors that occur at runtime in this context. This callback function may be called asynchronously by the OpenCL implementation. It is the applications responsibility to ensure that the callback function is thread-safe. If pfn_notify is
NULL
, no callback function is registered. -
user_data will be passed as the user_data argument when pfn_notify is called. user_data can be
NULL
. -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
The parameters to the callback function pfn_notify are:
-
errinfo is a pointer to an error string.
-
private_info and cb represent a pointer to binary data that is returned by the OpenCL implementation that can be used to log additional information helpful in debugging the error.
-
user_data is a pointer to user supplied data.
Contexts are used by the OpenCL runtime for managing objects such as command-queues, memory, program and kernel objects and for executing kernels on one or more devices specified in the context.
Context Property | Property Value | Description |
---|---|---|
|
Specifies the platform to use. |
|
Missing before version 1.2. |
|
Specifies whether the user is responsible for synchronization between OpenCL and other APIs. Please refer to the specific sections in the OpenCL Extension Specification that describe sharing with other APIs for restrictions on using this flag. If |
There are a number of cases where error notifications need to be delivered due to an error that occurs outside a context. Such notifications may not be delivered through the pfn_notify callback. Where these notifications go is implementation-defined. |
clCreateContext returns a valid non-zero context and errcode_ret is set
to CL_SUCCESS
if the context is created successfully.
Otherwise, it returns a NULL
value with the following error values
returned in errcode_ret:
-
CL_INVALID_
if properties isPLATFORM NULL
and no platform could be selected or if platform value specified in properties is not a valid platform. -
CL_INVALID_
if context property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once. This error code is missing before version 1.1.PROPERTY -
CL_INVALID_
if devices isVALUE NULL
. -
CL_INVALID_
if num_devices is equal to zero.VALUE -
CL_INVALID_
if pfn_notify isVALUE NULL
but user_data is notNULL
. -
CL_INVALID_
if any device in devices is not a valid device.DEVICE -
CL_DEVICE_
if a device in devices is currently not available even though the device was returned by clGetDeviceIDs.NOT_ AVAILABLE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
It is possible that a device(s) becomes unavailable after a context and command-queues that use this device(s) have been created and commands have been queued to command-queues. In this case the behavior of OpenCL API calls that use this context (and command-queues) are considered to be implementation-defined. The user callback function, if specified, when the context is created can be used to record appropriate information in the errinfo, private_info arguments passed to the callback function when the device becomes unavailable. |
To create an OpenCL context from a specific device type [12], call the function:
cl_context clCreateContextFromType(
const cl_context_properties* properties,
cl_device_type device_type,
void (CL_CALLBACK* pfn_notify)(const char* errinfo, const void* private_info, size_t cb, void* user_data),
void* user_data,
cl_int* errcode_ret);
-
properties specifies a list of context property names and their corresponding values. Each property name is immediately followed by the corresponding desired value. The list of supported properties is described in the Context Properties table. properties can also be
NULL
in which case the platform that is selected is implementation-defined. -
device_type is a bit-field that identifies the type of device and is described in the Device Types table.
-
pfn_notify and user_data are described in clCreateContext.
-
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
Only devices that are returned by clGetDeviceIDs for device_type are used to create the context. The context does not reference any sub-devices that may have been created from these devices.
clCreateContextFromType returns a valid non-zero context and errcode_ret
is set to CL_SUCCESS
if the context is created successfully.
Otherwise, it returns a NULL
value with the following error values
returned in errcode_ret:
-
CL_INVALID_
if properties isPLATFORM NULL
and no platform could be selected or if platform value specified in properties is not a valid platform. -
CL_INVALID_
if context property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once. This error code is missing before version 1.1.PROPERTY -
CL_INVALID_
if pfn_notify isVALUE NULL
but user_data is notNULL
. -
CL_INVALID_
if device_type is not a valid value.DEVICE_ TYPE -
CL_DEVICE_
if no devices that match device_type and property values specified in properties are currently available.NOT_ AVAILABLE -
CL_DEVICE_
if no devices that match device_type and property values specified in properties were found.NOT_ FOUND -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To retain a context, call the function:
cl_int clRetainContext(
cl_context context);
-
context specifies the OpenCL context to retain.
clRetainContext increments the context reference count.
clCreateContext and clCreateContextFromType perform an implicit retain. This is very helpful for 3rd party libraries, which typically get a context passed to them by the application. However, it is possible that the application may delete the context without informing the library. Allowing functions to attach to (i.e. retain) and release a context solves the problem of a context being used by a library no longer being valid.
clRetainContext returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if context is not a valid OpenCL context.CONTEXT -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To release a context, call the function:
cl_int clReleaseContext(
cl_context context);
-
context specifies the OpenCL context to release.
clReleaseContext decrements the context reference count. After the reference count becomes zero and all the objects attached to context (such as memory objects, command-queues) are released, the context is deleted. Using this function to release a reference that was not obtained by creating the object or by calling clRetainContext causes undefined behavior.
clReleaseContext returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if context is not a valid OpenCL context.CONTEXT -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To query information about a context, call the function:
cl_int clGetContextInfo(
cl_context context,
cl_context_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
-
context specifies the OpenCL context being queried.
-
param_name is an enumeration constant that specifies the information to query.
-
param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is
NULL
, it is ignored. -
param_value_size specifies the size in bytes of memory pointed to by param_value. This size must be greater than or equal to the size of return type as described in the Context Attributes table.
-
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
The list of supported param_name values and the information returned in param_value by clGetContextInfo is described in the Context Attributes table.
Context Info | Return Type | Description |
---|---|---|
|
Return the context reference count. |
|
Missing before version 1.1. |
|
Return the number of devices in context. |
|
Return the list of devices and sub-devices in context. |
|
|
Return the properties argument specified in clCreateContext or clCreateContextFromType. If the properties argument specified in clCreateContext or
clCreateContextFromType used to create context was not If the properties argument specified in clCreateContext or
clCreateContextFromType used to create context was |
clGetContextInfo returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if param_name is not one of the supported values or if size in bytes specified by param_value_size is < size of return type as specified in the Context Attributes table and param_value is not aVALUE NULL
value. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To register a callback function with a context that is called when the context is destroyed, call the function
cl_int clSetContextDestructorCallback(
cl_context context,
void (CL_CALLBACK* pfn_notify)(cl_context context, void* user_data),
void* user_data);
clSetContextDestructorCallback is missing before version 3.0. |
-
context specifies the OpenCL context to register the callback to.
-
pfn_notify is the callback function to register. This callback function may be called asynchronously by the OpenCL implementation. It is the application’s responsibility to ensure that the callback function is thread-safe. The parameters to this callback function are:
-
context is the OpenCL context being deleted. When the callback function is called by the implementation, this context is no longer valid. context is only provided for reference purposes.
-
user_data is a pointer to user-supplied data.
-
-
user_data will be passed as the user_data argument when pfn_notify is called. user_data can be
NULL
.
Each call to clSetContextDestructorCallback registers the specified callback function on a destructor callback stack associated with context. The registered callback functions are called in the reverse order in which they were registered. If a context callback function was specified when context was created, it will not be called after any context destructor callback is called. Therefore, the context destructor callback provides a mechanism for an application to safely re-use or free any user_data specified for the context callback function when context was created.
clSetContextDestructorCallback returns CL_SUCCESS
if the function is
executed successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if pfn_notify isVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5. The OpenCL Runtime
In this section we describe the API calls that manage OpenCL objects such as command-queues, memory objects, program objects, kernel objects for kernel functions in a program and calls that allow you to enqueue commands to a command-queue such as executing a kernel, reading, or writing a memory object.
5.1. Command Queues
OpenCL objects such as memory, program and kernel objects are created using a context. Operations on these objects are performed using a command-queue. The command-queue can be used to queue a set of operations (referred to as commands) in order. Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization. Note that this should work as long as these objects are not being shared. Sharing of objects across multiple command-queues will require the application to perform appropriate synchronization. This is described in Shared OpenCL Objects
To create a host or device command-queue on a specific device, call the function
cl_command_queue clCreateCommandQueueWithProperties(
cl_context context,
cl_device_id device,
const cl_queue_properties* properties,
cl_int* errcode_ret);
clCreateCommandQueueWithProperties is missing before version 2.0. Also see extension cl_khr_create_command_queue. |
-
context must be a valid OpenCL context.
-
device must be a device or sub-device associated with context. It can either be in the list of devices and sub-devices specified when context is created using clCreateContext or be a root device with the same device type as specified when context is created using clCreateContextFromType.
-
properties specifies a list of properties for the command-queue and their corresponding values. Each property name is immediately followed by the corresponding desired value. The list is terminated with 0. The list of supported properties is described in the table below. If a supported property and its value is not specified in properties, its default value will be used. properties can be
NULL
in which case the default values for supported command-queue properties will be used. -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
Queue Property | Property Value | Description |
---|---|---|
|
This is a bitfield and can be set to a combination of the following values: If |
|
Missing before version 2.0. |
|
Specifies the size of the device queue in bytes. This can only be specified if For best performance, this should be ≤
If |
clCreateCommandQueueWithProperties returns a valid non-zero command-queue
and errcode_ret is set to CL_SUCCESS
if the command-queue is created
successfully.
Otherwise, it returns a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if device is not a valid device or is not associated with context.DEVICE -
CL_INVALID_
if values specified in properties are not valid.VALUE -
CL_INVALID_
if values specified in properties are valid but are not supported by the device.QUEUE_ PROPERTIES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To create a host command-queue on a specific device, call the function
cl_command_queue clCreateCommandQueue(
cl_context context,
cl_device_id device,
cl_command_queue_properties properties,
cl_int* errcode_ret);
clCreateCommandQueue is deprecated by version 2.0. |
-
context must be a valid OpenCL context.
-
device must be a device or sub-device associated with context. It can either be in the list of devices and sub-devices specified when context is created using clCreateContext or be a root device with the same device type as specified when context is created using clCreateContextFromType.
-
properties specifies a list of properties for the command-queue. This is a bit-field and the supported properties are described in the table below. Only command-queue properties specified in this table can be used, otherwise the value specified in properties is considered to be not valid. properties can be 0 in which case the default values for supported command-queue properties will be used.
Command-Queue Properties | Description |
---|---|
Determines whether the commands queued in the command-queue are executed in-order or out-of-order. If set, the commands in the command-queue are executed out-of-order. Otherwise, commands are executed in-order. |
|
Enable or disable profiling of commands in the command-queue. If set, the profiling of commands is enabled. Otherwise profiling of commands is disabled. |
-
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
clCreateCommandQueue returns a valid non-zero command-queue and errcode_ret
is set to CL_SUCCESS
if the command-queue is created successfully.
Otherwise, it returns a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if device is not a valid device or is not associated with context.DEVICE -
CL_INVALID_
if values specified in properties are not valid.VALUE -
CL_INVALID_
if values specified in properties are valid but are not supported by the device.QUEUE_ PROPERTIES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To replace the default command queue on a device, call the function
cl_int clSetDefaultDeviceCommandQueue(
cl_context context,
cl_device_id device,
cl_command_queue command_queue);
clSetDefaultDeviceCommandQueue is missing before version 2.1. |
-
context is the OpenCL context used to create command_queue.
-
device is a valid OpenCL device associated with context.
-
command_queue specifies a command queue object which replaces the default device command queue
clSetDefaultDeviceCommandQueue may be used to replace a default device
command queue created with clCreateCommandQueueWithProperties and the
CL_QUEUE_
flag.
clSetDefaultDeviceCommandQueue returns CL_SUCCESS
if the function is
executed successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if device is not a valid device or is not associated with context.DEVICE -
CL_INVALID_
if device does not support a replaceable default on-device queue.OPERATION -
CL_INVALID_
if command_queue is not a valid command-queue for device.COMMAND_ QUEUE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To retain a command queue, call the function
cl_int clRetainCommandQueue(
cl_command_queue command_queue);
-
command_queue specifies the command-queue to be retained.
The command_queue reference count is incremented.
clCreateCommandQueueWithProperties and clCreateCommandQueue perform an implicit retain. This is very helpful for 3rd party libraries, which typically get a command-queue passed to them by the application. However, it is possible that the application may delete the command-queue without informing the library. Allowing functions to attach to (i.e. retain) and release a command-queue solves the problem of a command-queue being used by a library no longer being valid.
clRetainCommandQueue returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid command-queue.COMMAND_ QUEUE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To release a command queue, call the function
cl_int clReleaseCommandQueue(
cl_command_queue command_queue);
-
command_queue specifies the command-queue to be released.
The command_queue reference count is decremented.
After the command_queue reference count becomes zero and all commands queued to command_queue have finished (eg. kernel-instances, memory object updates etc.), the command-queue is deleted.
clReleaseCommandQueue performs an implicit flush to issue any previously queued OpenCL commands in command_queue. Using this function to release a reference that was not obtained by creating the object or by calling clRetainCommandQueue causes undefined behavior.
clReleaseCommandQueue returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid command-queue.COMMAND_ QUEUE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To query information about a command-queue, call the function
cl_int clGetCommandQueueInfo(
cl_command_queue command_queue,
cl_command_queue_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
-
command_queue specifies the command-queue being queried.
-
param_name specifies the information to query.
-
param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is
NULL
, it is ignored. -
param_value_size is used to specify the size in bytes of memory pointed to by param_value. This size must be ≥ size of return type as described in the Command Queue Parameter table. If param_value is
NULL
, it is ignored. -
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
The list of supported param_name values and the information returned in param_value by clGetCommandQueueInfo is described in the Command Queue Parameter table.
Queue Info | Return Type | Description |
---|---|---|
|
Return the context specified when the command-queue is created. |
|
|
Return the device specified when the command-queue is created. |
|
|
Return the command-queue reference count. |
|
|
Return the currently specified properties for the command-queue.
These properties are specified by the value associated with the
|
|
Missing before version 3.0. |
|
Return the properties argument specified in clCreateCommandQueueWithProperties. If the properties argument specified in
clCreateCommandQueueWithProperties used to create command_queue
was not If command_queue was created using clCreateCommandQueue, or if the
properties argument specified in clCreateCommandQueueWithProperties}
was |
Missing before version 2.0. |
|
Return the size of the device command-queue. To be considered valid for this query, command_queue must be a device command-queue. |
Missing before version 2.1. |
|
Return the current default command queue for the underlying device. |
clGetCommandQueueInfo returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid command-queue, or if command_queue is not a valid command-queue for param_name.COMMAND_ QUEUE -
CL_INVALID_
if param_name is not one of the supported values or if size in bytes specified by param_value_size is < size of return type as specified in the Command Queue Parameter table, and param_value is not aVALUE NULL
value. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To enable or disable the properties of a command-queue, call the function
cl_int clSetCommandQueueProperty(
cl_command_queue command_queue,
cl_command_queue_properties properties,
cl_bool enable,
cl_command_queue_properties* old_properties);
clSetCommandQueueProperty is deprecated by version 1.1. |
-
command_queue specifies the command-queue being modified.
-
properties specifies the new list of properties for the command-queue. This is a bit-field and the supported properties are described in the Command-Queue Properties table for clCreateCommandQueue. Only command-queue properties specified in this table can be used, otherwise the value specified in properties is considered to be not valid.
-
enable determines whether the values specified by properties are enabled (if enable is
CL_TRUE
) or disabled (if enable isCL_FALSE
) for the command-queue. -
old_properties returns the command-queue properties before they were changed by clSetCommandQueueProperty. If old_properties is
NULL
, it is ignored.
Changing the |
clSetCommandQueueProperty returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid command-queue.COMMAND_ QUEUE -
CL_INVALID_
if values specified in properties are not valid.VALUE -
CL_INVALID_
if values specified in properties are valid but are not supported by the device.QUEUE_ PROPERTIES
5.2. Buffer Objects
A buffer object stores a one-dimensional collection of elements. Elements of a buffer object can be a scalar data type (such as an int, float), vector data type, or a user-defined structure.
5.2.1. Creating Buffer Objects
A buffer object may be created using the function
cl_mem clCreateBuffer(
cl_context context,
cl_mem_flags flags,
size_t size,
void* host_ptr,
cl_int* errcode_ret);
A buffer object may also be created with additional properties using the function
cl_mem clCreateBufferWithProperties(
cl_context context,
const cl_mem_properties* properties,
cl_mem_flags flags,
size_t size,
void* host_ptr,
cl_int* errcode_ret);
clCreateBufferWithProperties is missing before version 3.0. |
-
context is a valid OpenCL context used to create the buffer object.
-
properties is an optional list of properties for the buffer object and their corresponding values. The list is terminated with the special property
0
. If no properties are required, properties may beNULL
. OpenCL 3.0 does not define any optional properties for buffers. -
flags is a bit-field that is used to specify allocation and usage information about the image memory object being created and is described in the supported memory flag values table.
-
size is the size in bytes of the buffer memory object to be allocated.
-
host_ptr is a pointer to the buffer data that may already be allocated by the application. The size of the buffer that host_ptr points to must be greater than or equal to size bytes.
-
errcode_ret may return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
The alignment requirements for data stored in buffer objects are described in Alignment of Application Data Types.
If clCreateBuffer or clCreateBufferWithProperties is called with
CL_MEM_
set in its flags argument, the contents of the
memory pointed to by host_ptr at the time of the clCreateBuffer call
define the initial contents of the buffer object.
If clCreateBuffer or clCreateBufferWithProperties is called with a
pointer returned by clSVMAlloc as its host_ptr argument, and
CL_MEM_
is set in its flags argument, clCreateBuffer or
clCreateBufferWithProperties will succeed and return a valid non-zero
buffer object as long as the size argument is no larger than the
size argument passed in the original clSVMAlloc call.
The new buffer object returned has the shared memory as the underlying
storage.
Locations in the buffers underlying shared memory can be operated on using
atomic operations to the devices level of support as defined in the memory
model.
clCreateBuffer and clCreateBufferWithProperties returns a valid non-zero
buffer object and errcode_ret is set to CL_SUCCESS
if the buffer object
is created successfully.
Otherwise, they return a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if a property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once.PROPERTY -
CL_INVALID_
if values specified in flags are not valid as defined in the Memory Flags table.VALUE -
CL_INVALID_
if size is 0 or if size is greater thanBUFFER_ SIZE CL_DEVICE_
for all devices in context.MAX_ MEM_ ALLOC_ SIZE -
CL_INVALID_
if host_ptr isHOST_ PTR NULL
andCL_MEM_
orUSE_ HOST_ PTR CL_MEM_
are set in flags or if host_ptr is notCOPY_ HOST_ PTR NULL
butCL_MEM_
orCOPY_ HOST_ PTR CL_MEM_
are not set in flags.USE_ HOST_ PTR -
CL_MEM_
if there is a failure to allocate memory for buffer object.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Memory Flags | Description |
---|---|
This flag specifies that the memory object will be read and written by a kernel. This is the default. |
|
This flag specifies that the memory object will be written but not read by a kernel. Reading from a buffer or image object created with |
|
This flag specifies that the memory object is a readonly memory object when used inside a kernel. Writing to a buffer or image object created with |
|
This flag is valid only if host_ptr is not The contents of the memory pointed to by host_ptr at the time of the clCreateBuffer, clCreateBufferWithProperties, clCreateImage, clCreateImageWithProperties, clCreateImage2D, or clCreateImage3D call define the initial contents of the memory object. OpenCL implementations are allowed to cache the contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device. The result of OpenCL commands that operate on multiple buffer objects created with the same host_ptr or from overlapping host or SVM regions is considered to be undefined. |
|
This flag specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory. |
|
This flag is valid only if host_ptr is not |
|
Missing before version 1.2. |
This flag specifies that the host will only write to the memory object (using OpenCL APIs that enqueue a write or a map for write). This can be used to optimize write access from the host (e.g. enable write-combined allocations for memory objects for devices that communicate with the host over a system bus such as PCIe). |
Missing before version 1.2. |
This flag specifies that the host will only read the memory object (using OpenCL APIs that enqueue a read or a map for read). |
Missing before version 1.2. |
This flag specifies that the host will not read or write the memory object. |
Missing before version 2.0. |
This flag is only used by clGetSupportedImageFormats to query image
formats that may be both read from and written to by the same kernel
instance.
To create a memory object that may be read from and written to use
|
To create a new buffer object (referred to as a sub-buffer object) from an existing buffer object, call the function
cl_mem clCreateSubBuffer(
cl_mem buffer,
cl_mem_flags flags,
cl_buffer_create_type buffer_create_type,
const void* buffer_create_info,
cl_int* errcode_ret);
clCreateSubBuffer is missing before version 1.1. |
-
buffer must be a valid buffer object and cannot be a sub-buffer object.
-
flags is a bit-field that is used to specify allocation and usage information about the sub-buffer memory object being created and is described in the Memory Flags table. If the
CL_MEM_
,READ_ WRITE CL_MEM_
, orREAD_ ONLY CL_MEM_
values are not specified in flags, they are inherited from the corresponding memory access qualifiers associated with buffer. TheWRITE_ ONLY CL_MEM_
,USE_ HOST_ PTR CL_MEM_
, andALLOC_ HOST_ PTR CL_MEM_
values cannot be specified in flags but are inherited from the corresponding memory access qualifiers associated with buffer. IfCOPY_ HOST_ PTR CL_MEM_
is specified in the memory access qualifier values associated with buffer it does not imply any additional copies when the sub-buffer is created from buffer. If theCOPY_ HOST_ PTR CL_MEM_
,HOST_ WRITE_ ONLY CL_MEM_
, orHOST_ READ_ ONLY CL_MEM_
values are not specified in flags, they are inherited from the corresponding memory access qualifiers associated with buffer.HOST_ NO_ ACCESS -
buffer_create_type and buffer_create_info describe the type of buffer object to be created. The list of supported values for buffer_create_type and corresponding descriptor that buffer_create_info points to is described in the SubBuffer Attributes table.
Buffer Creation Type | Description |
---|---|
Missing before version 1.1. |
Create a buffer object that represents a specific region in buffer. buffer_create_info is a pointer to a If buffer is created with The buffer object returned references the data store allocated for buffer and points to the region specified by buffer_create_info in this data store. |
clCreateSubBuffer returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors in errcode_ret:
-
CL_INVALID_
if buffer is not a valid buffer object or is a sub-buffer object.MEM_ OBJECT -
CL_INVALID_
if buffer was created withVALUE CL_MEM_
and flags specifiesWRITE_ ONLY CL_MEM_
orREAD_ WRITE CL_MEM_
, or if buffer was created withREAD_ ONLY CL_MEM_
and flags specifiesREAD_ ONLY CL_MEM_
orREAD_ WRITE CL_MEM_
, or if flags specifiesWRITE_ ONLY CL_MEM_
orUSE_ HOST_ PTR CL_MEM_
orALLOC_ HOST_ PTR CL_MEM_
.COPY_ HOST_ PTR -
CL_INVALID_
if buffer was created withVALUE CL_MEM_
and flags specifyHOST_ WRITE_ ONLY CL_MEM_
, or if buffer was created withHOST_ READ_ ONLY CL_MEM_
and flags specifyHOST_ READ_ ONLY CL_MEM_
, or if buffer was created withHOST_ WRITE_ ONLY CL_MEM_
and flags specifyHOST_ NO_ ACCESS CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
.HOST_ WRITE_ ONLY -
CL_INVALID_
if the value specified in buffer_create_type is not valid.VALUE -
CL_INVALID_
if value(s) specified in buffer_create_info (for a given buffer_create_type) is not valid or if buffer_create_info isVALUE NULL
. -
CL_MEM_
if there is a failure to allocate memory for sub-buffer object.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY -
CL_INVALID_
if the region specified by theVALUE cl_buffer_
structure passed in buffer_create_info is out of bounds in buffer.region -
CL_INVALID_
if the size field of theBUFFER_ SIZE cl_buffer_
structure passed in buffer_create_info is 0.region -
CL_MISALIGNED_
if there are no devices in context associated with buffer for which the origin field of theSUB_ BUFFER_ OFFSET cl_buffer_
structure passed in buffer_create_info is aligned to theregion CL_DEVICE_
value.MEM_ BASE_ ADDR_ ALIGN
Concurrent reading from, writing to and copying between both a buffer object and its sub-buffer object(s) is undefined. Concurrent reading from, writing to and copying between overlapping sub-buffer objects created with the same buffer object is undefined. Only reading from both a buffer object and its sub-buffer objects or reading from multiple overlapping sub-buffer objects is defined. |
The cl_buffer_
structure specifies a region of a buffer object:
typedef struct cl_buffer_region {
size_t origin;
size_t size;
} cl_buffer_region;
-
origin is the offset in bytes of the region.
-
size is the size in bytes of the region.
Constraints on the values of origin and size are specified for the clCreateSubBuffer function to which this structure is passed.
5.2.2. Reading, Writing and Copying Buffer Objects
The following functions enqueue commands to read from a buffer object to host memory or write to a buffer object from host memory.
To read from a buffer object to host memory or to write to a buffer object from host memory call one of the functions
cl_int clEnqueueReadBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_read,
size_t offset,
size_t size,
void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
cl_int clEnqueueWriteBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
size_t offset,
size_t size,
const void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
command_queue is a valid host command-queue in which the read / write command will be queued. command_queue and buffer must be created with the same OpenCL context.
-
buffer refers to a valid buffer object.
-
blocking_read and blocking_write indicate if the read and write operations are blocking or non-blocking (see below).
-
offset is the offset in bytes in the buffer object to read from or write to.
-
size is the size in bytes of data being read or written.
-
ptr is the pointer to buffer in host memory where data is to be read into or to be written from.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this read / write command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
If blocking_read is CL_TRUE
i.e. the read command is blocking,
clEnqueueReadBuffer does not return until the buffer data has been read
and copied into memory pointed to by ptr.
If blocking_read is CL_FALSE
i.e. the read command is non-blocking,
clEnqueueReadBuffer queues a non-blocking read command and returns.
The contents of the buffer that ptr points to cannot be used until the
read command has completed.
The event argument returns an event object which can be used to query the
execution status of the read command.
When the read command has completed, the contents of the buffer that ptr
points to can be used by the application.
If blocking_write is CL_TRUE
, the write command is blocking and does not
return until the command is complete, including transfer of the data.
The memory pointed to by ptr can be reused by the application after the
clEnqueueWriteBuffer call returns.
If blocking_write is CL_FALSE
, the OpenCL implementation will use ptr to
perform a non-blocking write.
As the write is non-blocking the implementation can return immediately.
The memory pointed to by ptr cannot be reused by the application after the
call returns.
The event argument returns an event object which can be used to query the
execution status of the write command.
When the write command has completed, the memory pointed to by ptr can
then be reused by the application.
clEnqueueReadBuffer and clEnqueueWriteBuffer return CL_SUCCESS
if the
function is executed successfully.
Otherwise, they return one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if buffer is not a valid buffer object.MEM_ OBJECT -
CL_INVALID_
if the region being read or written specified by (offset, size) is out of bounds or if ptr is aVALUE NULL
value. -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_EXEC_
if the read and write operations are blocking and the execution status of any of the events in event_wait_list is a negative integer value. This error code is missing before version 1.1.STATUS_ ERROR_ FOR_ EVENTS_ IN_ WAIT_ LIST -
CL_MEM_
if there is a failure to allocate memory for data store associated with buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if clEnqueueReadBuffer is called on buffer which has been created withOPERATION CL_MEM_
orHOST_ WRITE_ ONLY CL_MEM_
.HOST_ NO_ ACCESS -
CL_INVALID_
if clEnqueueWriteBuffer is called on buffer which has been created withOPERATION CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
.HOST_ NO_ ACCESS -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
The following functions enqueue commands to read a 2D or 3D rectangular region from a buffer object to host memory or write a 2D or 3D rectangular region to a buffer object from host memory.
cl_int clEnqueueReadBufferRect(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_read,
const size_t* buffer_origin,
const size_t* host_origin,
const size_t* region,
size_t buffer_row_pitch,
size_t buffer_slice_pitch,
size_t host_row_pitch,
size_t host_slice_pitch,
void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
clEnqueueReadBufferRect is missing before version 1.1. |
cl_int clEnqueueWriteBufferRect(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_write,
const size_t* buffer_origin,
const size_t* host_origin,
const size_t* region,
size_t buffer_row_pitch,
size_t buffer_slice_pitch,
size_t host_row_pitch,
size_t host_slice_pitch,
const void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
clEnqueueWriteBufferRect is missing before version 1.1. |
-
command_queue refers is a valid host command-queue in which the read / write command will be queued. command_queue and buffer must be created with the same OpenCL context.
-
buffer refers to a valid buffer object.
-
blocking_read and blocking_write indicate if the read and write operations are blocking or non-blocking (see below).
-
buffer_origin defines the (x, y, z) offset in the memory region associated with buffer. For a 2D rectangle region, the z value given by buffer_origin[2] should be 0. The offset in bytes is computed as buffer_origin[2] × buffer_slice_pitch + buffer_origin[1] × buffer_row_pitch + buffer_origin[0].
-
host_origin defines the (x, y, z) offset in the memory region pointed to by ptr. For a 2D rectangle region, the z value given by host_origin[2] should be 0. The offset in bytes is computed as host_origin[2] × host_slice_pitch + host_origin[1] × host_row_pitch + host_origin[0].
-
region defines the (width in bytes, height in rows, depth in slices) of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region[2] should be 1. The values in region cannot be 0.
-
buffer_row_pitch is the length of each row in bytes to be used for the memory region associated with buffer. If buffer_row_pitch is 0, buffer_row_pitch is computed as region[0].
-
buffer_slice_pitch is the length of each 2D slice in bytes to be used for the memory region associated with buffer. If buffer_slice_pitch is 0, buffer_slice_pitch is computed as region[1] × buffer_row_pitch.
-
host_row_pitch is the length of each row in bytes to be used for the memory region pointed to by ptr. If host_row_pitch is 0, host_row_pitch is computed as region[0].
-
host_slice_pitch is the length of each 2D slice in bytes to be used for the memory region pointed to by ptr. If host_slice_pitch is 0, host_slice_pitch is computed as region[1] × host_row_pitch.
-
ptr is the pointer to buffer in host memory where data is to be read into or to be written from.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this read / write command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
If blocking_read is CL_TRUE
i.e. the read command is blocking,
clEnqueueReadBufferRect does not return until the buffer data has been
read and copied into memory pointed to by ptr.
If blocking_read is CL_FALSE
i.e. the read command is non-blocking,
clEnqueueReadBufferRect queues a non-blocking read command and returns.
The contents of the buffer that ptr points to cannot be used until the
read command has completed.
The event argument returns an event object which can be used to query the
execution status of the read command.
When the read command has completed, the contents of the buffer that ptr
points to can be used by the application.
If blocking_write is CL_TRUE
, the write command is blocking and does not
return until the command is complete, including transfer of the data.
The memory pointed to by ptr can be reused by the application after the
clEnqueueWriteBufferRect call returns.
If blocking_write is CL_FALSE
, the OpenCL implementation will use ptr to
perform a non-blocking write.
As the write is non-blocking the implementation can return immediately.
The memory pointed to by ptr cannot be reused by the application after the
call returns.
The event argument returns an event object which can be used to query the
execution status of the write command.
When the write command has completed, the memory pointed to by ptr can
then be reused by the application.
clEnqueueReadBufferRect and clEnqueueWriteBufferRect return CL_SUCCESS
if the function is executed successfully.
Otherwise, they return one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if buffer is not a valid buffer object.MEM_ OBJECT -
CL_INVALID_
if buffer_origin, host_origin, or region isVALUE NULL
. -
CL_INVALID_
if the region being read or written specified by (buffer_origin, region, buffer_row_pitch, buffer_slice_pitch) is out of bounds.VALUE -
CL_INVALID_
if any region array element is 0.VALUE -
CL_INVALID_
if buffer_row_pitch is not 0 and is less than region[0].VALUE -
CL_INVALID_
if host_row_pitch is not 0 and is less than region[0].VALUE -
CL_INVALID_
if buffer_slice_pitch is not 0 and is less than region[1] × buffer_row_pitch and not a multiple of buffer_row_pitch.VALUE -
CL_INVALID_
if host_slice_pitch is not 0 and is less than region[1] × host_row_pitch and not a multiple of host_row_pitch.VALUE -
CL_INVALID_
if ptr isVALUE NULL
. -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_EXEC_
if the read and write operations are blocking and the execution status of any of the events in event_wait_list is a negative integer value. This error code is missing before version 1.1.STATUS_ ERROR_ FOR_ EVENTS_ IN_ WAIT_ LIST -
CL_MEM_
if there is a failure to allocate memory for data store associated with buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if clEnqueueReadBufferRect is called on buffer which has been created withOPERATION CL_MEM_
orHOST_ WRITE_ ONLY CL_MEM_
.HOST_ NO_ ACCESS -
CL_INVALID_
if clEnqueueWriteBufferRect is called on buffer which has been created withOPERATION CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
.HOST_ NO_ ACCESS -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Calling clEnqueueReadBuffer to read a region of the buffer object with the
ptr argument value set to host_ptr + offset, where host_ptr is a
pointer to the memory region specified when the buffer object being read is
created with
Calling clEnqueueReadBufferRect to read a region of the buffer object with
the ptr argument value set to host_ptr and host_origin,
buffer_origin values are the same, where host_ptr is a pointer to the
memory region specified when the buffer object being read is created with
Calling clEnqueueWriteBuffer to update the latest bits in a region of the
buffer object with the ptr argument value set to host_ptr + offset,
where host_ptr is a pointer to the memory region specified when the buffer
object being written is created with
Calling clEnqueueWriteBufferRect to update the latest bits in a region of
the buffer object with the ptr argument value set to host_ptr and
host_origin, buffer_origin values are the same, where host_ptr is a
pointer to the memory region specified when the buffer object being written
is created with
|
To enqueue a command to copy a buffer object identified by src_buffer to another buffer object identified by dst_buffer, call the function
cl_int clEnqueueCopyBuffer(
cl_command_queue command_queue,
cl_mem src_buffer,
cl_mem dst_buffer,
size_t src_offset,
size_t dst_offset,
size_t size,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
command_queue refers to a host command-queue in which the copy command will be queued. The OpenCL context associated with command_queue, src_buffer and dst_buffer must be the same.
-
src_offset refers to the offset where to begin copying data from src_buffer.
-
dst_offset refers to the offset where to begin copying data into dst_buffer.
-
size refers to the size in bytes to copy.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this copy command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
clEnqueueCopyBuffer returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue, src_buffer and dst_buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if src_buffer and dst_buffer are not valid buffer objects.MEM_ OBJECT -
CL_INVALID_
if src_offset, dst_offset, size, src_offset + size or dst_offset + size require accessing elements outside the src_buffer and dst_buffer buffer objects respectively.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if src_buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_MISALIGNED_
if dst_buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_MEM_
if src_buffer and dst_buffer are the same buffer or sub-buffer object and the source and destination regions overlap or if src_buffer and dst_buffer are different sub-buffers of the same associated buffer object and they overlap. The regions overlap if src_offset ≤ dst_offset ≤ src_offset + size - 1 or if dst_offset ≤ src_offset ≤ dst_offset + size - 1.COPY_ OVERLAP -
CL_MEM_
if there is a failure to allocate memory for data store associated with src_buffer or dst_buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To enqueue a command to copy a 2D or 3D rectangular region from the buffer object identified by src_buffer to a 2D or 3D region in the buffer object identified by dst_buffer, call the function
cl_int clEnqueueCopyBufferRect(
cl_command_queue command_queue,
cl_mem src_buffer,
cl_mem dst_buffer,
const size_t* src_origin,
const size_t* dst_origin,
const size_t* region,
size_t src_row_pitch,
size_t src_slice_pitch,
size_t dst_row_pitch,
size_t dst_slice_pitch,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
clEnqueueCopyBufferRect is missing before version 1.1. |
-
command_queue refers to the host command-queue in which the copy command will be queued. The OpenCL context associated with command_queue, src_buffer and dst_buffer must be the same.
-
src_origin defines the (x, y, z) offset in the memory region associated with src_buffer. For a 2D rectangle region, the z value given by src_origin[2] should be 0. The offset in bytes is computed as src_origin[2] × src_slice_pitch + src_origin[1] × src_row_pitch + src_origin[0].
-
dst_origin defines the (x, y, z) offset in the memory region associated with dst_buffer. For a 2D rectangle region, the z value given by dst_origin[2] should be 0. The offset in bytes is computed as dst_origin[2] × dst_slice_pitch + dst_origin[1] × dst_row_pitch + dst_origin[0].
-
region defines the (width in bytes, height in rows, depth in slices) of the 2D or 3D rectangle being copied. For a 2D rectangle, the depth value given by region[2] should be 1. The values in region cannot be 0.
-
src_row_pitch is the length of each row in bytes to be used for the memory region associated with src_buffer. If src_row_pitch is 0, src_row_pitch is computed as region[0].
-
src_slice_pitch is the length of each 2D slice in bytes to be used for the memory region associated with src_buffer. If src_slice_pitch is 0, src_slice_pitch is computed as region[1] × src_row_pitch.
-
dst_row_pitch is the length of each row in bytes to be used for the memory region associated with dst_buffer. If dst_row_pitch is 0, dst_row_pitch is computed as region[0].
-
dst_slice_pitch is the length of each 2D slice in bytes to be used for the memory region associated with dst_buffer. If dst_slice_pitch is 0, dst_slice_pitch is computed as region[1] × dst_row_pitch.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this copy command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
Copying begins at the source offset and destination offset which are computed as described below in the description for src_origin and dst_origin. Each byte of the region’s width is copied from the source offset to the destination offset. After copying each width, the source and destination offsets are incremented by their respective source and destination row pitches. After copying each 2D rectangle, the source and destination offsets are incremented by their respective source and destination slice pitches.
If src_buffer and dst_buffer are the same buffer object, src_row_pitch must equal dst_row_pitch and src_slice_pitch must equal dst_slice_pitch. |
clEnqueueCopyBufferRect returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue, src_buffer and dst_buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if src_buffer and dst_buffer are not valid buffer objects.MEM_ OBJECT -
CL_INVALID_
if src_origin, dst_origin, or region isVALUE NULL
. -
CL_INVALID_
if (src_origin, region, src_row_pitch, src_slice_pitch) or (dst_origin, region, dst_row_pitch, dst_slice_pitch) require accessing elements outside the src_buffer and dst_buffer buffer objects respectively.VALUE -
CL_INVALID_
if any region array element is 0.VALUE -
CL_INVALID_
if src_row_pitch is not 0 and is less than region[0].VALUE -
CL_INVALID_
if dst_row_pitch is not 0 and is less than region[0].VALUE -
CL_INVALID_
if src_slice_pitch is not 0 and is less than region[1] × src_row_pitch or if src_slice_pitch is not 0 and is not a multiple of src_row_pitch.VALUE -
CL_INVALID_
if dst_slice_pitch is not 0 and is less than region[1] × dst_row_pitch or if dst_slice_pitch is not 0 and is not a multiple of dst_row_pitch.VALUE -
CL_INVALID_
if src_buffer and dst_buffer are the same buffer object and src_slice_pitch is not equal to dst_slice_pitch and src_row_pitch is not equal to dst_row_pitch.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MEM_
if src_buffer and dst_buffer are the same buffer or sub-buffer object and the source and destination regions overlap or if src_buffer and dst_buffer are different sub-buffers of the same associated buffer object and they overlap. Refer to Checking for Memory Copy Overlap for details on how to determine if source and destination regions overlap.COPY_ OVERLAP -
CL_MISALIGNED_
if src_buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_MISALIGNED_
if dst_buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_MEM_
if there is a failure to allocate memory for data store associated with src_buffer or dst_buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.2.3. Filling Buffer Objects
Filling buffer objects is missing before version 1.2. |
To enqueue a command to fill a buffer object with a pattern of a given pattern size, call the function
cl_int clEnqueueFillBuffer(
cl_command_queue command_queue,
cl_mem buffer,
const void* pattern,
size_t pattern_size,
size_t offset,
size_t size,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
clEnqueueFillBuffer is missing before version 1.2. |
-
command_queue refers to the host command-queue in which the fill command will be queued. The OpenCL context associated with command_queue and buffer must be the same.
-
buffer is a valid buffer object.
-
pattern is a pointer to the data pattern of size pattern_size in bytes. pattern will be used to fill a region in buffer starting at offset and is size bytes in size. The data pattern must be a scalar or vector integer or floating-point data type supported by OpenCL as described in Shared Application Scalar Data Types and Supported Application Vector Data Types. For example, if buffer is to be filled with a pattern of
float4
values, then pattern will be a pointer to acl_float4
value and pattern_size will besizeof(cl_float4)
. The maximum value of pattern_size is the size of the largest integer or floating-point vector data type supported by the OpenCL device. The memory associated with pattern can be reused or freed after the function returns. -
offset is the location in bytes of the region being filled in buffer and must be a multiple of pattern_size.
-
size is the size in bytes of region being filled in buffer and must be a multiple of pattern_size.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
The usage information which indicates whether the memory object can be read
or written by a kernel and/or the host and is given by the cl_mem_
argument value specified when buffer is created is ignored by
clEnqueueFillBuffer.
clEnqueueFillBuffer returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if buffer is not a valid buffer object.MEM_ OBJECT -
CL_INVALID_
if offset or offset + size require accessing elements outside the buffer buffer object respectively.VALUE -
CL_INVALID_
if pattern isVALUE NULL
or if pattern_size is 0 or if pattern_size is not one of { 1, 2, 4, 8, 16, 32, 64, 128 }. -
CL_INVALID_
if offset and size are not a multiple of pattern_size.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_MEM_
if there is a failure to allocate memory for data store associated with buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.2.4. Mapping Buffer Objects
To enqueue a command to map a region of the buffer object given by buffer into the host address space and returns a pointer to this mapped region, call the function
void* clEnqueueMapBuffer(
cl_command_queue command_queue,
cl_mem buffer,
cl_bool blocking_map,
cl_map_flags map_flags,
size_t offset,
size_t size,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event,
cl_int* errcode_ret);
-
command_queue must be a valid host command-queue.
-
blocking_map indicates if the map operation is blocking or non-blocking.
If blocking_map is CL_TRUE
, clEnqueueMapBuffer does not return until the
specified region in buffer is mapped into the host address space and the
application can access the contents of the mapped region using the pointer
returned by clEnqueueMapBuffer.
If blocking_map is CL_FALSE
i.e. map operation is non-blocking, the
pointer to the mapped region returned by clEnqueueMapBuffer cannot be used
until the map command has completed.
The event argument returns an event object which can be used to query the
execution status of the map command.
When the map command is completed, the application can access the contents
of the mapped region using the pointer returned by clEnqueueMapBuffer.
-
map_flags is a bit-field and is described in the Memory Map Flags table.
-
buffer is a valid buffer object. The OpenCL context associated with command_queue and buffer must be the same.
-
offset and size are the offset in bytes and the size of the region in the buffer object that is being mapped.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array. -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
clEnqueueMapBuffer will return a pointer to the mapped region.
The errcode_ret is set to CL_SUCCESS
.
A NULL
pointer is returned otherwise with one of the following error
values returned in errcode_ret:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if buffer is not a valid buffer object.MEM_ OBJECT -
CL_INVALID_
if region being mapped given by (offset, size) is out of bounds or if size is 0 or if values specified in map_flags are not valid.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for the device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_MAP_
if there is a failure to map the requested region into the host address space. This error cannot occur for buffer objects created withFAILURE CL_MEM_
orUSE_ HOST_ PTR CL_MEM_
.ALLOC_ HOST_ PTR -
CL_EXEC_
if the map operation is blocking and the execution status of any of the events in event_wait_list is a negative integer value. This error code is missing before version 1.1.STATUS_ ERROR_ FOR_ EVENTS_ IN_ WAIT_ LIST -
CL_MEM_
if there is a failure to allocate memory for data store associated with buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if buffer has been created withOPERATION CL_MEM_
orHOST_ WRITE_ ONLY CL_MEM_
andHOST_ NO_ ACCESS CL_MAP_
is set in map_flags or if buffer has been created withREAD CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
andHOST_ NO_ ACCESS CL_MAP_
orWRITE CL_MAP_
is set in map_flags.WRITE_ INVALIDATE_ REGION -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY -
CL_INVALID_
if mapping would lead to overlapping regions being mapped for writing.OPERATION
The pointer returned maps a region starting at offset and is at least size bytes in size. The result of a memory access outside this region is undefined.
If the buffer object is created with CL_MEM_
set in mem_flags,
the following will be true:
-
The host_ptr specified in clCreateBuffer or clCreateBufferWithProperties will contain the latest bits in the region being mapped when the clEnqueueMapBuffer command has completed.
-
The pointer value returned by clEnqueueMapBuffer will be derived from the host_ptr specified when the buffer object is created.
Mapped buffer objects are unmapped using clEnqueueUnmapMemObject. This is described in Unmapping Mapped Memory Objects.
Map Flags | Description |
---|---|
This flag specifies that the region being mapped in the memory object is being mapped for reading. The pointer returned by clEnqueueMapBuffer (clEnqueueMapImage) is guaranteed to contain the latest bits in the region being mapped when the clEnqueueMapBuffer (clEnqueueMapImage) command has completed. |
|
This flag specifies that the region being mapped in the memory object is being mapped for writing. The pointer returned by clEnqueueMapBuffer (clEnqueueMapImage) is guaranteed to contain the latest bits in the region being mapped when the clEnqueueMapBuffer (clEnqueueMapImage) command has completed |
|
Missing before version 1.2. |
This flag specifies that the region being mapped in the memory object is being mapped for writing. The contents of the region being mapped are to be discarded. This is typically the case when the region being mapped is overwritten by the host. This flag allows the implementation to no longer guarantee that the pointer returned by clEnqueueMapBuffer (clEnqueueMapImage) contains the latest bits in the region being mapped which can be a significant performance enhancement. |
5.3. Image Objects
An image object is used to store a one-, two- or three-dimensional texture, frame-buffer or image. The elements of an image object are selected from a list of predefined image formats. The minimum number of elements in a memory object is one.
5.3.1. Creating Image Objects
An image object may be created using the function
cl_mem clCreateImage(
cl_context context,
cl_mem_flags flags,
const cl_image_format* image_format,
const cl_image_desc* image_desc,
void* host_ptr,
cl_int* errcode_ret);
clCreateImage is missing before version 1.2. |
An image object may also be created with additional properties using the function
cl_mem clCreateImageWithProperties(
cl_context context,
const cl_mem_properties* properties,
cl_mem_flags flags,
const cl_image_format* image_format,
const cl_image_desc* image_desc,
void* host_ptr,
cl_int* errcode_ret);
clCreateImageWithProperties is missing before version 3.0. |
-
context is a valid OpenCL context used to create the image object.
-
properties is an optional list of properties for the image object and their corresponding values. The list is terminated with the special property
0
. If no properties are required, properties may beNULL
. OpenCL 3.0 does not define any optional properties for images. -
flags is a bit-field that is used to specify allocation and usage information about the image memory object being created and is described in the supported memory flag values table.
-
image_format is a pointer to a structure that describes format properties of the image to be allocated. A 1D image buffer or 2D image can be created from a buffer by specifying a buffer object in the image_desc→mem_object. A 2D image can be created from another 2D image object by specifying an image object in the image_desc→mem_object. Refer to the Image Format Descriptor section for a detailed description of the image format descriptor.
-
image_desc is a pointer to a structure that describes type and dimensions of the image to be allocated. Refer to the Image Descriptor section for a detailed description of the image descriptor.
-
host_ptr is a pointer to the image data that may already be allocated by the application. Refer to the table below for a description of how large the buffer that host_ptr points to must be.
-
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
The alignment requirements for data stored in image objects are described in Alignment of Application Data Types.
For all image types except CL_MEM_
, if the value
specified for flags is 0, the default is used which is CL_MEM_
.
For CL_MEM_
image type, or an image created from
another memory object (image or buffer), if the CL_MEM_
,
CL_MEM_
or CL_MEM_
values are not specified in flags,
they are inherited from the corresponding memory access qualifiers associated
with mem_object.
The CL_MEM_
, CL_MEM_
and CL_MEM_
values cannot be specified in flags but are inherited from the
corresponding memory access qualifiers associated with mem_object.
If CL_MEM_
is specified in the memory access qualifier values
associated with mem_object it does not imply any additional copies when
the image is created from mem_object.
If the CL_MEM_
, CL_MEM_
or
CL_MEM_
values are not specified in flags, they are
inherited from the corresponding memory access qualifiers associated with
mem_object.
For a 3D image or 2D image array, the image data specified by host_ptr is stored as a linear sequence of adjacent 2D image slices or 2D images respectively. Each 2D image is a linear sequence of adjacent scanlines. Each scanline is a linear sequence of image elements.
For a 2D image, the image data specified by host_ptr is stored as a linear sequence of adjacent scanlines. Each scanline is a linear sequence of image elements.
For a 1D image array, the image data specified by host_ptr is stored as a linear sequence of adjacent 1D images. Each 1D image is stored as a single scanline which is a linear sequence of adjacent elements.
For 1D image or 1D image buffer, the image data specified by host_ptr is stored as a single scanline which is a linear sequence of adjacent elements.
Image elements are stored according to their image format as described in the Image Format Descriptor section.
clCreateImage and clCreateImageWithProperties returns a valid non-zero
image object and errcode_ret is set to CL_SUCCESS
if the image object
is created successfully.
Otherwise, they return a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if a property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once.PROPERTY -
CL_INVALID_
if values specified in flags are not valid.VALUE -
CL_INVALID_
if values specified in image_format are not valid or if image_format isIMAGE_ FORMAT_ DESCRIPTOR NULL
. -
CL_INVALID_
if a 2D image is created from a buffer and the row pitch and base address alignment does not follow the rules described for creating a 2D image from a buffer.IMAGE_ FORMAT_ DESCRIPTOR -
CL_INVALID_
if a 2D image is created from a 2D image object and the rules described above are not followed.IMAGE_ FORMAT_ DESCRIPTOR -
CL_INVALID_
if values specified in image_desc are not valid or if image_desc isIMAGE_ DESCRIPTOR NULL
. -
CL_INVALID_
if image dimensions specified in image_desc exceed the maximum image dimensions described in the Device Queries table for all devices in context.IMAGE_ SIZE -
CL_INVALID_
if host_ptr isHOST_ PTR NULL
andCL_MEM_
orUSE_ HOST_ PTR CL_MEM_
are set in flags or if host_ptr is notCOPY_ HOST_ PTR NULL
butCL_MEM_
orCOPY_ HOST_ PTR CL_MEM_
are not set in flags.USE_ HOST_ PTR -
CL_INVALID_
if an image is being created from another memory object (buffer or image) under one of the following circumstances: 1) mem_object was created withVALUE CL_MEM_
and flags specifiesWRITE_ ONLY CL_MEM_
orREAD_ WRITE CL_MEM_
, 2) mem_object was created withREAD_ ONLY CL_MEM_
and flags specifiesREAD_ ONLY CL_MEM_
orREAD_ WRITE CL_MEM_
, 3) flags specifiesWRITE_ ONLY CL_MEM_
orUSE_ HOST_ PTR CL_MEM_
orALLOC_ HOST_ PTR CL_MEM_
.COPY_ HOST_ PTR -
CL_INVALID_
if an image is being created from another memory object (buffer or image) and mem_object was created withVALUE CL_MEM_
and flags specifiesHOST_ WRITE_ ONLY CL_MEM_
, or if mem_object was created withHOST_ READ_ ONLY CL_MEM_
and flags specifiesHOST_ READ_ ONLY CL_MEM_
, or if mem_object was created withHOST_ WRITE_ ONLY CL_MEM_
and_flags_ specifiesHOST_ NO_ ACCESS CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
.HOST_ WRITE_ ONLY -
CL_IMAGE_
if there are no devices in context that support image_format.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for image object.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if there are no devices in context that support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Image Type | Size of buffer that host_ptr points to |
---|---|
Missing before version 1.2. |
≥ image_row_pitch |
Missing before version 1.2. |
≥ image_row_pitch |
≥ image_row_pitch × image_height |
|
≥ image_slice_pitch × image_depth |
|
Missing before version 1.2. |
≥ image_slice_pitch × image_array_size |
Missing before version 1.2. |
≥ image_slice_pitch × image_array_size |
A 2D image object can be created using the following function
cl_mem clCreateImage2D(
cl_context context,
cl_mem_flags flags,
const cl_image_format* image_format,
size_t image_width,
size_t image_height,
size_t image_row_pitch,
void* host_ptr,
cl_int* errcode_ret);
clCreateImage2D is deprecated by version 1.2. |
-
context is a valid OpenCL context on which the image object is to be created.
-
flags is a bit-field that is used to specify allocation and usage information about the image memory object being created and is described in the supported memory flag values table. If the value specified for flags is 0, the default is used which is
CL_MEM_
.READ_ WRITE -
image_format is a pointer to a structure that describes format properties of the image to be allocated. Refer to the Image Format Descriptor section for a detailed description of the image format descriptor.
-
image_width and image_height are the width and height of the image in pixels. These must be values greater than or equal to 1.
-
image_row_pitch is the scan-line pitch in bytes. This must be 0 if host_ptr is
NULL
and can be either 0 or ≥ image_width × size of element in bytes if host_ptr is notNULL
. If host_ptr is notNULL
and image_row_pitch is 0, image_row_pitch is calculated as image_width × size of element in bytes. If image_row_pitch is not 0, it must be a multiple of the image element size in bytes. -
host_ptr is a pointer to the image data that may already be allocated by the application. Refer to the
CL_MEM_
entry in the required host_ptr buffer size table for a description of how large the buffer that host_ptr points to must be. The image data specified by host_ptr is stored as a linear sequence of adjacent scanlines. Each scanline is a linear sequence of image elements. Image elements are stored according to their image format as described in the Image Format Descriptor section.OBJECT_ IMAGE2D -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
clCreateImage2D returns a valid non-zero image object created and the
errcode_ret is set to CL_SUCCESS
if the image object is created
successfully.
Otherwise, it returns a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if values specified in flags are not valid.VALUE -
CL_INVALID_
if values specified in image_format are not valid or if image_format isIMAGE_ FORMAT_ DESCRIPTOR NULL
. -
CL_INVALID_
if image_width or image_height are 0 or if they exceed the maximum values specified inIMAGE_ SIZE CL_DEVICE_
orIMAGE2D_ MAX_ WIDTH CL_DEVICE_
respectively for all devices in context or if values specified by image_row_pitch do not follow rules described in the argument description above.IMAGE2D_ MAX_ HEIGHT -
CL_INVALID_
if host_ptr isHOST_ PTR NULL
andCL_MEM_
orUSE_ HOST_ PTR CL_MEM_
are set in flags or if host_ptr is notCOPY_ HOST_ PTR NULL
butCL_MEM_
orCOPY_ HOST_ PTR CL_MEM_
are not set in flags.USE_ HOST_ PTR -
CL_IMAGE_
if there are no devices in context that support image_format.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for image object.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if there are no devices in context that support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
A 3D image object can be created using the following function
cl_mem clCreateImage3D(
cl_context context,
cl_mem_flags flags,
const cl_image_format* image_format,
size_t image_width,
size_t image_height,
size_t image_depth,
size_t image_row_pitch,
size_t image_slice_pitch,
void* host_ptr,
cl_int* errcode_ret);
clCreateImage3D is deprecated by version 1.2. |
-
context is a valid OpenCL context on which the image object is to be created.
-
flags is a bit-field that is used to specify allocation and usage information about the image memory object being created and is described in the supported memory flag values table. If the value specified for flags is 0, the default is used which is
CL_MEM_
.READ_ WRITE -
image_format is a pointer to a structure that describes format properties of the image to be allocated. Refer to the Image Format Descriptor section for a detailed description of the image format descriptor.
-
image_width and image_height are the width and height of the image in pixels. These must be values greater than or equal to 1.
-
image_depth is the depth of the image in pixels. For clCreateImage3D, this must be a value > 1.
-
image_row_pitch is the scan-line pitch in bytes. This must be 0 if host_ptr is
NULL
and can be either 0 or ≥ image_width × size of element in bytes if host_ptr is notNULL
. If host_ptr is notNULL
and image_row_pitch is 0, image_row_pitch is calculated as image_width × size of element in bytes. If image_row_pitch is not 0, it must be a multiple of the image element size in bytes. -
image_slice_pitch is the size in bytes of each 2D slice in the 3D image. This must be 0 if host_ptr is
NULL
and can be 0 or ≥ image_row_pitch × image_height if host_ptr is notNULL
. If host_ptr is notNULL
and image_slice_pitch is 0, image_slice_pitch is calculated as image_row_pitch × image_height. If image_slice_pitch is not 0, it must be a multiple of the image_row_pitch. -
host_ptr is a pointer to the image data that may already be allocated by the application. Refer to the
CL_MEM_
entry in the required host_ptr buffer size table for a description of how large the buffer that host_ptr points to must be. The image data specified by host_ptr is stored as a linear sequence of adjacent 2D slices. Each scanline is a linear sequence of image elements. Image elements are stored according to their image format as described in the Image Format Descriptor section.OBJECT_ IMAGE3D -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
clCreateImage3D returns a valid non-zero image object created and the
errcode_ret is set to CL_SUCCESS
if the image object is created
successfully.
Otherwise, it returns a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if values specified in flags are not valid.VALUE -
CL_INVALID_
if values specified in image_format are not valid or if image_format isIMAGE_ FORMAT_ DESCRIPTOR NULL
. -
CL_INVALID_
if image_width or image_height are 0 or if image_depth ≤ 1, or if they exceed the maximum values specified inIMAGE_ SIZE CL_DEVICE_
,IMAGE3D_ MAX_ WIDTH CL_DEVICE_
orIMAGE3D_ MAX_ HEIGHT CL_DEVICE_
respectively for all devices in context, or if values specified by image_row_pitch and image_slice_pitch do not follow rules described in the argument description above.IMAGE3D_ MAX_ DEPTH -
CL_INVALID_
if host_ptr isHOST_ PTR NULL
andCL_MEM_
orUSE_ HOST_ PTR CL_MEM_
are set in flags or if host_ptr is notCOPY_ HOST_ PTR NULL
butCL_MEM_
orCOPY_ HOST_ PTR CL_MEM_
are not set in flags.USE_ HOST_ PTR -
CL_IMAGE_
if there are no devices in context that support image_format.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for image object.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if there are no devices in context that support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.3.1.1. Image Format Descriptor
The cl_image_
image format descriptor structure describes an image
format, and is defined as:
typedef struct cl_image_format {
cl_channel_order image_channel_order;
cl_channel_type image_channel_data_type;
} cl_image_format;
-
image_channel_order
specifies the number of channels and the channel layout i.e. the memory layout in which channels are stored in the image. Valid values are described in the Image Channel Order table. -
image_channel_data_type
describes the size of the channel data type. The list of supported values is described in the Image Channel Data Types table. The number of bits per element determined by theimage_channel_data_type
andimage_channel_order
must be a power of two.
Image Channel Order | Description |
---|---|
Single channel image formats where the single channel represents a |
|
Missing before version 2.0. |
A single channel image format where the single channel represents a |
A single channel image format where the single channel represents a |
|
A single channel image format where the single channel represents an |
|
Two channel image formats.
The first channel always represents a |
|
Missing before version 1.1. |
A two channel image format, where the first channel represents a |
A three channel image format, where the three channels represent |
|
Missing before version 1.1. |
A three channel image format, where the first two channels represent |
|
Four channel image formats, where the four channels represent |
Missing before version 1.1. |
A four channel image format, where the first three channels represent |
Missing before version 2.0. |
A three channel image format, where the three channels represent |
Missing before version 2.0. |
Four channel image formats, where the first three channels represent |
Missing before version 2.0. |
A four channel image format, where the three channels represent |
Image Channel Data Type | Description |
---|---|
Each channel component is a normalized signed 8-bit integer value |
|
Each channel component is a normalized signed 16-bit integer value |
|
Each channel component is a normalized unsigned 8-bit integer value |
|
Each channel component is a normalized unsigned 16-bit integer value |
|
Represents a normalized 5-6-5 3-channel RGB image.
The channel order must be |
|
Represents a normalized x-5-5-5 4-channel xRGB image.
The channel order must be |
|
Represents a normalized x-10-10-10 4-channel xRGB image.
The channel order must be |
|
Missing before version 2.1. |
Represents a normalized 10-10-10-2 four-channel RGBA image.
The channel order must be |
Each channel component is an unnormalized signed 8-bit integer value |
|
Each channel component is an unnormalized signed 16-bit integer value |
|
Each channel component is an unnormalized signed 32-bit integer value |
|
Each channel component is an unnormalized unsigned 8-bit integer value |
|
Each channel component is an unnormalized unsigned 16-bit integer value |
|
Each channel component is an unnormalized unsigned 32-bit integer value |
|
Each channel component is a 16-bit half-float value |
|
Each channel component is a single precision floating-point value |
For example, to specify a normalized unsigned 8-bit / channel RGBA image,
image_channel_order
= CL_RGBA
, and image_channel_data_type
=
CL_UNORM_
.
The memory layout of this image format is described below:
R |
G |
B |
A |
… |
with the corresponding byte offsets
0 |
1 |
2 |
3 |
… |
Similar, if image_channel_order
= CL_RGBA
and image_channel_data_type
=
CL_SIGNED_
, the memory layout of this image format is described below:
R |
G |
B |
A |
… |
with the corresponding byte offsets
0 |
2 |
4 |
6 |
… |
image_channel_data_type
values of CL_UNORM_
, CL_UNORM_
,
CL_UNORM_
, and CL_UNORM_
are special cases of packed
image formats where the channels of each element are packed into a single
unsigned short or unsigned int.
For these special packed image formats, the channels are normally packed
with the first channel in the most significant bits of the bitfield, and
successive channels occupying progressively less significant locations.
For CL_UNORM_
, R is in bits 15:11, G is in bits 10:5 and B is in
bits 4:0.
For CL_UNORM_
, bit 15 is undefined, R is in bits 14:10, G in bits
9:5 and B in bits 4:0.
For CL_UNORM_
, bits 31:30 are undefined, R is in bits 29:20, G in
bits 19:10 and B in bits 9:0.
For CL_UNORM_
, R is in bits 31:22, G in bits 21:12, B in bits
11:2 and A in bits 1:0.
OpenCL implementations must maintain the minimum precision specified by the
number of bits in image_channel_data_type
.
If the image format specified by image_channel_order
, and
image_channel_data_type
cannot be supported by the OpenCL implementation,
then the call to clCreateImage, clCreateImageWithProperties,
clCreateImage2D, or clCreateImage3D will return a NULL
memory object.
5.3.1.2. Image Descriptor
The cl_image_
image descriptor structure describes the image type
and dimensions of an image or image array when creating an image using
clCreateImage or clCreateImageWithProperties, and is defined as:
typedef struct cl_image_desc {
cl_mem_object_type image_type;
size_t image_width;
size_t image_height;
size_t image_depth;
size_t image_array_size;
size_t image_row_pitch;
size_t image_slice_pitch;
cl_uint num_mip_levels;
cl_uint num_samples;
union {
cl_mem buffer;
cl_mem mem_object;
};
} cl_image_desc;
-
image_type
describes the image type and must be eitherCL_MEM_
,OBJECT_ IMAGE1D CL_MEM_
,OBJECT_ IMAGE1D_ BUFFER CL_MEM_
,OBJECT_ IMAGE1D_ ARRAY CL_MEM_
,OBJECT_ IMAGE2D CL_MEM_
, orOBJECT_ IMAGE2D_ ARRAY CL_MEM_
.OBJECT_ IMAGE3D -
image_width
is the width of the image in pixels. For a 2D image and image array, the image width must be a value ≥ 1 and ≤CL_DEVICE_
. For a 3D image, the image width must be a value ≥ 1 and ≤IMAGE2D_ MAX_ WIDTH CL_DEVICE_
. For a 1D image buffer, the image width must be a value ≥ 1 and ≤IMAGE3D_ MAX_ WIDTH CL_DEVICE_
. For a 1D image and 1D image array, the image width must be a value ≥ 1 and ≤IMAGE_ MAX_ BUFFER_ SIZE CL_DEVICE_
.IMAGE2D_ MAX_ WIDTH -
image_height
is the height of the image in pixels. This is only used if the image is a 2D or 3D image, or a 2D image array. For a 2D image or image array, the image height must be a value ≥ 1 and ≤CL_DEVICE_
. For a 3D image, the image height must be a value ≥ 1 and ≤IMAGE2D_ MAX_ HEIGHT CL_DEVICE_
.IMAGE3D_ MAX_ HEIGHT -
image_depth
is the depth of the image in pixels. This is only used if the image is a 3D image and must be a value ≥ 1 and ≤CL_DEVICE_
.IMAGE3D_ MAX_ DEPTH -
image_array_size
[17] is the number of images in the image array. This is only used if the image is a 1D or 2D image array. The values forimage_array_size
, if specified, must be a value ≥ 1 and ≤CL_DEVICE_
.IMAGE_ MAX_ ARRAY_ SIZE -
image_row_pitch
is the scan-line pitch in bytes. This must be 0 if host_ptr isNULL
and can be either 0 or ≥image_width
× size of element in bytes if host_ptr is notNULL
. If host_ptr is notNULL
andimage_row_pitch
= 0,image_row_pitch
is calculated asimage_width
× size of element in bytes. Ifimage_row_pitch
is not 0, it must be a multiple of the image element size in bytes. For a 2D image created from a buffer, the pitch specified (or computed if pitch specified is 0) must be a multiple of the maximum of theCL_DEVICE_
value for all devices in the context associated with the buffer specified byIMAGE_ PITCH_ ALIGNMENT mem_object
that support images. -
image_slice_pitch
is the size in bytes of each 2D slice in the 3D image or the size in bytes of each image in a 1D or 2D image array. This must be 0 if host_ptr isNULL
. If host_ptr is notNULL
,image_slice_pitch
can be either 0 or ≥image_row_pitch
×image_height
for a 2D image array or 3D image and can be either 0 or ≥image_row_pitch
for a 1D image array. If host_ptr is notNULL
andimage_slice_pitch
= 0,image_slice_pitch
is calculated asimage_row_pitch
×image_height
for a 2D image array or 3D image andimage_row_pitch
for a 1D image array. Ifimage_slice_pitch
is not 0, it must be a multiple of theimage_row_pitch
. -
num_mip_levels
andnum_samples
must be 0. -
mem_object
may refer to a valid buffer or image memory object.mem_object
can be a buffer memory object ifimage_type
isCL_MEM_
orOBJECT_ IMAGE1D_ BUFFER CL_MEM_
[18].OBJECT_ IMAGE2D mem_object
can be an image object ifimage_type
isCL_MEM_
[19]. Otherwise it must beOBJECT_ IMAGE2D NULL
. The image pixels are taken from the memory objects data store. When the contents of the specified memory objects data store are modified, those changes are reflected in the contents of the image object and vice-versa at corresponding synchronization points.
For a 1D image buffer created from a buffer object, the image_width
×
size of element in bytes must be ≤ size of the buffer object.
The image data in the buffer object is stored as a single scanline which is
a linear sequence of adjacent elements.
For a 2D image created from a buffer object, the image_row_pitch
×
image_height
must be ≤ size of the buffer object specified by
mem_object
.
The image data in the buffer object is stored as a linear sequence of
adjacent scanlines.
Each scanline is a linear sequence of image elements padded to
image_row_pitch
bytes.
For an image object created from another image object, the values specified
in the image descriptor except for mem_object
must match the image
descriptor information associated with mem_object
.
Image elements are stored according to their image format as described in Image Format Descriptor.
If the buffer object specified by mem_object
was created with
CL_MEM_
, the host_ptr specified to clCreateBuffer or
clCreateBufferWithProperties must be aligned to the maximum of the
CL_DEVICE_
value for all devices in the
context associated with the buffer specified by mem_object
that
support images.
Creating a 2D image object from another 2D image object creates a new
2D image object that shares the image data store with mem_object
but views
the pixels in the image with a different image channel order.
Restrictions are:
-
All of the values specified in image_desc must match the image descriptor information associated with
mem_object
, except formem_object
. -
The image channel data type specified in image_format must match the image channel data type associated with
mem_object
. -
The image channel order specified in image_format must be compatible with the image channel order associated with
mem_object
. Compatible image channel orders [20] are:
Image Channel Order in image_format: | Image Channel Order associated with mem_object : |
---|---|
Concurrent reading from, writing to and copying between both a buffer object and 1D image buffer or 2D image object associated with the buffer object is undefined. Only reading from both a buffer object and 1D image buffer or 2D image object associated with the buffer object is defined. Writing to an image created from a buffer and then reading from this buffer in a kernel even if appropriate synchronization operations (such as a barrier) are performed between the writes and reads is undefined. Similarly, writing to the buffer and reading from the image created from this buffer with appropriate synchronization between the writes and reads is undefined. |
5.3.2. Querying List of Supported Image Formats
To get the list of image formats supported by an OpenCL implementation for a specified context, image type, and allocation information, call the function
cl_int clGetSupportedImageFormats(
cl_context context,
cl_mem_flags flags,
cl_mem_object_type image_type,
cl_uint num_entries,
cl_image_format* image_formats,
cl_uint* num_image_formats);
-
context is a valid OpenCL context on which the image object(s) will be created.
-
flags is a bit-field that is used to specify usage information about the image formats being queried and is described in the Memory Flags table. flags may be
CL_MEM_
to query image formats that may be read from and written to by different kernel instances when correctly ordered by event dependencies, orREAD_ WRITE CL_MEM_
to query image formats that may be read from by a kernel, orREAD_ ONLY CL_MEM_
to query image formats that may be written to by a kernel, orWRITE_ ONLY CL_MEM_
to query image formats that may be both read from and written to by the same kernel instance. Please see Image Format Mapping for clarification.KERNEL_ READ_ AND_ WRITE -
image_type describes the image type and must be either
CL_MEM_
,OBJECT_ IMAGE1D CL_MEM_
,OBJECT_ IMAGE1D_ BUFFER CL_MEM_
,OBJECT_ IMAGE2D CL_MEM_
,OBJECT_ IMAGE3D CL_MEM_
, orOBJECT_ IMAGE1D_ ARRAY CL_MEM_
.OBJECT_ IMAGE2D_ ARRAY -
num_entries specifies the number of entries that can be returned in the memory location given by image_formats.
-
image_formats is a pointer to a memory location where the list of supported image formats are returned. Each entry describes a
cl_image_
structure supported by the OpenCL implementation. If image_formats isformat NULL
, it is ignored. -
num_image_formats is the actual number of supported image formats for a specific context and values specified by flags. If num_image_formats is
NULL
, it is ignored.
clGetSupportedImageFormats returns a union of image formats supported by all devices in the context.
clGetSupportedImageFormats returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if flags or image_type are not valid, or if num_entries is 0 and image_formats is notVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
If CL_DEVICE_
specified in the Device
Queries table is CL_TRUE
, the values assigned to
CL_DEVICE_
, CL_DEVICE_
CL_DEVICE_
, CL_DEVICE_
CL_DEVICE_
, CL_DEVICE_
CL_DEVICE_
, and CL_DEVICE_
by the implementation
must be greater than or equal to the minimum values specified in the
Device Queries table.
5.3.2.1. Minimum List of Supported Image Formats
The tables below describe the required minimum lists of supported image formats. To query all image formats supported by an implementation, call the function clGetSupportedImageFormats.
For full profile devices supporting OpenCL 2.0, 2.1, or 2.2, the minimum list of supported image formats for either reading or writing in a kernel is:
num_channels | channel_order | channel_data_type |
---|---|---|
1 |
|
|
1 |
||
2 |
|
|
4 |
|
|
4 |
||
4 |
For full profile devices supporting other OpenCL versions, such as OpenCL 1.2 or OpenCL 3.0, the minimum list of supported image formats for either reading or writing in a kernel is:
num_channels | channel_order | channel_data_type |
---|---|---|
4 |
|
|
4 |
For full profile devices that support reading from and writing to the same
image object from the same kernel instance (see CL_DEVICE_
),
the minimum list of supported image formats for reading and writing in
the same kernel instance is:
num_channels | channel_order | channel_data_type |
---|---|---|
1 |
|
|
4 |
|
5.3.2.2. Image format mapping to OpenCL kernel language image access qualifiers
Image arguments to kernels may have the read_only
, write_only
or
read_write
qualifier.
Not all image formats supported by the device and platform are valid to be
passed to all of these access qualifiers.
For each access qualifier, only images whose format is in the list of
formats returned by clGetSupportedImageFormats with the given flag
arguments in the Image Format Mapping table
are permitted.
It is not valid to pass an image supporting writing as both a read_only
image and a write_only
image parameter, or to a read_write
image
parameter and any other image parameter.
Access Qualifier | Memory Flags |
---|---|
|
|
|
|
|
5.3.3. Reading, Writing and Copying Image Objects
The following functions enqueue commands to read from an image or image array object to host memory or write to an image or image array object from host memory.
cl_int clEnqueueReadImage(
cl_command_queue command_queue,
cl_mem image,
cl_bool blocking_read,
const size_t* origin,
const size_t* region,
size_t row_pitch,
size_t slice_pitch,
void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
cl_int clEnqueueWriteImage(
cl_command_queue command_queue,
cl_mem image,
cl_bool blocking_write,
const size_t* origin,
const size_t* region,
size_t input_row_pitch,
size_t input_slice_pitch,
const void* ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
command_queue refers to the host command-queue in which the read / write command will be queued. command_queue and image must be created with the same OpenCL context.
-
image refers to a valid image or image array object.
-
blocking_read and blocking_write indicate if the read and write operations are blocking or non-blocking.
-
origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.
-
region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1. The values in region cannot be 0.
-
row_pitch in clEnqueueReadImage and input_row_pitch in clEnqueueWriteImage is the length of each row in bytes. This value must be greater than or equal to the element size in bytes × width. If row_pitch (or input_row_pitch) is set to 0, the appropriate row pitch is calculated based on the size of each element in bytes multiplied by width.
-
slice_pitch in clEnqueueReadImage and input_slice_pitch in clEnqueueWriteImage is the size in bytes of the 2D slice of the 3D region of a 3D image or each image of a 1D or 2D image array being read or written respectively. This must be 0 if image is a 1D or 2D image. Otherwise this value must be greater than or equal to row_pitch × height. If slice_pitch (or input_slice_pitch) is set to 0, the appropriate slice pitch is calculated based on the row_pitch × height.
-
ptr is the pointer to a buffer in host memory where image data is to be read from or to be written to. The alignment requirements for ptr are specified in Alignment of Application Data Types.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this read / write command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
If blocking_read is CL_TRUE
i.e. the read command is blocking,
clEnqueueReadImage does not return until the buffer data has been read and
copied into memory pointed to by ptr.
If blocking_read is CL_FALSE
i.e. the read command is non-blocking,
clEnqueueReadImage queues a non-blocking read command and returns.
The contents of the buffer that ptr points to cannot be used until the
read command has completed.
The event argument returns an event object which can be used to query the
execution status of the read command.
When the read command has completed, the contents of the buffer that ptr
points to can be used by the application.
If blocking_write is CL_TRUE
, the write command is blocking and does not
return until the command is complete, including transfer of the data.
The memory pointed to by ptr can be reused by the application after the
clEnqueueWriteImage call returns.
If blocking_write is CL_FALSE
, the OpenCL implementation will use ptr to
perform a non-blocking write.
As the write is non-blocking the implementation can return immediately.
The memory pointed to by ptr cannot be reused by the application after the
call returns.
The event argument returns an event object which can be used to query the
execution status of the write command.
When the write command has completed, the memory pointed to by ptr can
then be reused by the application.
clEnqueueReadImage and clEnqueueWriteImage return CL_SUCCESS
if the
function is executed successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue and image are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if image is not a valid image object.MEM_ OBJECT -
CL_INVALID_
if origin or region isVALUE NULL
. -
CL_INVALID_
if the region being read or written specified by origin and region is out of bounds.VALUE -
CL_INVALID_
if values in origin and region do not follow rules described in the argument description for origin and region.VALUE -
CL_INVALID_
if image is a 1D or 2D image and slice_pitch or input_slice_pitch is not 0.VALUE -
CL_INVALID_
if ptr isVALUE NULL
. -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_INVALID_
if image dimensions (image width, height, specified or compute row and/or slice pitch) for image are not supported by device associated with queue.IMAGE_ SIZE -
CL_IMAGE_
if image format (image channel order and data type) for image are not supported by device associated with queue.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for data store associated with image.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if the device associated with command_queue does not support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_INVALID_
if clEnqueueReadImage is called on image which has been created withOPERATION CL_MEM_
orHOST_ WRITE_ ONLY CL_MEM_
.HOST_ NO_ ACCESS -
CL_INVALID_
if clEnqueueWriteImage is called on image which has been created withOPERATION CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
.HOST_ NO_ ACCESS -
CL_EXEC_
if the read and write operations are blocking and the execution status of any of the events in event_wait_list is a negative integer value. This error code is missing before version 1.1.STATUS_ ERROR_ FOR_ EVENTS_ IN_ WAIT_ LIST -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Calling clEnqueueReadImage to read a region of the image with the ptr
argument value set to host_ptr + (origin[2] × image slice pitch
+ origin[1] × image row pitch + origin[0] × bytes
per pixel), where host_ptr is a pointer to the memory region specified
when the image being read is created with
Calling clEnqueueWriteImage to update the latest bits in a region of the
image with the ptr argument value set to host_ptr + (origin[2]
× image slice pitch + origin[1] × image row pitch +
origin[0] × bytes per pixel), where host_ptr is a pointer to the
memory region specified when the image being written is created with
|
To enqueue a command to copy image objects, call the function
cl_int clEnqueueCopyImage(
cl_command_queue command_queue,
cl_mem src_image,
cl_mem dst_image,
const size_t* src_origin,
const size_t* dst_origin,
const size_t* region,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
src_image and dst_image can be 1D, 2D, 3D image or a 1D, 2D image array objects. It is possible to copy subregions between any combinations of source and destination types, provided that the dimensions of the subregions are the same e.g., one can copy a rectangular region from a 2D image to a slice of a 3D image.
-
command_queue refers to the host command-queue in which the copy command will be queued. The OpenCL context associated with command_queue, src_image and dst_image must be the same.
-
src_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, src_origin[2] must be 0. If src_image is a 1D image object, src_origin[1] and src_origin[2] must be 0. If src_image is a 1D image array object, src_origin[2] must be 0. If src_image is a 1D image array object, src_origin[1] describes the image index in the 1D image array. If src_image is a 2D image array object, src_origin[2] describes the image index in the 2D image array.
-
dst_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If dst_image is a 2D image object, dst_origin[2] must be 0. If dst_image is a 1D image or 1D image buffer object, dst_origin[1] and dst_origin[2] must be 0. If dst_image is a 1D image array object, dst_origin[2] must be 0. If dst_image is a 1D image array object, dst_origin[1] describes the image index in the 1D image array. If dst_image is a 2D image array object, dst_origin[2] describes the image index in the 2D image array.
-
region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If src_image or dst_image is a 2D image object, region[2] must be 1. If src_image or dst_image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If src_image or dst_image is a 1D image array object, region[2] must be 1. The values in region cannot be 0.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this copy command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
It is currently a requirement that the src_image and dst_image image
memory objects for clEnqueueCopyImage must have the exact same image
format (i.e. the cl_image_
descriptor specified when src_image and
dst_image are created must match).
clEnqueueCopyImage returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue, src_image and dst_image are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if src_image and dst_image are not valid image objects.MEM_ OBJECT -
CL_IMAGE_
if src_image and dst_image do not use the same image format.FORMAT_ MISMATCH -
CL_INVALID_
if src_origin, dst_origin, or region isVALUE NULL
. -
CL_INVALID_
if the 2D or 3D rectangular region specified by src_origin and src_origin + region refers to a region outside src_image, or if the 2D or 3D rectangular region specified by dst_origin and dst_origin + region refers to a region outside dst_image.VALUE -
CL_INVALID_
if values in src_origin, dst_origin and region do not follow rules described in the argument description for src_origin, dst_origin and region.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_INVALID_
if image dimensions (image width, height, specified or compute row and/or slice pitch) for src_image or dst_image are not supported by device associated with queue.IMAGE_ SIZE -
CL_IMAGE_
if image format (image channel order and data type) for src_image or dst_image are not supported by device associated with queue.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for data store associated with src_image or dst_image.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY -
CL_INVALID_
if the device associated with command_queue does not support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_MEM_
if src_image and dst_image are the same image object and the source and destination regions overlap.COPY_ OVERLAP
5.3.4. Filling Image Objects
Filling image objects is missing before version 1.2. |
To enqueue a command to fill an image object with a specified color, call the function
cl_int clEnqueueFillImage(
cl_command_queue command_queue,
cl_mem image,
const void* fill_color,
const size_t* origin,
const size_t* region,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
clEnqueueFillImage is missing before version 1.2. |
-
command_queue refers to the host command-queue in which the fill command will be queued. The OpenCL context associated with command_queue and image must be the same.
-
image is a valid image object.
-
fill_color is the color used to fill the image. The fill color is a single floating point value if the channel order is
CL_DEPTH
. Otherwise, the fill color is a four component RGBA floating-point color value if the image channel data type is not an unnormalized signed or unsigned integer type, is a four component signed integer value if the image channel data type is an unnormalized signed integer type and is a four component unsigned integer value if the image channel data type is an unnormalized unsigned integer type. The fill color will be converted to the appropriate image channel format and order associated with image. -
origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.
-
region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1. The values in region cannot be 0.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
The usage information which indicates whether the memory object can be read
or written by a kernel and/or the host and is given by the cl_mem_
argument value specified when image is created is ignored by
clEnqueueFillImage.
clEnqueueFillImage returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue and image are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if image is not a valid image object.MEM_ OBJECT -
CL_INVALID_
if fill_color isVALUE NULL
. -
CL_INVALID_
if origin or region isVALUE NULL
. -
CL_INVALID_
if the region being filled as specified by origin and region is out of bounds.VALUE -
CL_INVALID_
if values in origin and region do not follow rules described in the argument description for origin and region.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_INVALID_
if image dimensions (image width, height, specified or compute row and/or slice pitch) for image are not supported by device associated with queue.IMAGE_ SIZE -
CL_IMAGE_
if image format (image channel order and data type) for image are not supported by device associated with queue.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for data store associated with image.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.3.5. Copying between Image and Buffer Objects
To enqueue a command to copy an image object to a buffer object, call the function
cl_int clEnqueueCopyImageToBuffer(
cl_command_queue command_queue,
cl_mem src_image,
cl_mem dst_buffer,
const size_t* src_origin,
const size_t* region,
size_t dst_offset,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
command_queue must be a valid host command-queue. The OpenCL context associated with command_queue, src_image and dst_buffer must be the same.
-
src_image is a valid image object.
-
dst_buffer is a valid buffer object.
-
src_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If src_image is a 2D image object, src_origin[2] must be 0. If src_image is a 1D image or 1D image buffer object, src_origin[1] and src_origin[2] must be 0. If src_image is a 1D image array object, src_origin[2] must be 0. If src_image is a 1D image array object, src_origin[1] describes the image index in the 1D image array. If src_image is a 2D image array object, src_origin[2] describes the image index in the 2D image array.
-
region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If src_image is a 2D image object, region[2] must be 1. If src_image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If src_image is a 1D image array object, region[2] must be 1. The values in region cannot be 0.
-
dst_offset refers to the offset where to begin copying data into dst_buffer. The size in bytes of the region to be copied referred to as dst_cb is computed as width × height × depth × bytes/image element if src_image is a 3D image object, is computed as width × height × bytes/image element if src_image is a 2D image, is computed as width × height × arraysize × bytes/image element if src_image is a 2D image array object, is computed as width × bytes/image element if src_image is a 1D image or 1D image buffer object and is computed as width × arraysize × bytes/image element if src_image is a 1D image array object.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this copy command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
clEnqueueCopyImageToBuffer returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue, src_image and dst_buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if src_image is not a valid image object or dst_buffer is not a valid buffer object or if src_image is a 1D image buffer object created from dst_buffer.MEM_ OBJECT -
CL_INVALID_
if src_origin or region isVALUE NULL
. -
CL_INVALID_
if the 1D, 2D or 3D rectangular region specified by src_origin and src_origin + region refers to a region outside src_image, or if the region specified by dst_offset and dst_offset + dst_cb to a region outside dst_buffer.VALUE -
CL_INVALID_
if values in src_origin and region do not follow rules described in the argument description for src_origin and region.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if dst_buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_INVALID_
if image dimensions (image width, height, specified or compute row and/or slice pitch) for src_image are not supported by device associated with queue.IMAGE_ SIZE -
CL_IMAGE_
if image format (image channel order and data type) for src_image are not supported by device associated with queue.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for data store associated with src_image or dst_buffer.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if the device associated with command_queue does not support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
To enqueue a command to copy a buffer object to an image object, call the function
cl_int clEnqueueCopyBufferToImage(
cl_command_queue command_queue,
cl_mem src_buffer,
cl_mem dst_image,
size_t src_offset,
const size_t* dst_origin,
const size_t* region,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
command_queue must be a valid host command-queue. The OpenCL context associated with command_queue, src_buffer and dst_image must be the same.
-
src_buffer is a valid buffer object.
-
dst_image is a valid image object.
-
src_offset refers to the offset where to begin copying data from src_buffer.
-
dst_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If dst_image is a 2D image object, dst_origin[2] must be 0. If dst_image is a 1D image or 1D image buffer object, dst_origin[1] and dst_origin[2] must be 0. If dst_image is a 1D image array object, dst_origin[2] must be 0. If dst_image is a 1D image array object, dst_origin[1] describes the image index in the 1D image array. If dst_image is a 2D image array object, dst_origin[2] describes the image index in the 2D image array.
-
region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If dst_image is a 2D image object, region[2] must be 1. If dst_image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If dst_image is a 1D image array object, region[2] must be 1. The values in region cannot be 0.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this copy command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
The size in bytes of the region to be copied from src_buffer referred to as src_cb is computed as width × height × depth × bytes/image element if dst_image is a 3D image object, is computed as width × height × bytes/image element if dst_image is a 2D image, is computed as width × height × arraysize × bytes/image element if dst_image is a 2D image array object, is computed as width × bytes/image element if dst_image is a 1D image or 1D image buffer object and is computed as width × arraysize × bytes/image element if dst_image is a 1D image array object.
clEnqueueCopyBufferToImage returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue, src_buffer and dst_image are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if src_buffer is not a valid buffer object or dst_image is not a valid image object or if dst_image is a 1D image buffer object created from src_buffer.MEM_ OBJECT -
CL_INVALID_
if dst_origin or region isVALUE NULL
. -
CL_INVALID_
if the 1D, 2D or 3D rectangular region specified by dst_origin and dst_origin + region refer to a region outside dst_image, or if the region specified by src_offset and src_offset + src_cb refer to a region outside src_buffer.VALUE -
CL_INVALID_
if values in dst_origin and region do not follow rules described in the argument description for dst_origin and region.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MISALIGNED_
if src_buffer is a sub-buffer object and offset specified when the sub-buffer object is created is not aligned toSUB_ BUFFER_ OFFSET CL_DEVICE_
value for device associated with queue. This error code is missing before version 1.1.MEM_ BASE_ ADDR_ ALIGN -
CL_INVALID_
if image dimensions (image width, height, specified or compute row and/or slice pitch) for dst_image are not supported by device associated with queue.IMAGE_ SIZE -
CL_IMAGE_
if image format (image channel order and data type) for dst_image are not supported by device associated with queue.FORMAT_ NOT_ SUPPORTED -
CL_MEM_
if there is a failure to allocate memory for data store associated with src_buffer or dst_image.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if the device associated with command_queue does not support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.3.6. Mapping Image Objects
To enqueue a command to map a region in the image object given by image into the host address space and returns a pointer to this mapped region, call the function
void* clEnqueueMapImage(
cl_command_queue command_queue,
cl_mem image,
cl_bool blocking_map,
cl_map_flags map_flags,
const size_t* origin,
const size_t* region,
size_t* image_row_pitch,
size_t* image_slice_pitch,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event,
cl_int* errcode_ret);
-
command_queue must be a valid host command-queue.
-
image is a valid image object. The OpenCL context associated with command_queue and image must be the same.
-
blocking_map indicates if the map operation is blocking or non-blocking.
-
map_flags is a bit-field and is described in the Memory Map Flags table.
-
origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset and the image index in the 2D image array or the (x) offset and the image index in the 1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D image array object, origin[2] must be 0. If image is a 1D image array object, origin[1] describes the image index in the 1D image array. If image is a 2D image array object, origin[2] describes the image index in the 2D image array.
-
region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the (width, height) in pixels of the 2D rectangle and the number of images of a 2D image array or the (width) in pixels of the 1D rectangle and the number of images of a 1D image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image array object, region[2] must be 1. The values in region cannot be 0.
-
image_row_pitch returns the scan-line pitch in bytes for the mapped region. This must be a non-
NULL
value. -
image_slice_pitch returns the size in bytes of each 2D slice of a 3D image or the size of each 1D or 2D image in a 1D or 2D image array for the mapped region. For a 1D and 2D image, zero is returned if this argument is not
NULL
. For a 3D image, 1D and 2D image array, image_slice_pitch must be a non-NULL
value. -
event_wait_list and num_events_in_wait_list specify events that need to complete before clEnqueueMapImage can be executed. If event_wait_list is
NULL
, then clEnqueueMapImage does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array. -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
If blocking_map is CL_TRUE
, clEnqueueMapImage does not return until the
specified region in image is mapped into the host address space and the
application can access the contents of the mapped region using the pointer
returned by clEnqueueMapImage.
If blocking_map is CL_FALSE
i.e. map operation is non-blocking, the
pointer to the mapped region returned by clEnqueueMapImage cannot be used
until the map command has completed.
The event argument returns an event object which can be used to query the
execution status of the map command.
When the map command is completed, the application can access the contents
of the mapped region using the pointer returned by clEnqueueMapImage.
clEnqueueMapImage will return a pointer to the mapped region.
The errcode_ret is set to CL_SUCCESS
.
A NULL
pointer is returned otherwise with one of the following error
values returned in errcode_ret:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if context associated with command_queue and image are not the same or if context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if image is not a valid image object.MEM_ OBJECT -
CL_INVALID_
if origin or region isVALUE NULL
. -
CL_INVALID_
if region being mapped given by (origin, origin + region) is out of bounds or if values specified in map_flags are not valid.VALUE -
CL_INVALID_
if values in origin and region do not follow rules described in the argument description for origin and region.VALUE -
CL_INVALID_
if image_row_pitch isVALUE NULL
. -
CL_INVALID_
if image is a 3D image, 1D or 2D image array object and image_slice_pitch isVALUE NULL
. -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_INVALID_
if image dimensions (image width, height, specified or compute row and/or slice pitch) for image are not supported by device associated with queue.IMAGE_ SIZE -
CL_IMAGE_
if image format (image channel order and data type) for image are not supported by device associated with queue.FORMAT_ NOT_ SUPPORTED -
CL_MAP_
if there is a failure to map the requested region into the host address space. This error cannot occur for image objects created withFAILURE CL_MEM_
orUSE_ HOST_ PTR CL_MEM_
.ALLOC_ HOST_ PTR -
CL_EXEC_
if the map operation is blocking and the execution status of any of the events in event_wait_list is a negative integer value. This error code is missing before version 1.1.STATUS_ ERROR_ FOR_ EVENTS_ IN_ WAIT_ LIST -
CL_MEM_
if there is a failure to allocate memory for data store associated with image.OBJECT_ ALLOCATION_ FAILURE -
CL_INVALID_
if the device associated with command_queue does not support images (i.e.OPERATION CL_DEVICE_
specified in the Device Queries table isIMAGE_ SUPPORT CL_FALSE
). -
CL_INVALID_
if image has been created withOPERATION CL_MEM_
orHOST_ WRITE_ ONLY CL_MEM_
andHOST_ NO_ ACCESS CL_MAP_
is set in map_flags or if image has been created withREAD CL_MEM_
orHOST_ READ_ ONLY CL_MEM_
andHOST_ NO_ ACCESS CL_MAP_
orWRITE CL_MAP_
is set in map_flags.WRITE_ INVALIDATE_ REGION -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY -
CL_INVALID_
if mapping would lead to overlapping regions being mapped for writing.OPERATION
The pointer returned maps a 1D, 2D or 3D region starting at origin and is at least region[0] pixels in size for a 1D image, 1D image buffer or 1D image array, (image_row_pitch × region[1]) pixels in size for a 2D image or 2D image array, and (image_slice_pitch × region[2]) pixels in size for a 3D image. The result of a memory access outside this region is undefined.
If the image object is created with CL_MEM_
set in mem_flags,
the following will be true:
-
The host_ptr specified in clCreateImage, clCreateImageWithProperties, clCreateImage2D, or clCreateImage3D is guaranteed to contain the latest bits in the region being mapped when the clEnqueueMapImage command has completed.
-
The pointer value returned by clEnqueueMapImage will be derived from the host_ptr specified when the image object is created.
Mapped image objects are unmapped using clEnqueueUnmapMemObject. This is described in Unmapping Mapped Memory Objects.
5.3.7. Image Object Queries
To get information that is common to all memory objects, use the clGetMemObjectInfo function described in Memory Object Queries.
To get information specific to an image object created with clCreateImage, clCreateImageWithProperties, clCreateImage2D, or clCreateImage3D call the function
cl_int clGetImageInfo(
cl_mem image,
cl_image_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
-
image specifies the image object being queried.
-
param_name specifies the information to query. The list of supported param_name types and the information returned in param_value by clGetImageInfo is described in the Image Object Queries table.
-
param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is
NULL
, it is ignored. -
param_value_size is used to specify the size in bytes of memory pointed to by param_value. This size must be ≥ size of return type as described in the Image Object Queries table.
-
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
Image Info | Return type | Description |
---|---|---|
Return the image format descriptor specified when image is created with clCreateImage, clCreateImageWithProperties, clCreateImage2D or clCreateImage3D. |
||
|
Return size of each element of the image memory object given by image in bytes. |
|
|
Returns the row pitch in bytes of a row of elements of the
image object given by image. |
|
|
Returns the slice pitch in bytes of a 2D slice for the 3D
image object or size of each image in a 1D or 2D image array given
by image. |
|
|
Return width of the image in pixels. |
|
|
Return height of the image in pixels. For a 1D image, 1D image buffer and 1D image array object, height = 0. |
|
|
Return depth of the image in pixels. For a 1D image, 1D image buffer, 2D image or 1D and 2D image array object, depth = 0. |
|
Missing before version 1.2. |
|
Return number of images in the image array. If image is not an image array, 0 is returned. |
Missing before version 1.2 and deprecated by version 2.0. |
|
Return buffer object associated with image. |
Missing before version 1.2. |
|
Return |
Missing before version 1.2. |
|
Return |
clGetImageInfo returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if param_name is not valid, or if size in bytes specified by param_value_size is < size of return type as described in the Image Object Queries table and param_value is notVALUE NULL
. -
CL_INVALID_
if image is a not a valid image object.MEM_ OBJECT -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.4. Pipes
Pipes are missing before version 2.0. |
A pipe is a memory object that stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. A pipe object encapsulates the following information:
-
Packet size in bytes
-
Maximum capacity in packets
-
Information about the number of packets currently in the pipe
-
Data packets
5.4.1. Creating Pipe Objects
To create a pipe object, call the function
cl_mem clCreatePipe(
cl_context context,
cl_mem_flags flags,
cl_uint pipe_packet_size,
cl_uint pipe_max_packets,
const cl_pipe_properties* properties,
cl_int* errcode_ret);
clCreatePipe is missing before version 2.0. |
-
context is a valid OpenCL context used to create the pipe object.
-
flags is a bit-field that is used to specify allocation and usage information such as the memory arena that should be used to allocate the pipe object and how it will be used. The Memory Flags table describes the possible values for flags. Only
CL_MEM_
andREAD_ WRITE CL_MEM_
can be specified when creating a pipe object. If the value specified for flags is 0, the default is used which isHOST_ NO_ ACCESS CL_MEM_
|READ_ WRITE CL_MEM_
.HOST_ NO_ ACCESS -
pipe_packet_size is the size in bytes of a pipe packet.
-
pipe_max_packets specifies the pipe capacity by specifying the maximum number of packets the pipe can hold.
-
properties specifies a list of properties for the pipe and their corresponding values. Each property name is immediately followed by the corresponding desired value. The list is terminated with 0. Currently, in all OpenCL versions, properties must be
NULL
. -
errcode_ret will return an appropriate error code. If errcode_ret is
NULL
, no error code is returned.
clCreatePipe returns a valid non-zero pipe object and errcode_ret is set
to CL_SUCCESS
if the pipe object is created successfully.
Otherwise, it returns a NULL
value with one of the following error values
returned in errcode_ret:
-
CL_INVALID_
if context is not a valid context.CONTEXT -
CL_INVALID_
if no devices in context support pipes.OPERATION -
CL_INVALID_
if values specified in flags are not as defined above.VALUE -
CL_INVALID_
if properties is notVALUE NULL
. -
CL_INVALID_
if pipe_packet_size is 0 or the pipe_packet_size exceedsPIPE_ SIZE CL_DEVICE_
value specified in the Device Queries table for all devices in context or if pipe_max_packets is 0.PIPE_ MAX_ PACKET_ SIZE -
CL_MEM_
if there is a failure to allocate memory for the pipe object.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Pipes follow the same memory consistency model as defined for buffer and image objects. The pipe state i.e. contents of the pipe across kernel-instances (on the same or different devices) is enforced at a synchronization point.
5.4.2. Pipe Object Queries
To get information that is common to all memory objects, use the clGetMemObjectInfo function described in Memory Object Queries.
To get information specific to a pipe object created with clCreatePipe, call the function
cl_int clGetPipeInfo(
cl_mem pipe,
cl_pipe_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
clGetPipeInfo is missing before version 2.0. |
-
pipe specifies the pipe object being queried.
-
param_name specifies the information to query. The list of supported param_name types and the information returned in param_value by clGetPipeInfo is described in the Pipe Object Queries table.
-
param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is
NULL
, it is ignored. -
param_value_size is used to specify the size in bytes of memory pointed to by param_value. This size must be ≥ size of return type as described in the Pipe Object Queries table.
-
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
clGetPipeInfo returns CL_SUCCESS
if the function is executed successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if pipe is a not a valid pipe object.MEM_ OBJECT -
CL_INVALID_
if param_name is not valid, or if size in bytes specified by param_value_size is < size of return type as described in the Pipe Object Queries table and param_value is notVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
Pipe Info | Return type | Description |
---|---|---|
Missing before version 2.0. |
|
Return pipe packet size specified when pipe is created with clCreatePipe. |
Missing before version 2.0. |
|
Return max. number of packets specified when pipe is created with clCreatePipe. |
Missing before version 3.0. |
|
Return the properties argument specified in clCreatePipe. If the properties argument specified in clCreatePipe used to
create pipe was not If the properties argument specified in clCreatePipe used to
create pipe was |
5.5. Querying, Unmapping, Migrating, Retaining and Releasing Memory Objects
5.5.1. Retaining and Releasing Memory Objects
To retain a memory object, call the function
cl_int clRetainMemObject(
cl_mem memobj);
-
memobj specifies the memory object to be retained.
The memobj reference count is incremented.
clRetainMemObject returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if memobj is not a valid memory object (buffer or image object).MEM_ OBJECT -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
clCreateBuffer, clCreateBufferWithProperties, clCreateSubBuffer, clCreateImage, clCreateImageWithProperties, clCreateImage2D, clCreateImage3D and clCreatePipe perform an implicit retain.
To release a memory object, call the function
cl_int clReleaseMemObject(
cl_mem memobj);
-
memobj specifies the memory object to be released.
The memobj reference count is decremented.
clReleaseMemObject returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if memobj is not a valid memory object.MEM_ OBJECT -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
After the memobj reference count becomes zero and commands queued for execution on a command-queue(s) that use memobj have finished, the memory object is deleted. If memobj is a buffer object, memobj cannot be deleted until all sub-buffer objects associated with memobj are deleted. Using this function to release a reference that was not obtained by creating the object or by calling clRetainMemObject causes undefined behavior.
To register a callback function with a memory object that is called when the memory object is destroyed, call the function
cl_int clSetMemObjectDestructorCallback(
cl_mem memobj,
void (CL_CALLBACK* pfn_notify)(cl_mem memobj, void* user_data),
void* user_data);
clSetMemObjectDestructorCallback is missing before version 1.1. |
-
memobj specifies the memory object to register the callback to.
-
pfn_notify is the callback function to register. This callback function may be called asynchronously by the OpenCL implementation. It is the application’s responsibility to ensure that the callback function is thread-safe. The parameters to this callback function are:
-
memobj is the memory object being deleted. When the callback function is called by the implementation, this memory object is not longer valid. memobj is only provided for reference purposes.
-
user_data is a pointer to user-supplied data.
-
-
user_data will be passed as the user_data argument when pfn_notify is called. user_data can be
NULL
.
Each call to clSetMemObjectDestructorCallback registers the specified callback function on a destructor callback stack associated with memobj. The registered callback functions are called in the reverse order in which they were registered. The registered callback functions are called and then the memory object’s resources are freed and the memory object is deleted. Therefore, the memory object destructor callback provides a mechanism for an application to safely re-use or free a host_ptr that was specified when memobj was created and used as the storage bits for the memory object.
clSetMemObjectDestructorCallback returns CL_SUCCESS
if the function is
executed successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if memobj is not a valid memory object.MEM_ OBJECT -
CL_INVALID_
if pfn_notify isVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
When the user callback function is called by the implementation, the
contents of the memory region pointed to by host_ptr (if the memory object
is created with The behavior of calling expensive system routines, OpenCL API calls to create contexts or command-queues, or blocking OpenCL operations from the following list below, in a callback is undefined.
If an application needs to wait for completion of a routine from the above list in a callback, please use the non-blocking form of the function, and assign a completion callback to it to do the remainder of your work. Note that when a callback (or other code) enqueues commands to a command-queue, the commands are not required to begin execution until the queue is flushed. In standard usage, blocking enqueue calls serve this role by implicitly flushing the queue. Since blocking calls are not permitted in callbacks, those callbacks that enqueue commands on a command queue should either call clFlush on the queue before returning or arrange for clFlush to be called later on another thread. The user callback function may not call OpenCL APIs with the memory object for which the callback function is invoked and for such cases the behavior of OpenCL APIs is considered to be undefined. |
5.5.2. Unmapping Mapped Memory Objects
To enqueue a command to unmap a previously mapped region of a memory object, call the function
cl_int clEnqueueUnmapMemObject(
cl_command_queue command_queue,
cl_mem memobj,
void* mapped_ptr,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
-
command_queue must be a valid host command-queue.
-
memobj is a valid memory (buffer or image) object. The OpenCL context associated with command_queue and memobj must be the same.
-
mapped_ptr is the host address returned by a previous call to clEnqueueMapBuffer, or clEnqueueMapImage for memobj.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before clEnqueueUnmapMemObject can be executed. If event_wait_list is
NULL
, then clEnqueueUnmapMemObject does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
Reads or writes from the host using the pointer returned by clEnqueueMapBuffer or clEnqueueMapImage are considered to be complete.
clEnqueueMapBuffer and clEnqueueMapImage increment the mapped count of the memory object. The initial mapped count value of the memory object is zero. Multiple calls to clEnqueueMapBuffer, or clEnqueueMapImage on the same memory object will increment this mapped count by appropriate number of calls. clEnqueueUnmapMemObject decrements the mapped count of the memory object.
clEnqueueMapBuffer, and clEnqueueMapImage act as synchronization points for a region of the buffer object being mapped.
clEnqueueUnmapMemObject returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if memobj is not a valid memory object or is a pipe object.MEM_ OBJECT -
CL_INVALID_
if mapped_ptr is not a valid pointer returned by clEnqueueMapBuffer or clEnqueueMapImage for memobj.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or if event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY -
CL_INVALID_
if context associated with command_queue and memobj are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT
5.5.3. Accessing mapped regions of a memory object
This section describes the behavior of OpenCL commands that access mapped regions of a memory object.
The contents of the region of a memory object and associated memory objects
(sub-buffer objects or 1D image buffer objects that overlap this region)
mapped for writing (i.e. CL_MAP_
or CL_MAP_
is
set in map_flags argument to clEnqueueMapBuffer, or clEnqueueMapImage)
are considered to be undefined until this region is unmapped.
Multiple commands in command-queues can map a region or overlapping regions
of a memory object and associated memory objects (sub-buffer objects or 1D
image buffer objects that overlap this region) for reading (i.e. map_flags
= CL_MAP_
).
The contents of the regions of a memory object mapped for reading can also
be read by kernels and other OpenCL commands (such as clEnqueueCopyBuffer)
executing on a device(s).
Mapping (and unmapping) overlapped regions in a memory object and/or
associated memory objects (sub-buffer objects or 1D image buffer objects
that overlap this region) for writing is an error and will result in
CL_INVALID_
error returned by clEnqueueMapBuffer, or
clEnqueueMapImage.
If a memory object is currently mapped for writing, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that read from or write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
If a memory object is currently mapped for reading, the application must ensure that the memory object is unmapped before any enqueued kernels or commands that write to this memory object or any of its associated memory objects (sub-buffer or 1D image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object) begin execution; otherwise the behavior is undefined.
A memory object is considered as mapped if there are one or more active mappings for the memory object irrespective of whether the mapped regions span the entire memory object.
Accessing the contents of the memory region referred to by the mapped pointer that has been unmapped is undefined.
The mapped pointer returned by clEnqueueMapBuffer or clEnqueueMapImage can be used as the ptr argument value to clEnqueueReadBuffer, clEnqueueWriteBuffer, clEnqueueReadBufferRect, clEnqueueWriteBufferRect, clEnqueueReadImage, or clEnqueueWriteImage provided the rules described above are adhered to.
5.5.4. Migrating Memory Objects
Migrating memory objects is missing before version 1.2. |
This section describes a mechanism for assigning which device an OpenCL memory object resides. A user may wish to have more explicit control over the location of their memory objects on creation. This could be used to:
-
Ensure that an object is allocated on a specific device prior to usage.
-
Preemptively migrate an object from one device to another.
To enqueue a command to indicate which device a set of memory objects should be associated with, call the function
cl_int clEnqueueMigrateMemObjects(
cl_command_queue command_queue,
cl_uint num_mem_objects,
const cl_mem* mem_objects,
cl_mem_migration_flags flags,
cl_uint num_events_in_wait_list,
const cl_event* event_wait_list,
cl_event* event);
clEnqueueMigrateMemObjects is missing before version 1.2. |
-
command_queue is a valid host command-queue. The specified set of memory objects in mem_objects will be migrated to the OpenCL device associated with command_queue or to the host if the
CL_MIGRATE_
has been specified.MEM_ OBJECT_ HOST -
num_mem_objects is the number of memory objects specified in mem_objects.
-
mem_objects is a pointer to a list of memory objects.
-
flags is a bit-field that is used to specify migration options. The Memory Migration Flags describes the possible values for flags.
-
event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed. If event_wait_list is
NULL
, then this particular command does not wait on any event to complete. If event_wait_list isNULL
, num_events_in_wait_list must be 0. If event_wait_list is notNULL
, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns. -
event returns an event object that identifies this command and can be used to query or queue a wait for this command to complete. If event is
NULL
or the enqueue is unsuccessful, no event will be created and therefore it will not be possible to query the status of this command or to wait for this command to complete. If event_wait_list and event are notNULL
, event must not refer to an element of the event_wait_list array.
Memory Migration Flags | Description |
---|---|
Missing before version 1.2. |
This flag indicates that the specified set of memory objects are to be migrated to the host, regardless of the target command-queue. |
Missing before version 1.2. |
This flag indicates that the contents of the set of memory objects are undefined after migration. The specified set of memory objects are migrated to the device associated with command_queue without incurring the overhead of migrating their contents. |
Typically, memory objects are implicitly migrated to a device for which enqueued commands, using the memory object, are targeted. clEnqueueMigrateMemObjects allows this migration to be explicitly performed ahead of the dependent commands. This allows a user to preemptively change the association of a memory object, through regular command queue scheduling, in order to prepare for another upcoming command. This also permits an application to overlap the placement of memory objects with other unrelated operations before these memory objects are needed potentially hiding transfer latencies. Once the event, returned from clEnqueueMigrateMemObjects, has been marked CL_COMPLETE the memory objects specified in mem_objects have been successfully migrated to the device associated with command_queue. The migrated memory object shall remain resident on the device until another command is enqueued that either implicitly or explicitly migrates it away.
clEnqueueMigrateMemObjects can also be used to direct the initial placement of a memory object, after creation, possibly avoiding the initial overhead of instantiating the object on the first enqueued command to use it.
The user is responsible for managing the event dependencies, associated with this command, in order to avoid overlapping access to memory objects. Improperly specified event dependencies passed to clEnqueueMigrateMemObjects could result in undefined results.
clEnqueueMigrateMemObjects return CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if command_queue is not a valid host command-queue.COMMAND_ QUEUE -
CL_INVALID_
if the context associated with command_queue and memory objects in mem_objects are not the same or if the context associated with command_queue and events in event_wait_list are not the same.CONTEXT -
CL_INVALID_
if any of the memory objects in mem_objects is not a valid memory object.MEM_ OBJECT -
CL_INVALID_
if num_mem_objects is zero or if mem_objects isVALUE NULL
. -
CL_INVALID_
if flags is not 0 or is not any of the values described in the table above.VALUE -
CL_INVALID_
if event_wait_list isEVENT_ WAIT_ LIST NULL
and num_events_in_wait_list > 0, or event_wait_list is notNULL
and num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events. -
CL_MEM_
if there is a failure to allocate memory for the specified set of memory objects in mem_objects.OBJECT_ ALLOCATION_ FAILURE -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.5.5. Memory Object Queries
To get information that is common to all memory objects (buffer and image objects), call the function
cl_int clGetMemObjectInfo(
cl_mem memobj,
cl_mem_info param_name,
size_t param_value_size,
void* param_value,
size_t* param_value_size_ret);
-
memobj specifies the memory object being queried.
-
param_name specifies the information to query. The list of supported param_name types and the information returned in param_value by clGetMemObjectInfo is described in the Memory Object Info table.
-
param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is
NULL
, it is ignored. -
param_value_size is used to specify the size in bytes of memory pointed to by param_value. This size must be ≥ size of return type as described in the Memory Object Info table.
-
param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is
NULL
, it is ignored.
Memory Object Info | Return type | Description |
---|---|---|
|
Returns one of the following values: The value of image_desc→image_type if memobj is created with clCreateImage or clCreateImageWithProperties. |
|
|
Return the flags argument value specified when memobj is created
with clCreateBuffer, If memobj is a sub-buffer the memory access qualifiers inherited from parent buffer is also returned. |
|
|
Return actual size of the data store associated with memobj in bytes. |
|
|
If memobj is created with clCreateBuffer, clCreateBufferWithProperties,
clCreateImage, clCreateImageWithProperties, clCreateImage2D, or
clCreateImage3D, and Otherwise, if memobj is created with clCreateSubBuffer, and
memobj is created from a buffer that was created with
Otherwise, returns |
|
|
Map count. |
|
|
Return memobj reference count. |
|
|
Return context specified when memory object is created. If memobj is created using clCreateSubBuffer, the context associated with the memory object specified as the buffer argument to clCreateSubBuffer is returned. |
|
Missing before version 1.1. |
|
Return memory object from which memobj is created. This returns the memory object specified as buffer argument to clCreateSubBuffer if memobj is a subbuffer object created using clCreateSubBuffer. This returns image_desc→mem_object if memobj is an image object created using clCreateImage or clCreateImageWithProperties. Otherwise, returns |
Missing before version 1.1. |
|
Return offset if memobj is a sub-buffer object created using clCreateSubBuffer. This return 0 if memobj is not a subbuffer object. |
Missing before version 2.0. |
|
Return |
Missing before version 3.0. |
|
Return the properties argument specified in clCreateBufferWithProperties or clCreateImageWithProperties. If the properties argument specified in
clCreateBufferWithProperties or clCreateImageWithProperties
used to create memobj was not If memobj was created using clCreateBuffer,
clCreateSubBuffer, clCreateImage, clCreateImage2D, or
clCreateImage3D, or if the properties argument specified
in clCreateBufferWithProperties or
clCreateImageWithProperties was |
clGetMemObjectInfo returns CL_SUCCESS
if the function is executed
successfully.
Otherwise, it returns one of the following errors:
-
CL_INVALID_
if memobj is a not a valid memory object.MEM_ OBJECT -
CL_INVALID_
if param_name is not valid, or if size in bytes specified by param_value_size is < size of return type as described in the Memory Object Info table and param_value is notVALUE NULL
. -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the device.OF_ RESOURCES -
CL_OUT_
if there is a failure to allocate resources required by the OpenCL implementation on the host.OF_ HOST_ MEMORY
5.6. Shared Virtual Memory
Shared virtual memory is missing before version 2.0. |
Shared virtual memory (a.k.a. SVM) allows the host and kernels executing on devices to directly share complex, pointer-containing data structures such as trees and linked lists. It also eliminates the need to marshal data between the host and devices. As a result, SVM substantially simplifies OpenCL programming and may improve performance.
5.6.1. SVM sharing granularity: coarse- and fine- grained sharing
OpenCL maintains memory consistency in a coarse-grained fashion in regions of buffers. We call this coarse-grained sharing. Many platforms such as those with integrated CPU-GPU processors and ones using the SVM-related PCI-SIG IOMMU services can do better, and can support sharing at a granularity smaller than a buffer. We call this fine-grained sharing.
-
Coarse-grained sharing: Coarse-grain sharing may be used for memory and virtual pointer sharing between multiple devices as well as between the host and one or more devices. The shared memory region is a memory buffer allocated using clSVMAlloc. Memory consistency is guaranteed at synchronization points and the host can use calls to clEnqueueSVMMap and clEnqueueSVMUnmap or create a
cl_mem
buffer object using the SVM pointer and use OpenCL’s existing host API functions clEnqueueMapBuffer and clEnqueueUnmapMemObject to update regions of the buffer. What coarse-grain buffer SVM adds to OpenCL’s earlier buffer support are the ability to share virtual memory pointers and a guarantee that concurrent access to the same memory allocation from multiple kernels on a single device is valid. The coarse-grain buffer SVM provides a memory consistency model similar to the global memory consistency model described in sections 3.3.1 and 3.4.3 of the OpenCL 1.2 specification. This memory consistency applies to the regions of buffers being shared in a coarse-grained fashion. It is enforced at the synchronization points between commands enqueued to command queues in a single context with the additional consideration that multiple kernels concurrently running on the same device may safely share the data. -
Fine-grained sharing: Shared virtual memory where memory consistency is maintained at a granularity smaller than a buffer. How fine-grained SVM is used depends on whether the device supports SVM atomic operations.
-
If SVM atomic operations are supported, they provide memory consistency for loads and stores by the host and kernels executing on devices supporting SVM. This means that the host and devices can concurrently read and update the same memory. The consistency provided by SVM atomics is in addition to the consistency provided at synchronization points. There is no need for explicit calls to clEnqueueSVMMap and clEnqueueSVMUnmap or clEnqueueMapBuffer and clEnqueueUnmapMemObject on a
cl_mem
buffer object created using the SVM pointer. -
If SVM atomic operations are not supported, the host and devices can concurrently read the same memory locations and can concurrently update non-overlapping memory regions, but attempts to update the same memory locations are undefined. Memory consistency is guaranteed at synchronization points without the need for explicit calls to clEnqueueSVMMap and clEnqueueSVMUnmap or clEnqueueMapBuffer and clEnqueueUnmapMemObject on a
cl_mem
buffer object created using the SVM pointer.
-
-
There are two kinds of fine-grain sharing support. Devices may support either fine-grain buffer sharing or fine-grain system sharing.
-
Fine-grain buffer sharing provides fine-grain SVM only within buffers and is an extension of coarse-grain sharing. To support fine-grain buffer sharing in an OpenCL context, all devices in the context must support
CL_DEVICE_
.SVM_ FINE_ GRAIN_ BUFFER -
Fine-grain system sharing enables fine-grain sharing of the host’s entire virtual memory, including memory regions allocated by the system malloc API. OpenCL buffer objects are unnecessary and programmers can pass pointers allocated using malloc to OpenCL kernels.
-
As an illustration of fine-grain SVM using SVM atomic operations to maintain memory consistency, consider the following example. The host and a set of devices can simultaneously access and update a shared work-queue data structure holding work-items to be done. The host can use atomic operations to insert new work-items into the queue at the same time as the devices using similar atomic operations to remove work-items for processing.
It is the programmer’s responsibility to ensure that no host code or executing kernels attempt to access a shared memory region after that memory is freed. We require the SVM implementation to work with either 32- or 64- bit host applications subject to the following requirement: the address space size must be the same for the host and all OpenCL devices in the context.
To allocate a shared virtual memory buffer (referred to as a SVM buffer) that can be shared by the host and all devices in an OpenCL context that support shared virtual memory, call the function
void* clSVMAlloc(
cl_context context,
cl_svm_mem_flags flags,
size_t size,
cl_uint alignment);
clSVMAlloc is missing before version 2.0. |
-
context is a valid OpenCL context used to create the SVM buffer.
-
flags is a bit-field that is used to specify allocation and usage information. The SVM Memory Flags table describes the possible values for flags.
-
size is the size in bytes of the SVM buffer to be allocated.
-
alignment is the minimum alignment in bytes that is required for the newly created buffers memory region. It must be a power of two up to the largest data type supported by the OpenCL device. For the full profile, the largest data type is long16. For the embedded profile, it is long16 if the device supports 64-bit integers; otherwise it is int16. If alignment is 0, a default alignment will be used that is equal to the size of largest data type supported by the OpenCL implementation.
SVM Memory Flags | Description |
---|