cl_intel_unified_shared_memory

Name Strings

cl_intel_unified_shared_memory

Contact

Ben Ashbaugh, Intel (ben 'dot' ashbaugh 'at' intel 'dot' com)

Contributors

Ben Ashbaugh, Intel
James Brodman, Intel
Maciej Dziuban, Intel
Krzysztof Gibala, Intel
Wenju He, Intel
Kris Kang, Intel
Michael Kinsner, Intel
Michal Mrozek, Intel
Lukasz Towarek, Intel

Notice

Status

Shipping

Version

Built On: 2025-06-18
Revision: 1.1.0

Dependencies

This extension is written against the OpenCL API Specification Version 3.0.9. It extends the clSetKernelExecInfo API from OpenCL 2.0 and hence requires an OpenCL 2.0 platform; however, it is intended to be implementable by devices supporting many diverse OpenCL versions.

Overview

This extension adds "Unified Shared Memory" (USM) to OpenCL. Unified Shared Memory provides:

Easier integration into existing code bases by representing OpenCL allocations as pointers rather than handles (cl_mems), with full support for pointer arithmetic into allocations.
Fine-grain control over ownership and accessibility of OpenCL allocations, to optimally choose between performance and programmer convenience.
A simpler programming model, by automatically migrating some allocations between OpenCL devices and the host.

While Unified Shared Memory (USM) shares many features with Shared Virtual Memory (SVM), Unified Shared Memory provides a different mix of capabilities and control. Specifically:

The matrix of USM capabilities supports combinations of features beyond the SVM capability queries.
USM provides explicit control over memory placement and migration by supporting host allocations with wide visibility, device allocations for best performance, and shared allocations that may migrate between devices and the host.
USM allocations may be associated with both a device and a context. The USM allocation APIs support additional memory flags and optional properties to affect how memory is allocated and migrated.
There is no need for APIs to map or unmap USM allocations, because the contents of a host-accessible USM allocation can be accessed on the host directly, without mapping or unmapping.
An application may indicate that a kernel may access categories of USM allocations indirectly, without passing a set of all indirectly accessed USM allocations to the kernel, improving usability and reducing driver overhead for kernels that access many USM allocations.
USM adds API functions to query properties of a USM allocation and to provide memory advice for an allocation.

Unified Shared Memory and Shared Virtual Memory can and will coexist for many implementations. All implementations that support Shared Virtual Memory may support at least some types of Unified Shared Memory.

New API Functions

void*   clHostMemAllocINTEL(
            cl_context context,
            const cl_mem_properties_intel* properties,
            size_t size,
            cl_uint alignment,
            cl_int* errcode_ret);

void*   clDeviceMemAllocINTEL(
            cl_context context,
            cl_device_id device,
            const cl_mem_properties_intel* properties,
            size_t size,
            cl_uint alignment,
            cl_int* errcode_ret);

void*   clSharedMemAllocINTEL(
            cl_context context,
            cl_device_id device,
            const cl_mem_properties_intel* properties,
            size_t size,
            cl_uint alignment,
            cl_int* errcode_ret);

cl_int  clMemFreeINTEL(
            cl_context context,
            void* ptr);

cl_int  clMemBlockingFreeINTEL(
            cl_context context,
            void* ptr);

cl_int  clGetMemAllocInfoINTEL(
            cl_context context,
            const void* ptr,
            cl_mem_info_intel param_name,
            size_t param_value_size,
            void* param_value,
            size_t* param_value_size_ret);

cl_int  clSetKernelArgMemPointerINTEL(
            cl_kernel kernel,
            cl_uint arg_index,
            const void* arg_value);

cl_int  clEnqueueMemFillINTEL(
            cl_command_queue command_queue,
            void* dst_ptr,
            const void* pattern,
            size_t pattern_size,
            size_t size,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

cl_int  clEnqueueMemcpyINTEL(
            cl_command_queue command_queue,
            cl_bool blocking,
            void* dst_ptr,
            const void* src_ptr,
            size_t size,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

cl_int  clEnqueueMigrateMemINTEL(
            cl_command_queue command_queue,
            const void* ptr,
            size_t size,
            cl_mem_migration_flags flags,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

cl_int  clEnqueueMemAdviseINTEL(
            cl_command_queue command_queue,
            const void* ptr,
            size_t size,
            cl_mem_advice_intel advice,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

New API Enums

Accepted values for the param_name parameter to clGetDeviceInfo to query the Unified Shared Memory capabilities of an OpenCL device:

#define CL_DEVICE_HOST_MEM_CAPABILITIES_INTEL                   0x4190
#define CL_DEVICE_DEVICE_MEM_CAPABILITIES_INTEL                 0x4191
#define CL_DEVICE_SINGLE_DEVICE_SHARED_MEM_CAPABILITIES_INTEL   0x4192
#define CL_DEVICE_CROSS_DEVICE_SHARED_MEM_CAPABILITIES_INTEL    0x4193
#define CL_DEVICE_SHARED_SYSTEM_MEM_CAPABILITIES_INTEL          0x4194

Bitfield type and bits describing the Unified Shared Memory capabilities of an OpenCL device:

typedef cl_bitfield cl_device_unified_shared_memory_capabilities_intel;

#define CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL                   (1 << 0)
#define CL_UNIFIED_SHARED_MEMORY_ATOMIC_ACCESS_INTEL            (1 << 1)
#define CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ACCESS_INTEL        (1 << 2)
#define CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ATOMIC_ACCESS_INTEL (1 << 3)
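
As an illustration of how an application might interpret one of these capability bitfields, the following is a minimal sketch, not part of the extension. The typedefs and macros are local stand-ins that mirror the definitions above so the sketch compiles without the OpenCL headers; the helper names `append` and `usm_caps_to_string` are hypothetical. A value of zero is reported as "unsupported", following the error descriptions later in this document that treat a zero capability value as meaning the allocation type is not supported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins mirroring the definitions above; real code would take
   these from the OpenCL extension header instead. */
typedef uint64_t cl_bitfield;
typedef cl_bitfield cl_device_unified_shared_memory_capabilities_intel;

#define CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL                   (1 << 0)
#define CL_UNIFIED_SHARED_MEMORY_ATOMIC_ACCESS_INTEL            (1 << 1)
#define CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ACCESS_INTEL        (1 << 2)
#define CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ATOMIC_ACCESS_INTEL (1 << 3)

/* Hypothetical helper: append s to the NUL-terminated string in buf,
   truncating if buf (of total size buf_size) is too small. */
static void append(char* buf, size_t buf_size, const char* s)
{
    size_t used = strlen(buf);
    if (used + 1 < buf_size) {
        strncat(buf, s, buf_size - used - 1);
    }
}

/* Hypothetical helper: describe the capability bits set in caps as a
   comma-separated list written into buf. Returns buf. */
static const char* usm_caps_to_string(
    cl_device_unified_shared_memory_capabilities_intel caps,
    char* buf, size_t buf_size)
{
    static const struct { cl_bitfield bit; const char* name; } names[] = {
        { CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL,                   "access" },
        { CL_UNIFIED_SHARED_MEMORY_ATOMIC_ACCESS_INTEL,            "atomic access" },
        { CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ACCESS_INTEL,        "concurrent access" },
        { CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ATOMIC_ACCESS_INTEL, "concurrent atomic access" },
    };
    size_t i;

    assert(buf != NULL && buf_size > 0);
    buf[0] = '\0';
    if (caps == 0) {
        append(buf, buf_size, "unsupported");
        return buf;
    }
    for (i = 0; i < sizeof(names) / sizeof(names[0]); ++i) {
        if (caps & names[i].bit) {
            if (buf[0] != '\0') append(buf, buf_size, ", ");
            append(buf, buf_size, names[i].name);
        }
    }
    return buf;
}
```

In an application, the caps value would come from a clGetDeviceInfo query using one of the CL_DEVICE_*_MEM_CAPABILITIES_INTEL values listed above.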

Type to describe optional Unified Shared Memory allocation properties:

typedef cl_properties cl_mem_properties_intel;

Enumerant value requesting optional allocation properties for a Unified Shared Memory allocation:

#define CL_MEM_ALLOC_FLAGS_INTEL        0x4195

Bitfield type and bits describing optional allocation properties for a Unified Shared Memory allocation:

typedef cl_bitfield cl_mem_alloc_flags_intel;

#define CL_MEM_ALLOC_WRITE_COMBINED_INTEL               (1 << 0)
#define CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL     (1 << 1)
#define CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL       (1 << 2)

Enumeration type and values for the param_name parameter to clGetMemAllocInfoINTEL to query information about a Unified Shared Memory allocation. Optional allocation properties may also be queried using clGetMemAllocInfoINTEL:

typedef cl_uint cl_mem_info_intel;

#define CL_MEM_ALLOC_TYPE_INTEL         0x419A
#define CL_MEM_ALLOC_BASE_PTR_INTEL     0x419B
#define CL_MEM_ALLOC_SIZE_INTEL         0x419C
#define CL_MEM_ALLOC_DEVICE_INTEL       0x419D
/* CL_MEM_ALLOC_FLAGS_INTEL - defined above */

Enumeration type and values describing the type of Unified Shared Memory allocation. Returned by clGetMemAllocInfoINTEL when param_name is CL_MEM_ALLOC_TYPE_INTEL:

typedef cl_uint cl_unified_shared_memory_type_intel;

#define CL_MEM_TYPE_UNKNOWN_INTEL       0x4196
#define CL_MEM_TYPE_HOST_INTEL          0x4197
#define CL_MEM_TYPE_DEVICE_INTEL        0x4198
#define CL_MEM_TYPE_SHARED_INTEL        0x4199

Enumeration type and values for the advice parameter to clEnqueueMemAdviseINTEL to provide memory advice for a Unified Shared Memory allocation:

typedef cl_uint cl_mem_advice_intel;
/* Enum values 0x4208-0x420F are reserved for future memory advices. */

Accepted values for the param_name parameter to clSetKernelExecInfo to specify that the kernel may indirectly access Unified Shared Memory allocations of the specified type:

#define CL_KERNEL_EXEC_INFO_INDIRECT_HOST_ACCESS_INTEL      0x4200
#define CL_KERNEL_EXEC_INFO_INDIRECT_DEVICE_ACCESS_INTEL    0x4201
#define CL_KERNEL_EXEC_INFO_INDIRECT_SHARED_ACCESS_INTEL    0x4202

Accepted value for the param_name parameter to clSetKernelExecInfo to specify a set of Unified Shared Memory allocations that the kernel may indirectly access:

#define CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL                  0x4203

New return values from clGetEventInfo when param_name is CL_EVENT_COMMAND_TYPE:

#define CL_COMMAND_MEMFILL_INTEL        0x4204
#define CL_COMMAND_MEMCPY_INTEL         0x4205
#define CL_COMMAND_MIGRATEMEM_INTEL     0x4206
#define CL_COMMAND_MEMADVISE_INTEL      0x4207

Modifications to the OpenCL API Specification

Section 4.2 - Querying Devices:

Add to Table 5 - List of supported param_names by clGetDeviceInfo:

Table 5. List of supported param_names by clGetDeviceInfo
Device Info	Return Type	Description
`CL_DEVICE_HOST_MEM_CAPABILITIES_INTEL` `CL_DEVICE_DEVICE_MEM_CAPABILITIES_INTEL` `CL_DEVICE_SINGLE_DEVICE_SHARED_MEM_CAPABILITIES_INTEL` `CL_DEVICE_CROSS_DEVICE_SHARED_MEM_CAPABILITIES_INTEL` `CL_DEVICE_SHARED_SYSTEM_MEM_CAPABILITIES_INTEL`	`cl_device_unified_shared_memory_capabilities_intel`	Describes the ability for a device to access Unified Shared Memory allocations of the specified type. The host memory access capabilities apply to any host allocation. The device memory access capabilities apply to any device allocation associated with this device. The single device shared memory access capabilities apply to any shared allocation associated with this device. The cross-device shared memory access capabilities apply to any shared allocation associated with this device, or to any shared memory allocation on another device that also supports the same cross-device shared memory access capability. The shared system memory access capabilities apply to any allocations made by a system allocator, such as `malloc` or `new`. The access capabilities are encoded as bits in a bitfield. Supported capabilities are: `CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL`: The device may access (read or write) Unified Shared Memory allocations of this type. `CL_UNIFIED_SHARED_MEMORY_ATOMIC_ACCESS_INTEL`: The device may perform atomic operations on Unified Shared Memory allocations of this type. `CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ACCESS_INTEL`: The device supports concurrent access to Unified Shared Memory allocations of this type. Concurrent access may be from the host, or from other OpenCL devices, where applicable. `CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ATOMIC_ACCESS_INTEL`: The device supports concurrent atomic access to Unified Shared Memory allocations of this type.

New Section 5.X - Unified Shared Memory

This section describes Unified Shared Memory, abbreviated USM. Unified Shared Memory allocations are represented as pointers in the host application, rather than as handles (specifically, cl_mems). Unified Shared Memory additionally provides fine-grain control over placement and accessibility of an allocation, allowing many tradeoffs between programmer convenience and performance.

Three types of Unified Shared Memory allocations are supported. The type describes the ownership of the allocation:

Host allocations are owned by the host and are intended to be allocated out of system memory. Host allocations are accessible by the host and one or more devices. The same pointer to a host allocation may be used on the host and all supported devices; they have address equivalence. Host allocations are not expected to migrate between system memory and device local memory. Host allocations trade off wide accessibility and transfer benefits for potentially higher per-access costs, such as over PCI Express.
Device allocations are owned by a specific device and are intended to be allocated out of device local memory, if present. Device allocations generally trade off access limitations for higher performance. With very few exceptions, device allocations may only be accessed by the specific device they are allocated on, or copied to a host or another device allocation. The same pointer to a device allocation may be used on any supported device.
Shared allocations share ownership and are intended to migrate between the host and one or more devices. Shared allocations are accessible by at least the host and an associated device. Shared allocations may be accessed by other devices in some cases. Shared allocations trade off transfer costs for per-access benefits. The same pointer to a shared allocation may be used on the host and all supported devices.

A Shared System allocation is a sub-class of a Shared allocation, where the memory is allocated by a system allocator - such as malloc or new - rather than by a USM allocation API. Shared system allocations have no associated device - they are inherently cross-device. Like other shared allocations, shared system allocations are intended to migrate between the host and supported devices, and the same pointer to a shared system allocation may be used on the host and all supported devices.

Table 1. Summary of Unified Shared Memory Capabilities
Name	Initial Location	Accessible By	Accessible	Migratable To	Migratable
Host	Host	Host	Yes	Host	N/A
Host	Host	Any Device	Yes (perhaps over a bus, such as PCIe)	Device	No
Device	Specific Device	Host	No	Host	No
Device	Specific Device	Specific Device	Yes	Device	N/A
Device	Specific Device	Another Device	Optional	Another Device	No
Shared	Host, Specific Device, or Unspecified	Host	Yes	Host	Yes
Shared	Host, Specific Device, or Unspecified	Specific Device	Yes	Device	Yes
Shared	Host, Specific Device, or Unspecified	Another Device	Optional	Another Device	Optional
Shared System	Host	Host	Yes	Host	Yes
Shared System	Host	Device	Yes	Device	Yes
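
The accessibility columns of Table 1 can be restated as a small lookup, shown here as a hedged sketch rather than normative behavior. The enums `usm_accessor` and `usm_access` and the function `usm_table1_access` are hypothetical names introduced for illustration; the type values are local stand-ins that mirror the CL_MEM_TYPE_*_INTEL enums defined earlier. Shared system allocations have no allocation type enum of their own and are not covered. The per-device capability queries remain the authoritative source for what a given device can actually do.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins mirroring the enums defined earlier in this document. */
typedef uint32_t cl_uint;
typedef cl_uint cl_unified_shared_memory_type_intel;

#define CL_MEM_TYPE_UNKNOWN_INTEL       0x4196
#define CL_MEM_TYPE_HOST_INTEL          0x4197
#define CL_MEM_TYPE_DEVICE_INTEL        0x4198
#define CL_MEM_TYPE_SHARED_INTEL        0x4199

/* Hypothetical: who is attempting to access the allocation. */
typedef enum {
    USM_FROM_HOST,
    USM_FROM_ASSOCIATED_DEVICE,  /* the "Specific Device" rows of Table 1 */
    USM_FROM_OTHER_DEVICE        /* the "Another Device" rows of Table 1 */
} usm_accessor;

/* Hypothetical: the answer given by the table. */
typedef enum {
    USM_ACCESS_NO,
    USM_ACCESS_YES,
    USM_ACCESS_OPTIONAL,
    USM_ACCESS_UNSPECIFIED       /* the table has no row for this type */
} usm_access;

/* Look up the accessibility column of Table 1 for one allocation type. */
static usm_access usm_table1_access(
    cl_unified_shared_memory_type_intel type, usm_accessor who)
{
    assert(who == USM_FROM_HOST ||
           who == USM_FROM_ASSOCIATED_DEVICE ||
           who == USM_FROM_OTHER_DEVICE);

    switch (type) {
    case CL_MEM_TYPE_HOST_INTEL:
        /* Accessible by the host and by any device. */
        return USM_ACCESS_YES;
    case CL_MEM_TYPE_DEVICE_INTEL:
        if (who == USM_FROM_HOST) return USM_ACCESS_NO;
        return who == USM_FROM_ASSOCIATED_DEVICE ? USM_ACCESS_YES
                                                 : USM_ACCESS_OPTIONAL;
    case CL_MEM_TYPE_SHARED_INTEL:
        return who == USM_FROM_OTHER_DEVICE ? USM_ACCESS_OPTIONAL
                                            : USM_ACCESS_YES;
    default:
        return USM_ACCESS_UNSPECIFIED;
    }
}
```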

OpenCL devices may support different capabilities for each type of Unified Shared Memory allocation. Supported capabilities are:

CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL: The device may access (read or write) Unified Shared Memory allocations of this type.
CL_UNIFIED_SHARED_MEMORY_ATOMIC_ACCESS_INTEL: The device may perform atomic operations on Unified Shared Memory allocations of this type.
CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ACCESS_INTEL: The device supports concurrent access to Unified Shared Memory allocations of this type. Concurrent access may be from the host, or from other OpenCL devices, where applicable.
CL_UNIFIED_SHARED_MEMORY_CONCURRENT_ATOMIC_ACCESS_INTEL: The device supports concurrent atomic access to Unified Shared Memory allocations of this type.

Some devices may oversubscribe some shared allocations. When and how such oversubscription occurs, including which allocations are evicted when the working set changes, are considered implementation details.

The minimum set of capabilities is:

Table 2. Minimum Unified Shared Memory Capabilities
Allocation Type	Access	Atomic Access	Concurrent Access	Concurrent Atomic Access
Host	Optional	Optional	Optional	Optional
Device	Required	Optional	Optional	Optional
Shared	Optional	Optional	Optional	Optional
Shared (Cross-Device)	Optional	Optional	Optional	Optional
Shared System (Cross-Device)	Optional	Optional	Optional	Optional
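
Because only the basic access capability for device allocations is required, an application generally has to check the other capabilities before relying on them. The following is a minimal sketch of such checks, assuming the five capability values have already been queried with clGetDeviceInfo. The struct `usm_device_caps` and the helper functions are hypothetical names; the typedef and macro are local stand-ins so the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring definitions elsewhere in this document. */
typedef uint64_t cl_device_unified_shared_memory_capabilities_intel;
#define CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL (1 << 0)

/* Hypothetical container for the five capability queries of one device. */
typedef struct {
    cl_device_unified_shared_memory_capabilities_intel host;
    cl_device_unified_shared_memory_capabilities_intel device;
    cl_device_unified_shared_memory_capabilities_intel single_device_shared;
    cl_device_unified_shared_memory_capabilities_intel cross_device_shared;
    cl_device_unified_shared_memory_capabilities_intel shared_system;
} usm_device_caps;

/* Table 2: basic access to device allocations is the only required bit. */
static int usm_meets_minimum(const usm_device_caps* c)
{
    assert(c != NULL);
    return (c->device & CL_UNIFIED_SHARED_MEMORY_ACCESS_INTEL) != 0;
}

/* A zero host capability value means host allocations are unsupported. */
static int usm_supports_host_allocs(const usm_device_caps* c)
{
    assert(c != NULL);
    return c->host != 0;
}

/* Shared allocations are unsupported only if both the single-device and
   the cross-device shared capability values are zero. */
static int usm_supports_shared_allocs(const usm_device_caps* c)
{
    assert(c != NULL);
    return c->single_device_shared != 0 || c->cross_device_shared != 0;
}
```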

Allocating and Freeing Unified Shared Memory

Host Allocations

The function

void*   clHostMemAllocINTEL(
            cl_context context,
            const cl_mem_properties_intel* properties,
            size_t size,
            cl_uint alignment,
            cl_int* errcode_ret);

allocates host Unified Shared Memory.

context is a valid OpenCL context used to allocate the host memory.

properties is an optional list of allocation properties and their corresponding values. The list is terminated with the special property 0. If no allocation properties are required, properties may be NULL. Please refer to the table below for valid property values and their description.

size is the size in bytes of the requested host allocation.

alignment is the minimum alignment in bytes for the requested host allocation. It must be a power of two and must be equal to or smaller than the size of the largest data type supported by any OpenCL device in context. If alignment is 0, a default alignment will be used that is equal to the size of the largest data type supported by any OpenCL device in context.

errcode_ret may return an appropriate error code. If errcode_ret is NULL then no error code will be returned.

clHostMemAllocINTEL will return a valid non-NULL address and CL_SUCCESS will be returned in errcode_ret if the host Unified Shared Memory is allocated successfully. Otherwise, NULL will be returned, and errcode_ret will be set to one of the following error values:

CL_INVALID_CONTEXT if context is not a valid context.
CL_INVALID_OPERATION if CL_DEVICE_HOST_MEM_CAPABILITIES_INTEL is zero for all devices in context, indicating that no devices in context support host Unified Shared Memory allocations.
CL_INVALID_VALUE if alignment is neither zero nor a power of two.
CL_INVALID_VALUE if alignment is greater than the size of the largest data type supported by any OpenCL device in context that supports host Unified Shared Memory allocations.
CL_INVALID_PROPERTY if a memory property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once.
CL_INVALID_PROPERTY if either the CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL or CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL flags are specified.
CL_INVALID_BUFFER_SIZE if size is zero or greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE for any OpenCL device in context that supports host Unified Shared Memory allocations.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.
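
The alignment and size rules above can be sketched as a pre-check an application might run before calling an allocation function. This is an illustrative sketch only: the function names `check_usm_alloc_args` and `effective_alignment` are hypothetical, the error-code macros are local stand-ins for the standard OpenCL values, and the order in which the checks are made is an assumption, since this document does not specify which error takes precedence. The caller is assumed to supply `max_type_size` (the size of the largest data type supported by the relevant device or devices) and `max_alloc_size` (the applicable CL_DEVICE_MAX_MEM_ALLOC_SIZE).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the standard OpenCL error codes. */
#define CL_SUCCESS               0
#define CL_INVALID_VALUE         -30
#define CL_INVALID_BUFFER_SIZE   -61

typedef uint32_t cl_uint;

/* Hypothetical pre-check mirroring the alignment and size error rules. */
static int check_usm_alloc_args(size_t size, cl_uint alignment,
                                size_t max_type_size, uint64_t max_alloc_size)
{
    assert(max_type_size != 0);

    /* alignment must be zero or a power of two */
    if (alignment != 0 && (alignment & (alignment - 1)) != 0) {
        return CL_INVALID_VALUE;
    }
    /* alignment must not exceed the size of the largest supported type */
    if (alignment > max_type_size) {
        return CL_INVALID_VALUE;
    }
    /* size must be non-zero and within the maximum allocation size */
    if (size == 0 || size > max_alloc_size) {
        return CL_INVALID_BUFFER_SIZE;
    }
    return CL_SUCCESS;
}

/* An alignment of zero selects the default alignment, which equals the
   size of the largest supported data type. */
static size_t effective_alignment(cl_uint alignment, size_t max_type_size)
{
    return alignment != 0 ? alignment : max_type_size;
}
```

For example, with a largest data type of 128 bytes (assumed here for illustration), an alignment of 64 is accepted while alignments of 48 or 256 are rejected.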

Device Allocations

The function

void*   clDeviceMemAllocINTEL(
            cl_context context,
            cl_device_id device,
            const cl_mem_properties_intel* properties,
            size_t size,
            cl_uint alignment,
            cl_int* errcode_ret);

allocates Unified Shared Memory specific to an OpenCL device.

context is a valid OpenCL context used to allocate the device memory.

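device is a valid OpenCL device ID to associate with the allocation.

properties is an optional list of allocation properties and their corresponding values. The list is terminated with the special property 0. If no allocation properties are required, properties may be NULL. Valid properties are listed in Table 3 (List of Supported cl_mem_properties_intel Properties).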

size is the size in bytes of the requested device allocation.

alignment is the minimum alignment in bytes for the requested device allocation. It must be a power of two and must be equal to or smaller than the size of the largest data type supported by device. If alignment is 0, a default alignment will be used that is equal to the size of the largest data type supported by device.

errcode_ret may return an appropriate error code. If errcode_ret is NULL then no error code will be returned.

clDeviceMemAllocINTEL will return a valid non-NULL address and CL_SUCCESS will be returned in errcode_ret if the device Unified Shared Memory is allocated successfully. Otherwise, NULL will be returned, and errcode_ret will be set to one of the following error values:

CL_INVALID_CONTEXT if context is not a valid context.
CL_INVALID_DEVICE if device is not a valid device or is not associated with context.
CL_INVALID_OPERATION if CL_DEVICE_DEVICE_MEM_CAPABILITIES_INTEL is zero for device, indicating that device does not support device Unified Shared Memory allocations.
CL_INVALID_VALUE if alignment is neither zero nor a power of two.
CL_INVALID_VALUE if alignment is greater than the size of the largest data type supported by device.
CL_INVALID_PROPERTY if a memory property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once.
CL_INVALID_PROPERTY if either the CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL or CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL flags are specified.
CL_INVALID_BUFFER_SIZE if size is zero or greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE for device.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

Shared Allocations

The function

void*   clSharedMemAllocINTEL(
            cl_context context,
            cl_device_id device,
            const cl_mem_properties_intel* properties,
            size_t size,
            cl_uint alignment,
            cl_int* errcode_ret);

allocates Unified Shared Memory with shared ownership between the host and the specified OpenCL device. If the specified OpenCL device supports cross-device access capabilities, the allocation is also accessible by other OpenCL devices in the context that have cross-device access capabilities.

context is a valid OpenCL context used to allocate the Unified Shared Memory.

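device is an optional OpenCL device ID to associate with the allocation. If device is NULL then the allocation is not associated with any device. Allocations with no associated device are accessible by the host and OpenCL devices in the context that have cross-device access capabilities.

properties is an optional list of allocation properties and their corresponding values. The list is terminated with the special property 0. If no allocation properties are required, properties may be NULL. Valid properties are listed in Table 3 (List of Supported cl_mem_properties_intel Properties).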

size is the size in bytes of the requested shared allocation.

alignment is the minimum alignment in bytes for the requested shared allocation. It must be a power of two and must be equal to or smaller than the size of the largest data type supported by device. If alignment is 0, a default alignment will be used that is equal to the size of the largest data type supported by device. If device is NULL, alignment must be a power of two equal to or smaller than the size of the largest data type supported by any OpenCL device in context, and the default alignment will be equal to the size of the largest data type supported by any OpenCL device in context.

errcode_ret may return an appropriate error code. If errcode_ret is NULL then no error code will be returned.

clSharedMemAllocINTEL will return a valid non-NULL address and CL_SUCCESS will be returned in errcode_ret if the shared Unified Shared Memory is allocated successfully. Otherwise, NULL will be returned, and errcode_ret will be set to one of the following error values:

CL_INVALID_CONTEXT if context is not a valid context.
CL_INVALID_DEVICE if device is not NULL and is either not a valid device or is not associated with context.
CL_INVALID_OPERATION if device is not NULL and CL_DEVICE_SINGLE_DEVICE_SHARED_MEM_CAPABILITIES_INTEL and CL_DEVICE_CROSS_DEVICE_SHARED_MEM_CAPABILITIES_INTEL are both zero, indicating that device does not support shared Unified Shared Memory allocations, or if device is NULL and no devices in context support shared Unified Shared Memory allocations.
CL_INVALID_VALUE if alignment is neither zero nor a power of two.
CL_INVALID_VALUE if device is not NULL and alignment is greater than the size of the largest data type supported by device, or if device is NULL and alignment is greater than the size of the largest data type supported by any OpenCL device in context that supports shared Unified Shared Memory allocations.
CL_INVALID_PROPERTY if a memory property name in properties is not a supported property name, if the value specified for a supported property name is not valid, or if the same property name is specified more than once.
CL_INVALID_PROPERTY if both CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL and CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL flags are specified.
CL_INVALID_BUFFER_SIZE if size is zero, or if device is not NULL and size is greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE for device, or if device is NULL and size is greater than CL_DEVICE_MAX_MEM_ALLOC_SIZE for any device in context that supports shared Unified Shared Memory allocations.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

Freeing Allocations

The functions

cl_int  clMemFreeINTEL(
            cl_context context,
            void* ptr);

cl_int  clMemBlockingFreeINTEL(
            cl_context context,
            void* ptr);

free a Unified Shared Memory allocation.

context is a valid OpenCL context used to free the Unified Shared Memory allocation.

ptr is the Unified Shared Memory allocation to free. It must be a value returned by clHostMemAllocINTEL, clDeviceMemAllocINTEL, or clSharedMemAllocINTEL, or a NULL pointer. If ptr is NULL then no action occurs.

Note that clMemFreeINTEL may not wait for previously enqueued commands that may be using ptr to finish before freeing ptr. It is the responsibility of the application to ensure that enqueued commands that use ptr are complete before freeing ptr. Applications should take particular care when freeing memory allocations if kernels may access memory indirectly, since a kernel with indirect memory access counts as using all memory allocations of the specified type or types. To wait for previously enqueued commands that may be using ptr to finish before freeing ptr, use the clMemBlockingFreeINTEL function instead.

clMemFreeINTEL and clMemBlockingFreeINTEL will return CL_SUCCESS if the function executes successfully. Otherwise, they will return one of the following error values:

CL_INVALID_CONTEXT if context is not a valid context.
CL_INVALID_VALUE if ptr is not a value returned by clHostMemAllocINTEL, clDeviceMemAllocINTEL, clSharedMemAllocINTEL, or a NULL pointer.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

Controlling Allocations

The table below describes allocation properties that may be passed to control allocation behavior.

Table 3. List of Supported `cl_mem_properties_intel` Properties
Property	Property Type	Description
`CL_MEM_ALLOC_FLAGS_INTEL`	cl_mem_alloc_flags_intel	Flags specifying allocation and usage information. This is a bitfield type that may be set to a combination of the following values: `CL_MEM_ALLOC_WRITE_COMBINED_INTEL`: Request write combined (WC) memory. Write combined memory may improve performance in some cases, however write combined memory must be used with care since it may hurt performance in other cases or use different coherency protocols than non-write combined memory. `CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL`: Request the implementation to optimize for first access being done by the device. This flag is valid only for clSharedMemAllocINTEL. This flag does not affect functionality and is purely a performance hint. `CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL`: Request the implementation to optimize for first access being done by the host. This flag is valid only for clSharedMemAllocINTEL. This flag does not affect functionality and is purely a performance hint. `CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL` and `CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL` are mutually exclusive.
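
To make the shape of a properties list concrete, the following sketch builds zero-terminated name/value lists and checks them against the rules stated in this document: only supported property names, no repeated names, placement flags only for shared allocations, and never both placement flags together. It is a hedged illustration, not an implementation of the extension. The names `usm_kind` and `check_usm_properties` are hypothetical, the typedefs and macros are local stand-ins, and treating unknown flag bits as invalid is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles without the OpenCL headers. */
typedef uint64_t cl_mem_properties_intel;
typedef uint64_t cl_mem_alloc_flags_intel;

#define CL_SUCCESS                                   0
#define CL_INVALID_PROPERTY                          -64
#define CL_MEM_ALLOC_FLAGS_INTEL                     0x4195
#define CL_MEM_ALLOC_WRITE_COMBINED_INTEL            (1 << 0)
#define CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL  (1 << 1)
#define CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL    (1 << 2)

/* Hypothetical: which allocation function the list is intended for. */
typedef enum { USM_KIND_HOST, USM_KIND_DEVICE, USM_KIND_SHARED } usm_kind;

/* Hypothetical pre-check of a zero-terminated name/value property list. */
static int check_usm_properties(usm_kind kind,
                                const cl_mem_properties_intel* props)
{
    const cl_mem_alloc_flags_intel placement =
        CL_MEM_ALLOC_INITIAL_PLACEMENT_DEVICE_INTEL |
        CL_MEM_ALLOC_INITIAL_PLACEMENT_HOST_INTEL;
    const cl_mem_alloc_flags_intel known =
        CL_MEM_ALLOC_WRITE_COMBINED_INTEL | placement;
    int seen_flags = 0;

    assert(kind == USM_KIND_HOST || kind == USM_KIND_DEVICE ||
           kind == USM_KIND_SHARED);

    if (props == NULL) return CL_SUCCESS;       /* no properties requested */

    for (; props[0] != 0; props += 2) {         /* pairs: name, value */
        cl_mem_alloc_flags_intel flags;

        /* unsupported property name, or the same name given twice */
        if (props[0] != CL_MEM_ALLOC_FLAGS_INTEL || seen_flags) {
            return CL_INVALID_PROPERTY;
        }
        seen_flags = 1;
        flags = props[1];

        /* assumption: bits not defined by this document are invalid */
        if (flags & ~known) return CL_INVALID_PROPERTY;
        /* placement hints are valid only for shared allocations */
        if (kind != USM_KIND_SHARED && (flags & placement)) {
            return CL_INVALID_PROPERTY;
        }
        /* the two placement hints are mutually exclusive */
        if ((flags & placement) == placement) return CL_INVALID_PROPERTY;
    }
    return CL_SUCCESS;
}
```

A list that passes this check would be handed to the allocation function as its properties argument, for example a list requesting initial device placement for a shared allocation.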

Unified Shared Memory Queries

The function

cl_int  clGetMemAllocInfoINTEL(
            cl_context context,
            const void* ptr,
            cl_mem_info_intel param_name,
            size_t param_value_size,
            void* param_value,
            size_t* param_value_size_ret);

queries information about a Unified Shared Memory allocation.

context is a valid OpenCL context to query for information about the Unified Shared Memory allocation.

ptr is a pointer into a Unified Shared Memory allocation to query. ptr need not be a value returned by clHostMemAllocINTEL, clDeviceMemAllocINTEL, or clSharedMemAllocINTEL, but the query may be faster if it is.

param_name specifies the information to query. The list of supported param_name values and the information returned in param_value is described in Table 4 (List of supported param_names by clGetMemAllocInfoINTEL).

param_value is a pointer to memory where the appropriate result being queried is returned. If param_value is NULL, it is ignored.

param_value_size is used to specify the size in bytes of memory pointed to by param_value. This size must be greater than or equal to the size of the return type as described in Table 4 (List of supported param_names by clGetMemAllocInfoINTEL). If param_value is NULL, it is ignored.

param_value_size_ret returns the actual size in bytes of data being queried by param_name. If param_value_size_ret is NULL, it is ignored.

clGetMemAllocInfoINTEL returns CL_SUCCESS if the function is executed successfully. Otherwise, it will return one of the following error values:

CL_INVALID_CONTEXT if context is not a valid context.
CL_INVALID_VALUE if param_name is not a valid Unified Shared Memory allocation query.
CL_INVALID_VALUE if param_value is not NULL and param_value_size is smaller than the size of the query return type.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

Table 4. List of supported param_names by clGetMemAllocInfoINTEL
cl_mem_info_intel	Return type	Info. returned in param_value
`CL_MEM_ALLOC_TYPE_INTEL`	cl_unified_shared_memory_type_intel	Returns the type of the Unified Shared Memory allocation. Returns `CL_MEM_TYPE_HOST_INTEL` for allocations made by clHostMemAllocINTEL. Returns `CL_MEM_TYPE_DEVICE_INTEL` for allocations made by clDeviceMemAllocINTEL. Returns `CL_MEM_TYPE_SHARED_INTEL` for allocations made by clSharedMemAllocINTEL. Returns `CL_MEM_TYPE_UNKNOWN_INTEL` if the type of the Unified Shared Memory allocation cannot be determined or if ptr does not point into a Unified Shared Memory allocation.
`CL_MEM_ALLOC_BASE_PTR_INTEL`	void*	Returns the base address of the Unified Shared Memory allocation. Returns `NULL` for `CL_MEM_TYPE_UNKNOWN_INTEL` allocations.
`CL_MEM_ALLOC_SIZE_INTEL`	size_t	Returns the size in bytes of the Unified Shared Memory allocation. Returns `0` for `CL_MEM_TYPE_UNKNOWN_INTEL` allocations.
`CL_MEM_ALLOC_DEVICE_INTEL`	cl_device_id	Returns the device associated with the Unified Shared Memory allocation. Returns `NULL` for `CL_MEM_TYPE_HOST_INTEL` allocations, for `CL_MEM_TYPE_SHARED_INTEL` allocations with no associated device, and for `CL_MEM_TYPE_UNKNOWN_INTEL` allocations.
`CL_MEM_ALLOC_FLAGS_INTEL`	cl_mem_alloc_flags_intel	Returns allocation flags for the Unified Shared Memory allocation. Returns `0` if no allocation flags were specified for the Unified Shared Memory allocation and for `CL_MEM_TYPE_UNKNOWN_INTEL` allocations.
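
The param_value, param_value_size, and param_value_size_ret conventions described for clGetMemAllocInfoINTEL can be sketched as follows. This is a model of the documented argument handling only, not the behavior of any particular driver. The names `usm_query_return_size` and `usm_write_query_result` are hypothetical, and the typedefs and macros are local stand-ins that mirror definitions elsewhere in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins so the sketch compiles without the OpenCL headers. */
typedef uint32_t cl_uint;
typedef cl_uint  cl_mem_info_intel;
typedef cl_uint  cl_unified_shared_memory_type_intel;
typedef uint64_t cl_mem_alloc_flags_intel;
typedef struct _cl_device_id* cl_device_id;

#define CL_SUCCESS                   0
#define CL_INVALID_VALUE             -30
#define CL_MEM_ALLOC_FLAGS_INTEL     0x4195
#define CL_MEM_ALLOC_TYPE_INTEL      0x419A
#define CL_MEM_ALLOC_BASE_PTR_INTEL  0x419B
#define CL_MEM_ALLOC_SIZE_INTEL      0x419C
#define CL_MEM_ALLOC_DEVICE_INTEL    0x419D

/* Size of the return type listed in Table 4 for each query, or 0 if
   param_name is not a query listed there. */
static size_t usm_query_return_size(cl_mem_info_intel param_name)
{
    switch (param_name) {
    case CL_MEM_ALLOC_TYPE_INTEL:     return sizeof(cl_unified_shared_memory_type_intel);
    case CL_MEM_ALLOC_BASE_PTR_INTEL: return sizeof(void*);
    case CL_MEM_ALLOC_SIZE_INTEL:     return sizeof(size_t);
    case CL_MEM_ALLOC_DEVICE_INTEL:   return sizeof(cl_device_id);
    case CL_MEM_ALLOC_FLAGS_INTEL:    return sizeof(cl_mem_alloc_flags_intel);
    default:                          return 0;
    }
}

/* Model of how a query result reaches the caller:
   - param_value == NULL: nothing is copied, param_value_size is ignored;
   - a too-small param_value_size is an error;
   - param_value_size_ret, if not NULL, receives the result size. */
static int usm_write_query_result(const void* result, size_t result_size,
                                  size_t param_value_size, void* param_value,
                                  size_t* param_value_size_ret)
{
    assert(result != NULL && result_size > 0);

    if (param_value != NULL) {
        if (param_value_size < result_size) return CL_INVALID_VALUE;
        memcpy(param_value, result, result_size);
    }
    if (param_value_size_ret != NULL) {
        *param_value_size_ret = result_size;
    }
    return CL_SUCCESS;
}
```

Passing NULL for param_value together with a non-NULL param_value_size_ret gives the usual OpenCL "query the size first" pattern, although every query in Table 4 has a fixed-size result.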

Using Unified Shared Memory with Kernels

The function

cl_int  clSetKernelArgMemPointerINTEL(
            cl_kernel kernel,
            cl_uint arg_index,
            const void* arg_value);

is used to set a pointer into a Unified Shared Memory allocation as an argument to a kernel.

kernel is a valid kernel object.

arg_index is the argument index to set. Arguments to the kernel are referred to by indices that go from 0 for the leftmost argument to n - 1, where n is the total number of arguments declared by a kernel.

arg_value is the pointer value that should be used as the argument specified by arg_index. The pointer value will be used as the argument by all API calls that enqueue a kernel until the argument value is set to a different pointer value by a subsequent call. A pointer may only be set as an argument value for an argument declared to be a pointer to global or constant memory.

The definition of a valid pointer value was changed in extension version 1.1.0:

For extension versions prior to version 1.1.0: For devices supporting shared system allocations, any pointer value is valid. Otherwise, the pointer value must be NULL or must point into a Unified Shared Memory allocation returned by clHostMemAllocINTEL, clDeviceMemAllocINTEL, or clSharedMemAllocINTEL.
For extension versions 1.1.0 and newer: For all devices, any pointer value is valid and may be set as an argument to a kernel.

In this definition, a valid pointer value means that the function will not return an error. It still may not be valid to dereference the pointer inside of a kernel if the memory that the pointer points to is not accessible on the device.
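
The version-dependent rule above can be summarized as a small predicate. This is only a sketch of the rule as stated in this section: the ptr_info type, the arg_pointer_accepted function, and the way the extension version and device capability are passed in are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a pointer value, for illustration only. */
typedef struct {
    bool is_null;            /* the pointer value is NULL */
    bool in_usm_allocation;  /* points into a host, device, or shared USM allocation */
} ptr_info;

/* Returns true if setting the pointer as a kernel argument is not an error.
 * This says nothing about whether the kernel may dereference the pointer. */
bool arg_pointer_accepted(unsigned ext_major, unsigned ext_minor,
                          bool device_supports_shared_system_allocations,
                          ptr_info p)
{
    bool is_1_1_or_newer = (ext_major > 1) || (ext_major == 1 && ext_minor >= 1);
    if (is_1_1_or_newer)
        return true;  /* any pointer value is valid for all devices */
    if (device_supports_shared_system_allocations)
        return true;  /* pre-1.1.0: any pointer value is valid for these devices */
    return p.is_null || p.in_usm_allocation;
}
```

Under this sketch, an arbitrary host pointer is rejected only for an extension version older than 1.1.0 on a device without shared system allocation support.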

clSetKernelArgMemPointerINTEL returns CL_SUCCESS if the function is executed successfully. Otherwise, it will return one of the following errors:

CL_INVALID_KERNEL if kernel is not a valid kernel object.
CL_INVALID_ARG_INDEX if arg_index is not a valid argument index.
CL_INVALID_ARG_VALUE if arg_value is not a valid argument value.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

In addition to direct use of a Unified Shared Memory allocation as a kernel argument, Unified Shared Memory allocations may be accessed by kernels indirectly. The new param_name values described below may be used with the existing clSetKernelExecInfo function to describe how Unified Shared Memory allocations are accessed indirectly by a kernel:

Table 28. List of supported param_names by clSetKernelExecInfo
cl_kernel_exec_info	Type	Description
`CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL`	void*[]	Specifies an explicit set of Unified Shared Memory allocations accessed indirectly by the kernel. The new set replaces any previously specified set of Unified Shared Memory allocations. Initially, the set of Unified Shared Memory allocations accessed indirectly by the kernel is the empty set.
`CL_KERNEL_EXEC_INFO_INDIRECT_HOST_ACCESS_INTEL`	cl_bool	Specifies that the kernel may access any host Unified Shared Memory allocation indirectly. By default, the value for this flag is `CL_FALSE`, indicating that the kernel will only access explicitly specified host Unified Shared Memory allocations.
`CL_KERNEL_EXEC_INFO_INDIRECT_DEVICE_ACCESS_INTEL`	cl_bool	Specifies that the kernel may access any device Unified Shared Memory allocation indirectly. By default, the value for this flag is `CL_FALSE`, indicating that the kernel will only access explicitly specified device Unified Shared Memory allocations.
`CL_KERNEL_EXEC_INFO_INDIRECT_SHARED_ACCESS_INTEL`	cl_bool	Specifies that the kernel may access any shared Unified Shared Memory allocation indirectly. By default, the value for this flag is `CL_FALSE`, indicating that the kernel will only access explicitly specified shared Unified Shared Memory allocations.
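
To make the interaction between the explicit set and the three flags in the table above concrete, here is a minimal host-side model. The exec_info_state type and both functions are hypothetical, and the model makes a simplifying assumption: allocations are identified by their base pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocation kinds mirroring host, device, and shared USM. */
typedef enum { ALLOC_HOST, ALLOC_DEVICE, ALLOC_SHARED } alloc_kind;

/* Hypothetical snapshot of the exec info set on one kernel. */
typedef struct {
    const void *const *usm_ptrs;  /* models CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL */
    size_t num_usm_ptrs;
    bool indirect_host;           /* models ..._INDIRECT_HOST_ACCESS_INTEL */
    bool indirect_device;         /* models ..._INDIRECT_DEVICE_ACCESS_INTEL */
    bool indirect_shared;         /* models ..._INDIRECT_SHARED_ACCESS_INTEL */
} exec_info_state;

/* Setting the explicit set replaces the previous set; it does not append. */
void set_usm_ptrs(exec_info_state *s, const void *const *ptrs, size_t n)
{
    assert(s != NULL);
    s->usm_ptrs = ptrs;
    s->num_usm_ptrs = n;
}

/* True if the kernel has declared that it may access the allocation indirectly,
 * either through a blanket flag for its kind or through the explicit set. */
bool declared_for_indirect_access(const exec_info_state *s,
                                  alloc_kind kind, const void *base)
{
    assert(s != NULL);
    if (kind == ALLOC_HOST && s->indirect_host) return true;
    if (kind == ALLOC_DEVICE && s->indirect_device) return true;
    if (kind == ALLOC_SHARED && s->indirect_shared) return true;
    for (size_t i = 0; i < s->num_usm_ptrs; ++i) {
        if (s->usm_ptrs[i] == base) return true;
    }
    return false;
}
```

The model starts with an empty set and all flags false, which corresponds to the defaults described in the table.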

The following errors may be returned by clSetKernelExecInfo for these new param_name values:

CL_INVALID_OPERATION if param_name is CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL and no devices in the context associated with kernel support Unified Shared Memory.
CL_INVALID_OPERATION if param_name is CL_KERNEL_EXEC_INFO_INDIRECT_HOST_ACCESS_INTEL and no devices in the context associated with kernel support host Unified Shared Memory allocations.
CL_INVALID_OPERATION if param_name is CL_KERNEL_EXEC_INFO_INDIRECT_DEVICE_ACCESS_INTEL and no devices in the context associated with kernel support device Unified Shared Memory allocations.
CL_INVALID_OPERATION if param_name is CL_KERNEL_EXEC_INFO_INDIRECT_SHARED_ACCESS_INTEL and no devices in the context associated with kernel support shared Unified Shared Memory allocations.

The definition of a valid pointer value specified using CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL was changed in extension version 1.1.0, in the same way as for clSetKernelArgMemPointerINTEL.

Filling and Copying Unified Shared Memory

The function

cl_int  clEnqueueMemFillINTEL(
            cl_command_queue command_queue,
            void* dst_ptr,
            const void* pattern,
            size_t pattern_size,
            size_t size,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

fills a region of memory with the specified pattern.

command_queue is a valid host command-queue. The memory fill command will be queued for execution on the device associated with command_queue.

dst_ptr is a pointer to the start of the memory region to fill. The Unified Shared Memory allocation pointed to by dst_ptr must be valid for the context associated with command_queue, must be accessible by the device associated with command_queue, and must be aligned to pattern_size bytes.

pattern is a pointer to the value to write to the Unified Shared Memory region. The memory associated with pattern can be reused or freed after the function returns.

pattern_size describes the size of the value to write to the Unified Shared Memory region, in bytes. pattern_size must be a power of two and must be less than or equal to the size of the largest integer or floating-point vector data type supported by the device.

size describes the size of the memory region to set, in bytes. size must be a multiple of pattern_size.
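
The effect of a fill can be pictured with a host-only reference routine that repeats the pattern across the region. This is a sketch of the intended result, not of how a device performs the fill; reference_fill is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side reference for the fill result: the pattern is written
 * size / pattern_size times, back to back, starting at dst.
 * Assumes size is a multiple of pattern_size, as the text requires. */
void reference_fill(void *dst, const void *pattern,
                    size_t pattern_size, size_t size)
{
    assert(dst != NULL && pattern != NULL);
    if (pattern_size == 0)
        return;  /* nothing sensible to write; avoids an endless loop */
    unsigned char *out = (unsigned char *)dst;
    for (size_t offset = 0; offset + pattern_size <= size; offset += pattern_size) {
        memcpy(out + offset, pattern, pattern_size);
    }
}
```

For example, filling 8 bytes with the 2-byte pattern AB CD produces AB CD AB CD AB CD AB CD.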

event_wait_list and num_events_in_wait_list specify events that need to complete before this command can be executed. If event_wait_list is NULL, then this command does not wait on any event to complete. If event_wait_list is NULL, num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events pointed to by event_wait_list must be valid and num_events_in_wait_list must be greater than 0. The events specified in event_wait_list act as synchronization points. The context associated with events in event_wait_list and command_queue must be the same. The memory associated with event_wait_list can be reused or freed after the function returns.

event returns a unique event object that identifies this command. If event is NULL, no event will be created and therefore it will not be possible to query or wait for this command. If the event_wait_list and the event arguments are not NULL, the event argument must not refer to an element of the event_wait_list array.
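
The consistency rules for event_wait_list, num_events_in_wait_list, and event can be checked on the host before enqueueing. The sketch below uses an opaque event_handle typedef as a stand-in for cl_event and treats a NULL handle as an invalid event; both choices, and the function names, are assumptions made for illustration. The numeric error value mirrors CL_INVALID_EVENT_WAIT_LIST from CL/cl.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SUCCESS 0
#define SKETCH_INVALID_EVENT_WAIT_LIST (-57)  /* same value as CL_INVALID_EVENT_WAIT_LIST */

typedef const void *event_handle;  /* hypothetical stand-in for cl_event */

/* Checks that the list pointer and the count agree with each other. */
int check_wait_list(unsigned num_events, const event_handle *list)
{
    if (list == NULL && num_events > 0)
        return SKETCH_INVALID_EVENT_WAIT_LIST;
    if (list != NULL && num_events == 0)
        return SKETCH_INVALID_EVENT_WAIT_LIST;
    for (unsigned i = 0; i < num_events; ++i) {
        if (list[i] == NULL)  /* proxy for "not a valid event" */
            return SKETCH_INVALID_EVENT_WAIT_LIST;
    }
    return SKETCH_SUCCESS;
}

/* True if the output event slot is one of the wait list elements,
 * which the text above does not allow. */
bool event_aliases_wait_list(const event_handle *event_out,
                             unsigned num_events, const event_handle *list)
{
    if (event_out == NULL || list == NULL)
        return false;
    for (unsigned i = 0; i < num_events; ++i) {
        if (event_out == &list[i])
            return true;
    }
    return false;
}
```

The same checks apply unchanged to the other enqueue functions in this extension that take a wait list.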

clEnqueueMemFillINTEL returns CL_SUCCESS if the command is queued successfully. Otherwise, it will return one of the following errors:

CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-queue.
CL_INVALID_CONTEXT if the context associated with command_queue and events in event_wait_list are not the same.
CL_INVALID_VALUE if dst_ptr is NULL, or if dst_ptr is not aligned to pattern_size bytes.
CL_INVALID_VALUE if pattern is NULL.
CL_INVALID_VALUE if pattern_size is not a power of two or is greater than the size of the largest integer or floating-point vector data type supported by the device associated with command_queue.
CL_INVALID_VALUE if size is not a multiple of pattern_size.
CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and num_events_in_wait_list is greater than zero, or if event_wait_list is not NULL and num_events_in_wait_list is zero, or if event objects in event_wait_list are not valid events.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.
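
The CL_INVALID_VALUE conditions listed above for clEnqueueMemFillINTEL can be mirrored in a host-side pre-check. The function name and the max_pattern_size parameter are hypothetical; the caller is assumed to know the size of the largest vector data type supported by the device. The numeric error value mirrors CL_INVALID_VALUE from CL/cl.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SUCCESS 0
#define SKETCH_INVALID_VALUE (-30)  /* same value as CL_INVALID_VALUE */

int check_fill_args(const void *dst_ptr, const void *pattern,
                    size_t pattern_size, size_t size,
                    size_t max_pattern_size)
{
    if (dst_ptr == NULL || pattern == NULL)
        return SKETCH_INVALID_VALUE;
    /* A power of two is nonzero and has exactly one bit set. */
    if (pattern_size == 0 || (pattern_size & (pattern_size - 1)) != 0)
        return SKETCH_INVALID_VALUE;
    if (pattern_size > max_pattern_size)
        return SKETCH_INVALID_VALUE;
    /* dst_ptr must be aligned to pattern_size bytes. */
    if (((uintptr_t)dst_ptr & (pattern_size - 1)) != 0)
        return SKETCH_INVALID_VALUE;
    if (size % pattern_size != 0)
        return SKETCH_INVALID_VALUE;
    return SKETCH_SUCCESS;
}
```

A caller might pass 128 for max_pattern_size on a device whose largest vector type is 128 bytes; that figure is an example, not a value defined by this extension.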

The function

cl_int  clEnqueueMemcpyINTEL(
            cl_command_queue command_queue,
            cl_bool blocking,
            void* dst_ptr,
            const void* src_ptr,
            size_t size,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

copies a region of memory from one location to another.

command_queue is a valid host command-queue. The memory copy command will be queued for execution on the device associated with command_queue.

blocking indicates if the copy operation is blocking or non-blocking. If blocking is CL_TRUE, the copy command is blocking, and the function will not return until the copy command is complete. Otherwise, if blocking is CL_FALSE, the copy command is non-blocking, and the contents of dst_ptr cannot be used, nor can the contents of src_ptr be overwritten, until the copy command is complete.

dst_ptr is a pointer to the start of the memory region to copy to. If dst_ptr is a pointer into a Unified Shared Memory allocation it must be valid for the context associated with command_queue.

src_ptr is a pointer to the start of the memory region to copy from. If src_ptr is a pointer into a Unified Shared Memory allocation it must be valid for the context associated with command_queue.

size describes the size of the memory region to copy, in bytes.

clEnqueueMemcpyINTEL returns CL_SUCCESS if the command is queued successfully. Otherwise, it will return one of the following errors:

CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-queue.
CL_INVALID_CONTEXT if the context associated with command_queue and events in event_wait_list are not the same.
CL_INVALID_VALUE if either dst_ptr or src_ptr is NULL.
CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and num_events_in_wait_list is greater than zero, or if event_wait_list is not NULL and num_events_in_wait_list is zero, or if event objects in event_wait_list are not valid events.
CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the copy operation is blocking and the execution status of any of the events in event_wait_list is a negative integer value.
CL_MEM_COPY_OVERLAP if the values specified for dst_ptr, src_ptr and size result in an overlapping copy.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.
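
The CL_MEM_COPY_OVERLAP condition above amounts to asking whether two byte ranges of equal length intersect. A minimal sketch of that test follows; copy_regions_overlap is a hypothetical helper, and treating a zero-sized copy as non-overlapping is an assumption, since this section does not define that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if [dst, dst + size) and [src, src + size) share at least one byte.
 * Comparing the distance between the start addresses with size avoids
 * computing dst + size or src + size, which could wrap around. */
bool copy_regions_overlap(const void *dst_ptr, const void *src_ptr, size_t size)
{
    uintptr_t d = (uintptr_t)dst_ptr;
    uintptr_t s = (uintptr_t)src_ptr;
    if (size == 0)
        return false;  /* assumption: an empty copy overlaps nothing */
    return (d < s) ? (s - d < size) : (d - s < size);
}
```

With a 32-byte buffer, copying 16 bytes from offset 8 to offset 0 overlaps, while copying 16 bytes from offset 16 to offset 0 does not.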

Unified Shared Memory Hints

The function

cl_int  clEnqueueMigrateMemINTEL(
            cl_command_queue command_queue,
            const void* ptr,
            size_t size,
            cl_mem_migration_flags flags,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

explicitly migrates a region of a shared Unified Shared Memory allocation to the device associated with command_queue. This is a hint that may improve performance and is not required for correctness. Memory migration may not be supported for all allocation types for all devices. If memory migration is not supported for the specified memory range then the migration hint may be ignored. Memory migration may only be supported at a device-specific granularity, such as a page boundary. In this case, the memory range may be expanded such that the start and end of the range satisfy the granularity requirements.
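
The range expansion described above can be sketched as rounding the start of the range down and the end of the range up to the migration granularity. The byte_range type, the function name, and the 4096-byte granularity used in the example are assumptions; the actual granularity is device-specific and is not exposed by this extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open address range [begin, end). */
typedef struct {
    uintptr_t begin;
    uintptr_t end;
} byte_range;

/* Expands [addr, addr + size) so that both ends fall on a multiple of
 * granularity. granularity must be a nonzero power of two. */
byte_range expand_to_granularity(uintptr_t addr, size_t size, uintptr_t granularity)
{
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    uintptr_t mask = granularity - 1;
    byte_range r;
    r.begin = addr & ~mask;                /* round the start down */
    r.end = (addr + size + mask) & ~mask;  /* round the end up */
    return r;
}
```

For instance, with a 4096-byte granularity, a request covering addresses 0x1234 to 0x1334 would be treated as covering 0x1000 to 0x2000.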

command_queue is a valid host command-queue. The memory migration command will be queued for execution on the device associated with command_queue.

ptr is a pointer to the start of the shared Unified Shared Memory allocation to migrate.

size describes the size of the memory region to migrate.

flags is a bit-field that is used to specify memory migration options.

clEnqueueMigrateMemINTEL returns CL_SUCCESS if the command is queued successfully. Otherwise, it will return one of the following errors:

CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-queue.
CL_INVALID_CONTEXT if the context associated with command_queue and events in event_wait_list are not the same.
CL_INVALID_VALUE (TODO: it is not yet decided whether any values of ptr and size are considered invalid).
CL_INVALID_VALUE if flags is zero or is not a supported combination of memory migration flags.
CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and num_events_in_wait_list is greater than zero, or if event_wait_list is not NULL and num_events_in_wait_list is zero, or if event objects in event_wait_list are not valid events.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

The function

cl_int  clEnqueueMemAdviseINTEL(
            cl_command_queue command_queue,
            const void* ptr,
            size_t size,
            cl_mem_advice_intel advice,
            cl_uint num_events_in_wait_list,
            const cl_event* event_wait_list,
            cl_event* event);

provides advice about a region of a shared Unified Shared Memory allocation. Memory advice is a performance hint only and is not required for correctness. Providing memory advice hints may override driver heuristics that control shared memory behavior. Not all memory advice hints may be supported for all allocation types for all devices. If a memory advice hint is not supported by the device it will be ignored. Memory advice hints may only be supported at a device-specific granularity, such as at a page boundary. In this case, the memory range may be expanded such that the start and end of the range satisfy the granularity requirements.

command_queue is a valid host command-queue. The memory advice hints will be queued for the device associated with command_queue.

ptr is a pointer to the start of the shared Unified Shared Memory allocation.

size describes the size of the memory region.

advice is a bit-field describing the memory advice hints for the region.

clEnqueueMemAdviseINTEL returns CL_SUCCESS if the command is queued successfully. Otherwise, it will return one of the following errors:

CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-queue.
CL_INVALID_CONTEXT if the context associated with command_queue and events in event_wait_list are not the same.
CL_INVALID_VALUE (TODO: it is not yet decided whether any values of ptr and size are considered invalid).
CL_INVALID_VALUE if advice is not a supported memory advice value for the device associated with command_queue.
CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and num_events_in_wait_list is greater than zero, or if event_wait_list is not NULL and num_events_in_wait_list is zero, or if event objects in event_wait_list are not valid events.
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.

Interactions with Other Extensions

If cl_intel_mem_alloc_buffer_location is supported, then clDeviceMemAllocINTEL, clSharedMemAllocINTEL, clHostMemAllocINTEL, and clGetMemAllocInfoINTEL also accept CL_MEM_ALLOC_BUFFER_LOCATION_INTEL for cl_mem_properties_intel.

Issues

Is there a minimum supported granularity for concurrent access? For example, might it be possible to concurrently access different pages of an allocation, but not different bytes within the same page?

UNRESOLVED:

What other Unified Shared Memory allocation properties should we support?

UNRESOLVED: The proposed Unified Shared Memory allocation APIs accept cl_mem_alloc_flags_intel. We could also accept (some? all?) cl_mem_flags, for example.

Do we need separate "concurrent access" capabilities for host access vs. device access?

UNRESOLVED: We don’t differentiate right now, but we could differentiate between concurrent host access vs. concurrent access from another device.

What would we need to add to support system allocations?

RESOLVED: Added CL_DEVICE_SHARED_SYSTEM_MEM_CAPABILITIES_INTEL.

Do we need the ability to "register" or "use" an existing host allocation?

UNRESOLVED: Currently, only the ability to "allocate" host memory is supported. If we did support this then there may be alignment and size granularity requirements for "registering" a host allocation.

Do we want to support both a flags argument and a properties argument to the USM allocation APIs?

RESOLVED: The flags argument was folded into the properties in revision C.

What should behavior be for clGetMemAllocInfoINTEL if the passed-in ptr is NULL or doesn’t point into a USM allocation?

RESOLVED: The behavior was defined for all queries for this case in revision G.

Do we want separate "memset" APIs to set values of different sizes, such as 8 bits, 16 bits, 32 bits, or others? Do we want to go back to a "fill" API?

RESOLVED: Switched to a "fill" API in revision I.

Discussion: The host "memset" only sets to an 8-bit value. Switching back to a "fill" API is very flexible, but perhaps overkill, since it supports any supported integer or floating-point scalar or vector type.

What are the restrictions for the dst_ptr values that can be passed to the "fill" API?

UNRESOLVED: Need to close on:
- Can a device "fill" another device’s allocation? (Recommendation: Yes, if accessible.)
- Can a device "fill" arbitrary host memory? (Recommendation: Maybe?)
- Can a device "fill" a USM allocation from another context? (Recommendation: No.)

What are the restrictions for the src_ptr and dst_ptr values that can be passed to the "memcpy" API?

UNRESOLVED: Need to close on:
- Can a device "memcpy" from another device’s allocation?
- Can a device "memcpy" to another device’s allocation?
- Can a device "memcpy" to or from a USM allocation in another context? (Recommendation: No?)
- Can a device "memcpy" to arbitrary host memory? (Recommendation: Yes.)
- Can a device "memcpy" from arbitrary host memory? (Recommendation: Yes.)
- Can a device "memcpy" from arbitrary host memory to arbitrary host memory? (Recommendation: Yes.)
- Can the memory region to copy to overlap the memory region to copy from? (Recommendation: No?)

Do we want to support migrating to devices other than the device associated with command_queue?

UNRESOLVED: We could add an explicit dst_device argument if desired, which could be NULL when migrating to the device associated with the command_queue. We could also add a mechanism to allow migrating to the host.

Should clEnqueueMigrateMemINTEL support migrating an array of pointers with one API call, similar to clEnqueueSVMMigrateMem?

UNRESOLVED: This depends on how frequently the migrate APIs are called.

Could the device argument to clSharedMemAllocINTEL be NULL if there is no need to associate the shared allocation to a specific device?

RESOLVED: Yes, this case is documented in revision G.

Should we allow querying the associated device for a USM allocation using clGetMemAllocInfoINTEL?

RESOLVED: This query was added in revision G.

Should we add explicit mem alloc flags for CACHED and UNCACHED?

UNRESOLVED:

At least for HOST and SHARED allocations, should we have separate mem alloc flags for the host and the device?

UNRESOLVED:

What are invalid values for ptr and size for clEnqueueMigrateMemINTEL and clEnqueueMemAdviseINTEL? How about clEnqueueMemFillINTEL and clEnqueueMemcpyINTEL? Specifically, is NULL a valid value for ptr? Is size equal to zero valid?

UNRESOLVED:

Should we add a device query for a maximum supported USM alignment, or should the maximum supported alignment implicitly be defined by the size of the largest data type supported by the device? Should we allow implementation-defined behavior for alignments larger than the size of the largest data type supported by the device?

UNRESOLVED: A device query would allow for larger supported alignments, such as page alignment. Note that supported alignments should always be a power of two.

Note that there are no maximum supported alignments defined for posix_memalign or _aligned_malloc, and supported alignments for the standard aligned_alloc and std::aligned_alloc are implementation-defined.

Should we add a device query for a maximum supported USM fill pattern size, or should the maximum supported fill pattern size implicitly be defined by the size of the largest data type supported by the device?

UNRESOLVED: A device query would allow for larger fill patterns. Note that the fill pattern size should always be a power of two.

Can a pointer to a device, host, or shared USM allocation be used to create a cl_mem using CL_MEM_USE_HOST_PTR?

UNRESOLVED: Trending "no" in all cases. If the USM allocation is from the same context this could be an error, such as CL_INVALID_HOST_PTR. If the USM allocation is from a different context then behavior could be undefined.

Can a pointer to a device, host, or shared USM allocation be used to create a cl_mem buffer using CL_MEM_COPY_HOST_PTR?

UNRESOLVED: Trending "no" for device and shared USM allocations. If the USM allocation is from the same context this could be an error, such as CL_INVALID_HOST_PTR. If the USM allocation is from a different context then behavior could be undefined.

Trending "yes" for host USM allocations, both when the host USM allocation is from this context and from another context.

Can a pointer to a device, host, or shared USM allocation be passed to API functions to read from or write to cl_mem objects, such as clEnqueueReadBuffer or clEnqueueWriteImage?

UNRESOLVED: Trending "yes" for device USM allocations, so long as the device USM allocation is accessible by the device associated with the command-queue, and the device allocation was made against the context associated with the command-queue.

Trending "yes" for host USM allocations, both when the host USM allocation is from this context and from another context.

Trending "no" for shared USM allocations. If the shared USM allocation is from the same context this could be an error, such as CL_INVALID_HOST_PTR. If the shared USM allocation is from a different context then behavior could be undefined.

Can a pointer to a device, host, or shared USM allocation be passed to API functions to fill a cl_mem, SVM allocation, or USM allocation, such as clEnqueueFillBuffer?

UNRESOLVED: Trending "no" for device and shared allocations. If the USM allocation is from the same context this could be an error, such as CL_INVALID_HOST_PTR. If the USM allocation is from a different context then behavior could be undefined.

Trending "yes" for host USM allocations, both when the host USM allocation is from this context and from another context.

Should we support passing traditional cl_mem_flags via the USM allocation properties?

UNRESOLVED: Trending "yes", by allowing CL_MEM_FLAGS as a property and cl_mem_flags as the property value.

Note that some flags will not be valid, such as CL_MEM_USE_HOST_PTR.

Exactly how does Unified Shared Memory affect the memory model?

UNRESOLVED:

Should it be an error to set an unknown pointer as a kernel argument using clSetKernelArgMemPointerINTEL if no devices support shared system allocations?

RESOLVED: The behavior of clSetKernelArgMemPointerINTEL was changed in version 1.1.0 of this extension.

Prior to version 1.1.0, it was considered an error to set an arbitrary pointer value as an argument to a kernel if no devices support system USM. This was helpful to identify possible programming errors, however it did not match the behavior of passing a pointer to a function on the host, where it is only a programming error if an invalid pointer is dereferenced. To provide a similar programming experience, the error condition was relaxed in version 1.1.0, and any arbitrary pointer value may be passed to a kernel.

The behavior was also changed for clSetKernelExecInfo(CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL), similarly.

If desired, additional checks to identify possible programming errors may still be provided via optional USM checking layers, such as the USMChecking functionality in the OpenCL Intercept Layer.

Should we support a 2D "rect" memcpy similar to clEnqueueCopyBufferRect?

UNRESOLVED: This would be a fairly straightforward addition if it is useful.

Note that there is no similar 2D "rect" memcpy for SVM.

We could also support a 2D "rect" fill or memset, though there are no similar functions for cl_mem buffers or SVM.

Should there be an upper limit on the size of an allocation using clHostMemAllocINTEL? If so, what should the upper limit be?

UNRESOLVED: The upper limit is currently defined by CL_DEVICE_MAX_MEM_ALLOC_SIZE and if the allocation size exceeds this value then clHostMemAllocINTEL returns CL_INVALID_BUFFER_SIZE.

This behavior is consistent with clSVMAlloc (although clSVMAlloc does not return an error code it is specified to return a NULL pointer in this case) and clCreateBuffer. However, because clHostMemAllocINTEL is intended to allocate host memory, some implementations are able to support larger allocation sizes using clHostMemAllocINTEL.

Possible resolutions:
- Add a new query representing the maximum host memory allocation size supported by the device, e.g. CL_DEVICE_MAX_HOST_MEM_ALLOC_SIZE_INTEL. For some devices, this query will return the same value as CL_DEVICE_MAX_MEM_ALLOC_SIZE, but for other devices this query will return a larger value.
- Relax the error behavior so implementations may return CL_INVALID_BUFFER_SIZE, but they would not be required to return an error if they support larger allocation sizes.
- Do nothing and keep the existing error behavior.

Should it be an error to allocate zero bytes?

UNRESOLVED: Currently, attempting to allocate zero bytes fails and returns CL_INVALID_BUFFER_SIZE. This is consistent with SVM, where clSVMAlloc fails and returns a NULL pointer if the size to allocate is zero. It is also consistent with CUDA, where cuMemAlloc and similar functions return an error if the size to allocate is zero.

However, it is not necessarily consistent with other memory allocation functions. For example:
- The result of calling malloc(0) is implementation-defined: it can either return a NULL pointer or a unique non-null pointer that must be freed. If a NULL pointer is returned then errno may be set to an implementation-defined value. If a unique non-null pointer is returned then it cannot be dereferenced.
- Allocating an array of zero elements using new must return a non-null pointer, though dereferencing the pointer is undefined.

Possible resolutions:
- Allow zero-sized allocations and require returning a non-null pointer that must be freed.
- Allow zero-sized allocations but allow returning a NULL pointer. No error would be generated, even if a NULL pointer is returned.
- Specify that this case is implementation-defined.
- Do nothing and keep the existing error behavior.
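
For reference, the current size checks discussed in the two issues above (an upper limit of CL_DEVICE_MAX_MEM_ALLOC_SIZE and an error for zero bytes) can be sketched as follows. check_host_alloc_size is a hypothetical helper, and the limit is assumed to have been queried from the device by the caller. The numeric error value mirrors CL_INVALID_BUFFER_SIZE from CL/cl.h.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS 0
#define SKETCH_INVALID_BUFFER_SIZE (-61)  /* same value as CL_INVALID_BUFFER_SIZE */

/* Models the existing error behavior only; none of the possible
 * resolutions listed in the issues are implemented here. */
int check_host_alloc_size(size_t size, unsigned long long max_mem_alloc_size)
{
    if (size == 0)
        return SKETCH_INVALID_BUFFER_SIZE;  /* zero-byte allocations fail */
    if ((unsigned long long)size > max_mem_alloc_size)
        return SKETCH_INVALID_BUFFER_SIZE;  /* exceeds the device limit */
    return SKETCH_SUCCESS;
}
```

Relaxing either check, as some of the possible resolutions suggest, would change only the corresponding branch.
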
Can a device USM allocation for a parent device be accessed by its sub-devices? Can a single device shared USM allocation associated with a parent device be accessed by its sub-devices?

UNRESOLVED: Since a sub-device is a partition of a parent device a USM allocation against a parent device should be accessible by its sub-devices. We could document this expectation explicitly in this extension if it is not already covered by the main OpenCL specification.

Note that a USM allocation against a sub-device need not be accessible by its parent device or by other sibling sub-devices, though some implementations may support this, just like some implementations optionally support access to USM allocations from other devices.

Revision History

Rev	Date	Author	Changes
1.0.0	2021-11-07	Ben Ashbaugh	Added version and other minor updates prior to posting on the OpenCL registry.
1.0.0	2022-11-08	Ben Ashbaugh	Added new issues regarding error behavior for clSetKernelArgMemPointerINTEL and rect copies.
1.0.1	2023-08-28	Ben Ashbaugh	Documented error conditions for clSetKernelExecInfo.
1.1.0	2024-07-30	Ben Ashbaugh	Modified error behavior for clSetKernelArgMemPointerINTEL and clSetKernelExecInfo.