Name Strings

    cl_ext_device_fission

Contributors

    Greg Bellows, IBM
    Ben Gaster, AMD
    Aaftab Munshi, Apple
    Ian Ollmann, Apple
    Ofer Rosenberg, Intel
    Brian Watt, IBM

Contact

    Ian Ollmann, iano 'at' apple 'dot' com

IP Status

    No known IP issues

Version

    Version 1, June 9, 2010

Number

    OpenCL Extension #11

Status

    Approved by all contributors

Extension Type

    OpenCL device extension

Dependencies

    None

Overview

    This extension provides an interface for sub-dividing an OpenCL device
    into multiple sub-devices. There are a number of cases in which a
    typical user would like to subdivide a device:

        1. To reserve part of the device for high priority /
           latency-sensitive tasks
        2. To more directly control the assignment of work to individual
           compute units
        3. To subdivide compute devices along some shared hardware feature
           like a cache

    Typically these are areas where some level of additional control is
    required to get optimal performance beyond that provided by standard
    OpenCL 1.1 APIs. Proper use of this interface assumes some detailed
    knowledge of the devices in question.

Header File

    Extension interfaces and constants are defined in cl_ext.h

New Procedures and Functions

    cl_int clReleaseDeviceEXT( cl_device_id device );

    cl_int clRetainDeviceEXT( cl_device_id device );

    cl_int clCreateSubDevicesEXT( cl_device_id in_device,
                                  const cl_device_partition_property_ext * properties,
                                  cl_uint num_entries,
                                  cl_device_id *out_devices,
                                  cl_uint *num_devices );

New Types

    The following new types have been added for declaring device
    partitioning properties.
    typedef cl_bitfield cl_device_partition_property_ext;

New Tokens

    Accepted as a property name in the properties parameter of
    clCreateSubDevicesEXT:

        CL_DEVICE_PARTITION_EQUALLY_EXT             0x4050
        CL_DEVICE_PARTITION_BY_COUNTS_EXT           0x4051
        CL_DEVICE_PARTITION_BY_NAMES_EXT            0x4052
        CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT  0x4053

    Accepted as a property name, when accompanying the
    CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT property, in the properties
    parameter of clCreateSubDevicesEXT:

        CL_AFFINITY_DOMAIN_L1_CACHE_EXT             0x1
        CL_AFFINITY_DOMAIN_L2_CACHE_EXT             0x2
        CL_AFFINITY_DOMAIN_L3_CACHE_EXT             0x3
        CL_AFFINITY_DOMAIN_L4_CACHE_EXT             0x4
        CL_AFFINITY_DOMAIN_NUMA_EXT                 0x10
        CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT     0x100

    Accepted as a property being queried in the param_name argument of
    clGetDeviceInfo:

        CL_DEVICE_PARENT_DEVICE_EXT                 0x4054
        CL_DEVICE_PARTITION_TYPES_EXT               0x4055
        CL_DEVICE_AFFINITY_DOMAINS_EXT              0x4056
        CL_DEVICE_REFERENCE_COUNT_EXT               0x4057
        CL_DEVICE_PARTITION_STYLE_EXT               0x4058

    Accepted as the property list terminator in the properties parameter
    of clCreateSubDevicesEXT:

        CL_PROPERTIES_LIST_END_EXT                  0x0

    Accepted as the partition counts list terminator in the properties
    parameter of clCreateSubDevicesEXT:

        CL_PARTITION_BY_COUNTS_LIST_END_EXT         0x0

    Accepted as the partition names list terminator in the properties
    parameter of clCreateSubDevicesEXT:

        CL_PARTITION_BY_NAMES_LIST_END_EXT          -1

    Returned by clCreateSubDevicesEXT when the indicated partition scheme
    is supported by the implementation, but the implementation can not
    further partition the device in this way.

        CL_DEVICE_PARTITION_FAILED_EXT              -1057

    Returned by clCreateSubDevicesEXT when the total number of compute
    units requested exceeds CL_DEVICE_MAX_COMPUTE_UNITS, or the number of
    compute units for any one sub-device is less than 1.

        CL_INVALID_PARTITION_COUNT_EXT              -1058

    Returned by clCreateSubDevicesEXT when a compute unit name appearing
    in a name list following CL_DEVICE_PARTITION_BY_NAMES_EXT is not in
    range.
        CL_INVALID_PARTITION_NAME_EXT               -1059

Additions to Section 4.2 (Querying Devices)

    cl_int clReleaseDeviceEXT( cl_device_id device );

    clReleaseDeviceEXT decrements the device reference count. After the
    reference count reaches zero, the object shall be destroyed and
    associated resources released for reuse by the system.

    clReleaseDeviceEXT returns CL_SUCCESS if the function is executed
    successfully or the device is a root level device. It returns
    CL_INVALID_DEVICE if device is not a valid device.

    cl_int clRetainDeviceEXT( cl_device_id device );

    clRetainDeviceEXT increments the device reference count.

    clRetainDeviceEXT returns CL_SUCCESS if the function is executed
    successfully or the device is a root level device. It returns
    CL_INVALID_DEVICE if device is not a valid device.

    CAUTION: Since root level devices are generally returned by a clGet
    call (clGetDeviceIDs) and not a clCreate call, the user generally does
    not own a reference count for root level devices. The reference count
    attached to a device returned from clGetDeviceIDs is owned by the
    implementation. Developers need to be careful when releasing
    cl_device_ids to always balance each call to clReleaseDeviceEXT for a
    device with a matching clCreateSubDevicesEXT or clRetainDeviceEXT. By
    convention, software layers that own a reference count should
    themselves be responsible for releasing it.

    typedef cl_bitfield cl_device_partition_property_ext;

    cl_int clCreateSubDevicesEXT( cl_device_id in_device,
                                  const cl_device_partition_property_ext * properties,
                                  cl_uint num_entries,
                                  cl_device_id *out_devices,
                                  cl_uint *num_devices );

    clCreateSubDevicesEXT creates an array of sub-devices that each
    reference a nonintersecting set of compute units within in_device,
    according to a partition scheme given by the properties list. The
    output sub-devices may be used in every way that the root device can
    be used, including building programs, further calls to
    clCreateSubDevicesEXT and creating command queues.
    They may also be used within any context created using the in_device
    or parent/ancestor thereof. When a command queue is created against a
    sub-device, the commands enqueued on that queue are executed only on
    the sub-device.

    in_device - The device to be partitioned.

    num_entries - The number of cl_device_ids that will fit in the array
    pointed to by out_devices. If out_devices is not NULL, num_entries
    must be greater than zero.

    out_devices - On output, the array pointed to by out_devices will
    contain up to num_entries sub-devices. If the out_devices argument is
    NULL, it is ignored. The number of cl_device_ids returned is the
    minimum of num_entries and the number of devices created by the
    partition scheme.

    num_devices - On output, the number of devices that the in_device may
    be partitioned into according to the partitioning scheme given by
    properties. If num_devices is NULL, it is ignored.

    properties - A zero terminated list of device fission
    {property-value, cl_int[]} pairs that describe how to partition the
    device into sub-devices. properties may not be NULL. Only one of
    CL_DEVICE_PARTITION_EQUALLY_EXT, CL_DEVICE_PARTITION_BY_COUNTS_EXT,
    CL_DEVICE_PARTITION_BY_NAMES_EXT or
    CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT may be used in the same
    properties list. Available properties are:

        CL_DEVICE_PARTITION_EQUALLY_EXT - Split the aggregate device into
        as many smaller aggregate devices as can be created, each
        containing N compute units. The value N is passed as the value
        accompanying this property. If N does not divide evenly into
        CL_DEVICE_MAX_COMPUTE_UNITS then the remaining compute units are
        not used.

            Example: To divide a device containing 16 compute units into
            two sub-devices, each containing 8 compute units, pass:

                { CL_DEVICE_PARTITION_EQUALLY_EXT, 8,
                  CL_PROPERTIES_LIST_END_EXT }

        CL_DEVICE_PARTITION_BY_COUNTS_EXT - This property is followed by a
        CL_PARTITION_BY_COUNTS_LIST_END_EXT terminated list of compute
        unit counts. For each non-zero count M in the list, a sub-device
        is created with M compute units in it.
        CL_PARTITION_BY_COUNTS_LIST_END_EXT is defined to be 0.
            Example: To split a four compute unit device into two
            sub-devices, each containing two compute units, pass:

                { CL_DEVICE_PARTITION_BY_COUNTS_EXT,
                  2, 2, CL_PARTITION_BY_COUNTS_LIST_END_EXT,
                  CL_PROPERTIES_LIST_END_EXT }

            The first 2 means put two compute units in the first
            sub-device. The second 2 means put two compute units in the
            second sub-device. CL_PARTITION_BY_COUNTS_LIST_END_EXT
            terminates the list of sub-devices.
            CL_PROPERTIES_LIST_END_EXT terminates the list of properties.
            The total number of compute units specified may not exceed the
            number of compute units in the device.

        CL_DEVICE_PARTITION_BY_NAMES_EXT - This property is followed by a
        CL_PARTITION_BY_NAMES_LIST_END_EXT terminated list of compute unit
        names. Compute unit names are integers that count up from zero to
        the number of compute units less one.
        CL_PARTITION_BY_NAMES_LIST_END_EXT is defined to be -1. Only one
        sub-device may be created at a time with this selector. An
        individual compute unit name may not appear more than once in the
        sub-device description.

            Example: To create a three compute unit sub-device using
            compute units { 0, 1, 3 }, pass:

                { CL_DEVICE_PARTITION_BY_NAMES_EXT,
                  0, 1, 3, CL_PARTITION_BY_NAMES_LIST_END_EXT,
                  CL_PROPERTIES_LIST_END_EXT }

            The meanings of these values are, in order:

                0   the name of the first compute unit in the sub-device
                1   the name of the second compute unit in the sub-device
                3   the name of the third compute unit in the sub-device
                CL_PARTITION_BY_NAMES_LIST_END_EXT
                    list terminator for the list of compute unit names
                CL_PROPERTIES_LIST_END_EXT
                    list terminator for the list of properties

        CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT - Split the device into
        smaller aggregate devices containing one or more compute units
        that all share part of a cache hierarchy. The value accompanying
        this property may be drawn from the following CL_AFFINITY_DOMAIN
        list:

            CL_AFFINITY_DOMAIN_NUMA_EXT - Split the device into
            sub-devices comprised of compute units that share a NUMA band.
            CL_AFFINITY_DOMAIN_L4_CACHE_EXT - Split the device into
            sub-devices comprised of compute units that share a level 4
            data cache.

            CL_AFFINITY_DOMAIN_L3_CACHE_EXT - Split the device into
            sub-devices comprised of compute units that share a level 3
            data cache.

            CL_AFFINITY_DOMAIN_L2_CACHE_EXT - Split the device into
            sub-devices comprised of compute units that share a level 2
            data cache.

            CL_AFFINITY_DOMAIN_L1_CACHE_EXT - Split the device into
            sub-devices comprised of compute units that share a level 1
            data cache.

            CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT - Split the device
            along the next fissionable CL_AFFINITY_DOMAIN. The
            implementation shall find the first level along which the
            device or sub-device may be further subdivided in the order
            NUMA, L4, L3, L2, L1, and fission the device into sub-devices
            comprised of compute units that share memory sub-systems at
            this level. The user may determine what happened by calling
            clGetDeviceInfo(CL_DEVICE_PARTITION_STYLE_EXT) on the
            sub-devices.

            Example: To split a non-NUMA device along the outermost cache
            level (if any), pass:

                { CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT,
                  CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT,
                  CL_PROPERTIES_LIST_END_EXT }

        CL_PROPERTIES_LIST_END_EXT - A list terminator for a properties
        list.

    The following values may be returned by clCreateSubDevicesEXT:

    CL_SUCCESS - The command succeeded.

    CL_INVALID_VALUE - The properties key is unknown, or the indicated
    partition style (CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT,
    CL_DEVICE_PARTITION_EQUALLY_EXT, CL_DEVICE_PARTITION_BY_NAMES_EXT or
    CL_DEVICE_PARTITION_BY_COUNTS_EXT) is not supported for this device by
    the implementation. On an OpenCL 1.1 implementation, these cases
    return CL_INVALID_PROPERTY instead, to be consistent with
    clCreateContext behavior.

    CL_INVALID_VALUE - num_entries is zero and out_devices is not NULL, or
    both out_devices and num_devices are NULL.
    CL_DEVICE_PARTITION_FAILED_EXT - The indicated partition scheme is
    supported by the implementation, but the implementation can not
    further partition the device in this way. For example,
    CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT was requested, but all
    compute units in in_device share the same cache at the level
    requested.

    CL_INVALID_PARTITION_COUNT_EXT - The total number of compute units
    requested exceeds CL_DEVICE_MAX_COMPUTE_UNITS, or the number of
    compute units for any one sub-device is less than 1, or the number of
    sub-devices requested exceeds CL_DEVICE_MAX_COMPUTE_UNITS.

    CL_INVALID_PARTITION_NAME_EXT - A compute unit name appearing in a
    name list following CL_DEVICE_PARTITION_BY_NAMES_EXT is not in the
    range [-1, number of compute units - 1].

    CL_INVALID_DEVICE - The in_device is not a valid device, or the
    in_device is not a device in context.

Clarification of Behavior of Existing Functions with Sub-Devices
(Normative)

    In a style consistent with retain / release behavior of other OpenCL
    objects, if a new object is created that references a cl_device_id, or
    comes to reference it through an API such as clBuildProgram, the
    cl_device_id shall be prevented from being destroyed until such time
    as the new object no longer needs it.

Changes to Section 4.2 (Querying Devices)

    clGetDeviceIDs - Only root level devices shall be returned by this
    function. Devices returned by this function should not contain compute
    units that alias compute units on another device returned in the same
    function call. Compute units in root level devices shall not alias
    compute units in other root level devices.

    clGetDeviceInfo - If the device is a sub-device created by
    clCreateSubDevicesEXT, then the value returned for
    CL_DEVICE_MAX_COMPUTE_UNITS is the number of compute units in the
    sub-device.
    The CL_DEVICE_VENDOR_ID may be different from the parent device
    CL_DEVICE_VENDOR_ID, but should be the same for all devices and
    sub-devices that can share a binary executable, such as that returned
    from clGetProgramInfo(CL_PROGRAM_BINARIES). Other selectors such as
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE may optionally change value to better
    reflect the behavior of the sub-device in an implementation defined
    manner.

    The following selectors are added for clGetDeviceInfo:

        CL_DEVICE_PARENT_DEVICE_EXT - a selector to get the cl_device_id
        for the parent cl_device_id to which the sub-device belongs.
        (Sub-division can be multi-level.) If the device is a root level
        device, then it will return NULL.

        CL_DEVICE_PARTITION_TYPES_EXT - a selector to get a list of
        supported partition types for partitioning a device. The return
        type is an array of cl_device_partition_property_ext values drawn
        from the following list:

            CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT
            CL_DEVICE_PARTITION_BY_COUNTS_EXT
            CL_DEVICE_PARTITION_BY_NAMES_EXT
            CL_DEVICE_PARTITION_EQUALLY_EXT

        The implementation shall return at least one property from the
        above list. However, when a partition style is found within this
        list, the partition style is not required to work in every case.
        For example, a device might support partitioning by affinity
        domain, but not along NUMA domains.

        CL_DEVICE_AFFINITY_DOMAINS_EXT - a selector to get a list of
        supported affinity domains for partitioning the device using the
        CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT partition style. The
        return type is an array of cl_device_partition_property_ext
        values. The values shall come from the list:

            CL_AFFINITY_DOMAIN_L1_CACHE_EXT
            CL_AFFINITY_DOMAIN_L2_CACHE_EXT
            CL_AFFINITY_DOMAIN_L3_CACHE_EXT
            CL_AFFINITY_DOMAIN_L4_CACHE_EXT
            CL_AFFINITY_DOMAIN_NUMA_EXT

        If no partition style is supported then the size of the returned
        array is zero. Even though a device has a NUMA domain or a
        particular cache level, an implementation may elect not to provide
        fissioning at that level.
        CL_DEVICE_REFERENCE_COUNT_EXT - a selector to get the device
        reference count. The return type is cl_uint. If the device is a
        root level device, a reference count of 1 is returned.

        CL_DEVICE_PARTITION_STYLE_EXT - a selector to get the
        cl_device_partition_property_ext list used to create the
        sub-device. If the device is a root level device then a list
        consisting of { CL_PROPERTIES_LIST_END_EXT } is returned. If the
        property on device creation was
        (CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT,
        CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT) then
        CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT will be replaced by the
        symbol representing the actual CL_AFFINITY_DOMAIN used (e.g.
        CL_AFFINITY_DOMAIN_NUMA_EXT). The returned value is an array of
        cl_device_partition_property_ext. The length of the array is
        obtained from the size returned by the param_value_size_ret
        parameter to the function.

Changes to Section 4.3 (Contexts)

    clCreateContext - The new cl_context only owns a reference to the
    devices explicitly passed to it at creation. It does not reference
    sub-devices of those devices unless they were also passed in the
    device list at creation time.

    clCreateContextFromType - Only root level devices are used to create
    the context. The context does not reference sub-devices created from
    those devices.

    clGetContextInfo - Only the device list passed into clCreateContext is
    returned by CL_CONTEXT_DEVICES. Additional sub-devices that are usable
    in the context because their parent devices were used to create the
    context are not returned here.

Changes to Section 5.1 (Command Queues)

    clCreateCommandQueue - This call will return CL_INVALID_DEVICE if
    neither the device nor any of its ancestors was used to create the
    context. On success, the cl_command_queue that is created for a
    sub-device will attempt to execute any work enqueued on it on just the
    sub-device.

    clGetCommandQueueInfo - If the command queue was created with a
    sub-device, the sub-device is returned for the CL_QUEUE_DEVICE
    selector.
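
    The ancestor rule that clCreateCommandQueue applies above can be
    pictured as a walk up the parent chain. The following is a minimal
    sketch in plain C, not an implementation of any OpenCL runtime: the
    struct model_device type and the model_device_usable function are
    hypothetical names introduced only for illustration, and the parent
    pointer stands in for the value that
    clGetDeviceInfo(CL_DEVICE_PARENT_DEVICE_EXT) is described as returning
    (NULL for a root level device).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device handle. Only the parent link is
 * modeled; NULL marks a root level device. */
struct model_device {
    const struct model_device *parent;
};

/* Returns 1 if dev, or any ancestor of dev, appears in the list of
 * devices the context was created with. Returns 0 otherwise, which
 * corresponds to the CL_INVALID_DEVICE case in the text. */
int model_device_usable(const struct model_device *dev,
                        const struct model_device *const *context_devices,
                        size_t num_context_devices)
{
    assert(context_devices != NULL || num_context_devices == 0);

    /* Visit the device itself first, then each ancestor in turn. */
    for (const struct model_device *d = dev; d != NULL; d = d->parent) {
        for (size_t i = 0; i < num_context_devices; ++i) {
            if (context_devices[i] == d) {
                return 1;
            }
        }
    }
    return 0;
}
```

    Under this model a sub-device of a sub-device is usable in a context
    created from the root device, while the root device is not usable in a
    context created only from one of its sub-devices.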
Changes to Section 5.4.1 (Creating Program Objects)

    clCreateProgramWithBinary - This call will return CL_INVALID_DEVICE if
    neither a device nor any of its ancestors was used to create the
    program's context.

Changes to Section 5.4.2 (Building Program Executables)

    clBuildProgram - This call will return CL_INVALID_DEVICE if neither
    the device nor any of its ancestors was used to create the program's
    context. clBuildProgram must be called for each device or sub-device
    that will be used with the program. This function will return
    CL_INVALID_DEVICE if the device was not used to create the program and
    the program was created with clCreateProgramWithBinary.
    Implementations should eliminate redundant compilation where devices
    and sub-devices can share a common binary.

Changes to Section 5.4.5 (Program Object Queries)

    clGetProgramInfo - Behavior depends on how the program was created:

        clCreateProgramWithSource - CL_PROGRAM_DEVICES returns the union
        of the set of devices used to create the cl_context to which the
        program belongs and the set of devices for which clBuildProgram()
        has succeeded on this program.

        clCreateProgramWithBinary - CL_PROGRAM_DEVICES returns the union
        of the set of devices used to create the program and the set of
        devices for which clBuildProgram() has succeeded on this program.

    In each case, CL_PROGRAM_BINARIES and CL_PROGRAM_BINARY_SIZES return
    the same list in the same order, except that pointers to compiled
    binaries or their sizes, respectively, are returned instead. The user
    is responsible for preventing the obvious race condition here, in
    which clBuildProgram() succeeds on more devices between the times at
    which CL_PROGRAM_DEVICES, CL_PROGRAM_BINARIES and
    CL_PROGRAM_BINARY_SIZES are requested.
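
    The set union that CL_PROGRAM_DEVICES is described as returning can be
    illustrated with a small model in plain C. This is only a sketch under
    stated assumptions: model_device_id, model_add_unique and
    model_program_devices are hypothetical names, not part of the
    extension, and the output order (creation devices first, then newly
    built devices) is an assumption made for the example, since the text
    does not specify an order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_device_id; any distinct pointer works. */
typedef const void *model_device_id;

/* Appends dev to out[0..*count) unless it is already present.
 * Returns 1 if dev is in the list afterwards, 0 if out was full. */
static int model_add_unique(model_device_id *out, size_t *count,
                            size_t capacity, model_device_id dev)
{
    for (size_t i = 0; i < *count; ++i) {
        if (out[i] == dev) {
            return 1;
        }
    }
    if (*count == capacity) {
        return 0;
    }
    out[(*count)++] = dev;
    return 1;
}

/* Writes the union of the creation devices and the devices for which a
 * build has succeeded into out, and returns the number of entries
 * written. The result is truncated if out is too small. */
size_t model_program_devices(const model_device_id *creation, size_t n_creation,
                             const model_device_id *built, size_t n_built,
                             model_device_id *out, size_t capacity)
{
    size_t count = 0;
    assert(out != NULL || capacity == 0);

    for (size_t i = 0; i < n_creation; ++i) {
        if (!model_add_unique(out, &count, capacity, creation[i])) {
            return count;
        }
    }
    for (size_t i = 0; i < n_built; ++i) {
        if (!model_add_unique(out, &count, capacity, built[i])) {
            return count;
        }
    }
    return count;
}
```

    For a context created from a root device alone, a successful build on
    one of its sub-devices would grow the modeled list from one entry to
    two. That growth is why the device list and the binaries list need to
    be read without an intervening build.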