Name Strings cl_intel_exec_by_local_thread Contributors Yariv Aridor Evgeny Fiksman Doron Singer Contact Yariv.Aridor(at)Intel.Com IP Status N/A Version Version 2, August 27, 2015 Number OpenCL Extension #16 Status Complete. Extension Type OpenCL device extension Dependencies OpenCL 1.0 is required Overview This extension allows the user to execute OpenCL tasks and kernels with the user application's threads. The extension defines a token that can be passed to clCreateCommandQueue, creating a queue with the "thread local exec" capability. All enqueue APIs (e.g., clEnqueueRead) submitted to such a queue never enqueue commands. An Enqueue API call is executed by the caller host-thread itself without involving any of the OpenCL runtime threads, much like function calls. The queue would typically stay empty - the queue handler argument is used only for compatibility of the enqueue APIs (i.e., keeping the same API signatures) and checking, at runtime, whether the enqueue API needs to be executed in such a fashion. Enqueue API calls on a "local" command queue can still use event dependency lists and output events. A non-NULL event dependency list will block the caller application thread until all the corresponding events are completed. Output events will be accessible only after the return of the enqueue API call (as with regular command queues) and should always have "completed" status (as expected). They might be useful for querying event status and profiling data. "Local" command queues behave the same regardless if they are defined as in-order or out-of-order. However, with in-order queues, threads may be blocked until execution of previously enqueued commands (by other threads) is completed. Given an enqueue API with a blocking flag (e.g., clEnqueueReadBuffer), both its blocking non-blocking calls behave the same and have the exact same impact on the application. Calls to clFinish and clFlush, clEnqueueBarrier and clEnqueueMarker are valid although meaningless for these command queues. An optimal implementation of these APIs for Immediate command queues will incur the minimal overhead of a function call and possibly an if-then-else to distinguish between a local command queue and a regular one. clEnqueueNDRangeCommand and clEnqueueTask should have optimized implementations using a single execution thread. Header File cl_ext.h New Procedures and Functions None New Tokens CL_QUEUE_THREAD_LOCAL_EXEC_ENABLE_INTEL (((cl_bitfield)1) << 31) Additions to Chapter 5 of the OpenCL 1.1 (v45) Specification Add CL_QUEUE_THREAD_LOCAL_EXEC_INTEL as a new property of command queues into table 5.1 (chapter 5) In chapter 5.1, add the following error codes to clCreateCommandQueue: CL_INVALID_QUEUE_PROPERTIES if the target device doesn't support the corresponding execution mode. And in a dedicated section to thread local execution, also in chapter 5.1, page 55: A host thread which fails to immediately invoke an enqueue API call because of out of stack memory should return CL_OUT_OF_HOST_MEMORY for this enqueue API call. The same host code should work, as is, with and without immediate command queues, except for code changes due to I. Potential deadlocks. These are the same type of deadlocks which may occur with current blocking API calls in OpenCL 1.1 II. Any optimizations which take into account the immediate execution of commands e.g., writing to a mapped buffer immediately after a call to clEnqueueMapBuffer without a follow-up call to clFinish. Interactions with other extensions None Issues N/A Sample Code cl_device_id device_id; cl_context context; cl_platform_id platform = 0; cl_int err = clGetPlatformIDs(1, &platform, NULL); cl_context_properties prop[3] = { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 }; err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL); // create context context = clCreateContext(prop, 1, &device_id, NULL, NULL, &err); // create program with source cl_program program = clCreateProgramWithSource(context, 1, (const char**)&ocl_test_program, NULL, &err); err = clBuildProgram(program, NULL, NULL, NULL, NULL, NULL); cl_float src[IMMEDIATE_EXECUTION_GLOBAL_SIZE]; cl_float dst[IMMEDIATE_EXECUTION_GLOBAL_SIZE]; // ... Fill buffers // Create an in-order immediate queue cl_command_queue queue1 = clCreateCommandQueue (context, device_id, CL_QUEUE_THREAD_LOCAL_EXEC_ENABLE_INTEL, &err); // Create Kernel cl_kernel kernel1 = clCreateKernel(program, "copy", &err); // Create buffers size_t size = sizeof(cl_float); cl_mem buffer_src = clCreateBuffer(context, CL_MEM_READ_ONLY, size * IMMEDIATE_EXECUTION_GLOBAL_SIZE, NULL, &err); cl_mem buffer_dst = clCreateBuffer(context, CL_MEM_READ_WRITE, size * IMMEDIATE_EXECUTION_GLOBAL_SIZE, NULL, &err); // Set arguments err = clSetKernelArg(kernel1, 0, sizeof(cl_mem), &buffer_src); err = clSetKernelArg(kernel1, 1, sizeof(cl_mem), &buffer_dst); // Execute commands - Write buffers err = clEnqueueWriteBuffer (queue1, buffer_src, false, 0, size* IMMEDIATE_EXECUTION_GLOBAL_SIZE, src, 0, NULL, NULL); err = clEnqueueWriteBuffer (queue1, buffer_dst, false, 0, size* IMMEDIATE_EXECUTION_GLOBAL_SIZE, dst, 0, NULL, NULL); // Execute kernel size_t global_work_size[1] = { IMMEDIATE_EXECUTION_GLOBAL_SIZE }; size_t local_work_size[1] = { IMMEDIATE_EXECUTION_LOCAL_SIZE }; cl_event evt; err = clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL, global_work_size, local_work_size, 0, NULL, &evt); err = clEnqueueReadBuffer (queue1, buffer_dst, CL_TRUE, 0, size*IMMEDIATE_EXECUTION_GLOBAL_SIZE, dst, 0, NULL, NULL); // release OpenCL objects Conformance Tests N/A Revision History Version 1, March 28, 2012 - Initial revision