Loading... please wait.

Copyright (c) 2011-2023 The Khronos Group, Inc.

This Specification is protected by copyright laws and contains material proprietary to Khronos. Except as described by these terms, it or any components may not be reproduced, republished, distributed, transmitted, displayed, broadcast or otherwise exploited in any manner without the express prior written permission of Khronos. Khronos grants a conditional copyright license to use and reproduce the unmodified Specification for any purpose, without fee or royalty, EXCEPT no licenses to any patent, trademark or other intellectual property rights are granted under these terms.

Khronos makes no, and expressly disclaims any, representations or warranties, express or implied, regarding this Specification, including, without limitation: merchantability, fitness for a particular purpose, non-infringement of any intellectual property, correctness, accuracy, completeness, timeliness, and reliability. Under no circumstances will Khronos, or any of its Promoters, Contributors or Members, or their respective partners, officers, directors, employees, agents or representatives be liable for any damages, whether direct, indirect, special or consequential damages for lost revenues, lost profits, or otherwise, arising from or in connection with these materials.

This Specification has been created under the Khronos Intellectual Property Rights Policy, which is Attachment A of the Khronos Group Membership Agreement available at https://www.khronos.org/files/member_agreement.pdf, and which defines the terms 'Scope', 'Compliant Portion', and 'Necessary Patent Claims'. Parties desiring to implement the Specification and make use of Khronos trademarks in relation to that implementation, and receive reciprocal patent license protection under the Khronos Intellectual Property Rights Policy must become Adopters and confirm the implementation as conformant under the process defined by Khronos for this Specification; see https://www.khronos.org/adopters.

Some parts of this Specification are purely informative and so are EXCLUDED from the Scope of this Specification.

Where this Specification uses technical terminology, defined in the Glossary or otherwise, that refer to enabling technologies that are not expressly set forth in this Specification, those enabling technologies are EXCLUDED from the Scope of this Specification. For clarity, enabling technologies not disclosed with particularity in this Specification (e.g. semiconductor manufacturing technology, hardware architecture, processor architecture or microarchitecture, memory architecture, compiler technology, object oriented technology, basic operating system technology, compression technology, algorithms, and so on) are NOT to be considered expressly set forth; only those application program interfaces and data structures disclosed with particularity are included in the Scope of this Specification.

For purposes of the Khronos Intellectual Property Rights Policy as it relates to the definition of Necessary Patent Claims, all recommended or optional features, behaviors and functionality set forth in this Specification, if implemented, are considered to be included as Compliant Portions.

Where this Specification includes normative references to external documents, only the specifically identified sections of those external documents are INCLUDED in the Scope of this Specification. If not created by Khronos, those external documents may contain contributions from non-members of Khronos not covered by the Khronos Intellectual Property Rights Policy.

This document contains extensions which are not ratified by Khronos, and as such is not a ratified Specification, though it contains text from (and is a superset of) the ratified SYCL Specification. The ratified version of the SYCL Specification can be found at https://www.khronos.org/registry/SYCL .

Khronos and Vulkan are registered trademarks, and SPIR-V is a trademark of The Khronos Group Inc. OpenCL is a trademark of Apple Inc. and OpenGL is a registered trademarks of Hewlett Packard Enterprise, all used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.

1. Acknowledgements

Editors

  • Maria Rovatsou, Codeplay

  • Lee Howes, Qualcomm

  • Ronan Keryell, AMD (current)

Contributors

  • Eric Berdahl, Adobe

  • Shivani Gupta, Adobe

  • David Neto, Altera

  • Carlo Bertolli, AMD

  • Andrew Gozillon, AMD

  • Gauthier Harnisch, AMD

  • Ronan Keryell, AMD

  • Yiannis Papadopoulos, AMD

  • Brian Sumner, AMD

  • Lin-Ya Yu, AMD

  • Thomas Applencourt, Argonne National Laboratory

  • Hal Finkel, Argonne National Laboratory

  • Kevin Harms, Argonne National Laboratory

  • Nevin Liber, Argonne National Laboratory

  • Anastasia Stulova, ARM

  • Balázs Keszthelyi, Broadcom

  • Alexandra Crabb, Caster Communications

  • Stuart Adams, Codeplay

  • Verena Beckham, Codeplay

  • Aidan Belton, Codeplay

  • Gordon Brown, Codeplay

  • Morris Hafner, Codeplay

  • Alexander Johnston, Codeplay

  • Marios Katsigiannis, Codeplay

  • Paul Keir, Codeplay

  • Steffen Larsen, Codeplay

  • Victor Lomüller, Codeplay

  • Tomas Matheson, Codeplay

  • Duncan McBain, Codeplay

  • Nicolas Miller, Codeplay

  • Georgi Mirazchiyski, Codeplay

  • Ralph Potter, Codeplay

  • Ruyman Reyes, Codeplay

  • Andrew Richards, Codeplay

  • Maria Rovatsou, Codeplay

  • Panagiotis Stratis, Codeplay

  • Michael Wong, Codeplay

  • Peter Žužek, Codeplay

  • Matt Newport, EA

  • Rasool Maghareh, Huawei Technologies Co. Ltd.

  • Guansong Zhang, Huawei Technologies Co. Ltd.

  • Ruslan Arutyunyan, Intel

  • Alexey Bader, Intel

  • James Brodman, Intel

  • Ilya Burylov, Intel

  • Jessica Davies, Intel

  • Felipe de Azevedo Piovezan, Intel

  • Allen Hux, Intel

  • Michael Kinsner, Intel

  • Greg Lueck, Intel

  • John Pennycook, Intel

  • Roland Schulz, Intel

  • Sergey Semenov, Intel

  • Jason Sewall, Intel

  • James O’Riordon, Khronos

  • Jon Leech, Luna Princeps LLC

  • Kathleen Mattson, Miller & Mattson, LLC

  • Dave Miller, Miller & Mattson, LLC

  • Stéphanie Even, Mercedes-Benz Research and Development NA

  • Chris Gearing, Mobileye

  • Seiji Nishimura, NSITEXE, Inc.

  • Neil Trevett, NVIDIA

  • Lee Howes, Qualcomm

  • Chu-Cheow Lim, Qualcomm

  • Jack Liu, Qualcomm

  • Hongqiang Wang, Qualcomm

  • Ruihao Zhang, Qualcomm

  • Dave Airlie, Red Hat

  • Hyesun Hong, Samsung Electronics

  • Aksel Alpay, Self

  • Dániel Berényi, Self

  • Máté Nagy-Egri, Stream HPC

  • Bálint Soproni, Stream HPC

  • Tom Deakin, University of Bristol

  • Philip Salzmann, University of Innsbruck

  • Peter Thoman, University of Innsbruck

  • Biagio Cosenza, University of Salerno

  • Paul Preney, University of Windsor

2. Introduction

SYCL (pronounced “sickle”) is a royalty-free, cross-platform abstraction C++ programming model for heterogeneous computing. SYCL builds on the underlying concepts, portability and efficiency of parallel API or standards like OpenCL while adding much of the ease of use and flexibility of single-source C++.

Developers using SYCL are able to write standard modern C++ code, with many of the techniques they are accustomed to, such as inheritance and templates. At the same time, developers have access to the full range of capabilities of the underlying implementation (such as OpenCL) both through the features of the SYCL libraries and, where necessary, through interoperation with code written directly using the underneath implementation, via their APIs.

To reduce programming effort and increase the flexibility with which developers can write code, SYCL extends the concepts found in standards like OpenCL model in a few ways beyond the general use of C++ features:

  • execution of parallel kernels on a heterogeneous device is made simultaneously convenient and flexible. Common parallel patterns are prioritized with simple syntax, which through a series C++ types allow the programmer to express additional requirements, such as synchronization, if needed;

  • when using buffers and accessors, data access in SYCL is separated from data storage. By relying on the C++-style resource acquisition is initialization (RAII) idiom to capture data dependencies between device code blocks, the runtime library can track data movement and provide correct behavior without the complexity of manually managing event dependencies between kernel instances and without the programmer having to explicitly move data. This approach enables the data-parallel task-graphs that might be already part of the execution model to be built up easily and safely by SYCL programmers;

  • Unified Shared Memory (USM) provides a mechanism for explicit data allocation and movement. This approach enables the use of pointer-based algorithms and data structures on heterogeneous devices, and allows for increased re-use of code across host and device;

  • the hierarchical parallelism syntax offers a way of expressing data parallelism similar to the OpenCL device or OpenMP target device execution model in an easy-to-understand modern C++ form. It more cleanly layers parallel loops and synchronization points to avoid fragmentation of code and to more efficiently map to CPU-style architectures.

SYCL retains the execution model, runtime feature set and device capabilities inspired by the OpenCL standard. This standard imposes some limitations on the full range of C++ features that SYCL is able to support. This ensures portability of device code across as wide a range of devices as possible. As a result, while the code can be written in standard C++ syntax with interoperability with standard C++ programs, the entire set of C++ features is not available in SYCL device code. In particular, SYCL device code, as defined by this specification, does not support virtual function calls, function pointers in general, exceptions, runtime type information or the full set of C++ libraries that may depend on these features or on features of a particular host compiler. Nevertheless, these basic restrictions can be relieved by some specific Khronos or vendor extensions.

SYCL implements an SMCP design which offers the power of source integration while allowing toolchains to remain flexible. The SMCP design supports embedding of code intended to be compiled for a device, for example a GPU, inline with host code. This embedding of code offers three primary benefits:

Simplicity

For novice programmers using frameworks like OpenCL, the separation of host and device source code in OpenCL can become complicated to deal with, particularly when similar kernel code is used for multiple different operations on different data types. A single compiler flow and integrated tool chain combined with libraries that perform a lot of simple tasks simplifies initial OpenCL programs to a minimum complexity. This reduces the learning curve for programmers new to heterogeneous programming and allows them to concentrate on parallelization techniques rather than syntax.

Reuse

C++'s type system allows for complex interactions between different code units and supports efficient abstract interface design and reuse of library code. For example, a transform or map operation applied to an array of data may allow specialization on both the operation applied to each element of the array and on the type of the data. The SMCP design of SYCL enables this interaction to bridge the host code/device code boundary such that the device code to be specialized on both of these factors directly from the host code.

Efficiency

Tight integration with the type system and reuse of library code enables a compiler to perform inlining of code and to produce efficient specialized device code based on decisions made in the host code without having to generate kernel source strings dynamically.

The use of C++ features such as generic programming, templated code, functional programming and inheritance on top of existing heterogeneous execution model opens a wide scope for innovation in software design for heterogeneous systems. Clean integration of device and host code within a single C++ type system enables the development of modern, templated generic and adaptable libraries that build simple, yet efficient, interfaces to offer more developers access to heterogeneous computing capabilities and devices. SYCL is intended to serve as a foundation for innovation in programming models for heterogeneous systems, that builds on open and widely implemented standard foundation like OpenCL or Vulkan.

SYCL is designed to be as close to standard C++ as possible. In practice, this means that as long as no dependence is created on SYCL’s integration with the underlying implementation, a standard C++ compiler can compile SYCL programs and they will run correctly on a host CPU. Any use of specialized low-level features can be masked using the C preprocessor in the same way that compiler-specific intrinsics may be hidden to ensure portability between different host compilers.

SYCL is designed to allow a compilation flow where the source file is passed through multiple different compilers, including a standard C++ host compiler of the developer’s choice, and where the resulting application combines the results of these compilation passes. This is distinct from a single-source flow that might use language extensions that preclude the use of a standard host compiler. The SYCL standard does not preclude the use of a single compiler flow, but is designed to not require it. SYCL can also be implemented purely as a library, in which case no special compiler support is required at all.

The advantages of this design are two-fold. First, it offers better integration with existing tool chains. An application that already builds using a chosen compiler can continue to do so when SYCL code is added. Using the SYCL tools on a source file within a project will both compile for a device and let the same source file be compiled using the same host compiler that the rest of the project is compiled with. Linking and library relationships are unaffected. This design simplifies porting of pre-existing applications to SYCL. Second, the design allows the optimal compiler to be chosen for each device where different vendors may provide optimized tool-chains.

To summarize, SYCL enables computational kernels to be written inside C++ source files as normal C++ code, leading to the concept of “single-source” programming. This means that software developers can develop and use generic algorithms and data structures using standard C++ template techniques, while still supporting multi-platform, multi-device heterogeneous execution. Access to the low level APIs of an underlying implementation (such as OpenCL) is also supported. The specification has been designed to enable implementation across as wide a variety of platforms as possible as well as ease of integration with other platform-specific technologies, thereby letting both users and implementers build on top of SYCL as an open platform for system-wide heterogeneous processing innovation.

3. SYCL architecture

This chapter describes the structure of a SYCL application, and how the SYCL generic programming model lays out on top of a number of SYCL backends.

3.1. Overview

SYCL is an open industry standard for programming a heterogeneous system. The design of SYCL allows standard C++ source code to be written such that it can run on either an heterogeneous device or on the host.

The terminology used for SYCL inherits historically from OpenCL with some SYCL-specific additions. However SYCL is a generic C++ programming model that can be laid out on top of other heterogeneous APIs apart from OpenCL. SYCL implementations can provide SYCL backends for various heterogeneous APIs, implementing the SYCL general specification on top of them. We refer to this heterogeneous API as the SYCL backend API. The SYCL general specification defines the behavior that all SYCL implementations must expose to SYCL users for a SYCL application to behave as expected.

A function object that can execute on a device exposed by a SYCL backend API is called a SYCL kernel function.

To ensure maximum interoperability with different SYCL backend APIs, software developers can access the SYCL backend API alongside the SYCL general API whenever they include the SYCL backend interoperability headers. However, interoperability is a SYCL backend-specific feature. An application that uses interoperability does not conform to the SYCL general application model, since it is not portable across backends.

The target users of SYCL are C++ programmers who want all the performance and portability features of a standard like OpenCL, but with the flexibility to use higher-level C++ abstractions across the host/device code boundary. Developers can use most of the abstraction features of C++, such as templates, classes and operator overloading.

However, some C++ language features are not permitted inside kernels, due to the limitations imposed by the capabilities of the underlying heterogeneous platforms. These features include virtual functions, virtual inheritance, throwing/catching exceptions, and run-time type-information. These features are available outside kernels as normal. Within these constraints, developers can use abstractions defined by SYCL, or they can develop their own on top. These capabilities make SYCL ideal for library developers, middleware providers and application developers who want to separate low-level highly-tuned algorithms or data structures that work on heterogeneous systems from higher-level software development. Software developers can produce templated algorithms that are easily usable by developers in other fields.

3.2. Anatomy of a SYCL application

Below is an example of a typical SYCL application which schedules a job to run in parallel on any heterogeneous device available.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
#include <iostream>
#include <sycl/sycl.hpp>
using namespace sycl; // (optional) avoids need for "sycl::" before SYCL names

int main() {
  int data[1024]; // Allocate data to be worked on

  // Create a default queue to enqueue work to the default device
  queue myQueue;

  // By wrapping all the SYCL work in a {} block, we ensure
  // all SYCL tasks must complete before exiting the block,
  // because the destructor of resultBuf will wait
  {
    // Wrap our data variable in a buffer
    buffer<int, 1> resultBuf { data, range<1> { 1024 } };

    // Create a command group to issue commands to the queue
    myQueue.submit([&](handler& cgh) {
      // Request write access to the buffer without initialization
      accessor writeResult { resultBuf, cgh, write_only, no_init };

      // Enqueue a parallel_for task with 1024 work-items
      cgh.parallel_for(1024, [=](id<1> idx) {
        // Initialize each buffer element with its own rank number starting at 0
        writeResult[idx] = idx;
      }); // End of the kernel function
    });   // End of our commands for this queue
  }       // End of scope, so we wait for work producing resultBuf to complete

  // Print result
  for (int i = 0; i < 1024; i++)
    std::cout << "data[" << i << "] = " << data[i] << std::endl;

  return 0;
}

At line 1, we #include the SYCL header files, which provide all of the SYCL features that will be used.

A SYCL application runs on a SYCL Platform. The application is structured in three scopes which specify the different sections; application scope, command group scope and kernel scope. The kernel scope specifies a single kernel function that will be, or has been, compiled by a device compiler and executed on a device. In this example kernel scope is defined by lines 25 to 26. The command group scope specifies a unit of work which is comprised of a SYCL kernel function and accessors. In this example command group scope is defined by lines 20 to 28. The application scope specifies all other code outside of a command group scope. These three scopes are used to control the application flow and the construction and lifetimes of the various objects used within SYCL, as explained in Section 3.9.12.

A SYCL kernel function is the scoped block of code that will be compiled using a device compiler. This code may be defined by the body of a lambda function or by the operator() function of a function object. Each instance of the SYCL kernel function will be executed as a single, though not necessarily entirely independent, flow of execution and has to adhere to restrictions on what operations may be allowed to enable device compilers to safely compile it to a range of underlying devices.

The parallel_for member function can be templated with a class. This class is used to manually name the kernel when desired, such as to avoid a compiler-generated name when debugging a kernel defined through a lambda, to provide a known name with which to apply build options to a kernel, or to ensure compatibility with multiple compiler-pass implementations.

The parallel_for member function creates an instance of a kernel, which is the entity that will be enqueued within a command group. In the case of parallel_for the SYCL kernel function will be executed over the given range from 0 to 1023. The different member functions to execute kernels can be found in Section 4.9.4.2.

A command group scope is the syntactic scope wrapped by the construction of a command group function object as seen on line 19. The command group function object may invoke only a single SYCL kernel function, and it takes a parameter of type command group handler, which is constructed by the runtime.

All the requirements for a kernel to execute are defined in this command group scope, as described in Section 3.7.1. In this case the constructor used for myQueue on line 9 is the default constructor, which allows the queue to select the best underlying device to execute on, leaving the decision up to the runtime.

In SYCL, data that is required within a SYCL kernel function must be contained within a buffer, image, or USM allocation, as described in Section 3.8. We construct a buffer on line 16. Access to the buffer is controlled via an accessor which is constructed on line 21. The buffer is used to keep track of access to the data and the accessor is used to request access to the data on a queue, as well as to track the dependencies between SYCL kernel function. In this example the accessor is used to write to the data buffer on line 26.

3.3. Normative references

The documents in the following list are referred to within this SYCL specification, and their content is a requirement for this document.

  1. C++17: ISO/IEC 14882:2017 Clauses 1-19, referred to in this specification as the C++ core language. The SYCL specification refers to language in the following C++ defect reports and assumes a compiler that implements them: DR2325.

  2. C++20: ISO/IEC 14882:2020 Programming languages — C++, referred to in this specification as the next C++ specification.

3.4. Non-normative notes and examples

Unless stated otherwise, text within this SYCL specification is normative and defines the required behavior of a SYCL implementation. Non-normative / informational notes are included within this specification using a “note” callout, of the form:

Information within a note callout, such as this text, is for informational purposes and does not impose requirements on or specify behavior of a SYCL implementation.

Source code examples within the specification are provided to aid with understanding, and are non-normative.

In case of any conflict between a non-normative note or source example, and normative text within the specification, the normative text must be taken to be correct.

3.5. The SYCL platform model

The SYCL platform model is based on the OpenCL platform model. The model consists of a host connected to one or more heterogeneous devices, called devices.

A SYCL context is constructed, either directly by the user or implicitly when creating a queue, to hold all the runtime information required by the SYCL runtime and the SYCL backend to operate on a device, or group of devices. When a group of devices can be grouped together on the same context, they have some visibility of each other’s memory objects. The SYCL runtime can assume that memory is visible across all devices in the same context. Not all devices exposed from the same platform can be grouped together in the same context.

A SYCL application executes on the host as a standard C++ program. Devices are exposed through different SYCL backends to the SYCL application. The SYCL application submits command group function objects to queues. Each queue enables execution on a given device.

The SYCL runtime then extracts operations from the command group function object, e.g. an explicit copy operation or a SYCL kernel function. When the operation is a SYCL kernel function, the SYCL runtime uses a SYCL backend-specific mechanism to extract the device binary from the SYCL application and pass it to the heterogeneous API for execution on the device.

A SYCL device is divided into one or more compute units (CUs) which are each divided into one or more processing elements (PEs). Computations on a device occur within the processing elements. How computation is mapped to PEs is SYCL backend and device specific. Two devices exposed via two different backends can map computations differently to the same device.

When a SYCL application contains SYCL kernel function objects, the SYCL implementation must provide an offline compilation mechanism that enables the integration of the device binaries into the SYCL application. The output of the offline compiler can be an intermediate representation, such as SPIR-V, that will be finalized during execution or a final device ISA.

A device may expose special purpose functionality as a built-in function. The SYCL API exposes functions to query and dispatch said built-in functions. Some SYCL backends and devices may not support programmable kernels, and only support built-in functions.

3.6. The SYCL backend model

SYCL is a generic programming model for the C++ language that can target multiple heterogeneous APIs, such as OpenCL.

SYCL implementations enable these target APIs by implementing SYCL backends. For a SYCL implementation to be conformant on said SYCL backend, it must execute the SYCL generic programming model on the backend. All SYCL implementations must provide at least one backend.

The present document covers the SYCL generic interface available to all SYCL backends. How the SYCL generic interface maps to a particular SYCL backend is defined either by a separate SYCL backend specification document, provided by the Khronos SYCL group, or by the SYCL implementation documentation. Whenever there is a SYCL backend specification document, this takes precedence over SYCL implementation documentation.

When a SYCL user builds their SYCL application, she decides which of the SYCL backends will be used to build the SYCL application. This is called the set of active backends. Implementations must ensure that the active backends selected by the user can be used simultaneously by the SYCL implementation at runtime. If two backends are available at compile time but will produce an invalid SYCL application at runtime, the SYCL implementation must emit a compilation error.

A SYCL application built with a number of active backends does not necessarily guarantee that said backends can be executed at runtime. The subset of active backends available at runtime is called available backends. A backend is said to be available if the host platform where the SYCL application is executed exposes support for the heterogeneous API required for the SYCL backend.

It is implementation dependent whether certain backends require third-party libraries to be available in the system. Failure to have all dependencies required for all active backends at runtime will cause the SYCL application to not run.

Once the application is running, users can query what SYCL platforms are available. SYCL implementations will expose the devices provided by each backend grouped into platforms. A backend must expose at least one platform.

Under the SYCL backend model, SYCL objects can contain one or multiple references to a certain SYCL backend native type. Not all SYCL objects will map directly to a SYCL backend native type. The mapping of SYCL objects to SYCL backend native types is defined by the SYCL backend specification document when available, or by the SYCL implementation otherwise.

To guarantee that multiple SYCL backend objects can interoperate with each other, SYCL memory objects are not bound to a particular SYCL backend. SYCL memory objects can be accessed from any device exposed by an available backend. SYCL Implementations can potentially map SYCL memory objects to multiple native types in different SYCL backends.

Since SYCL memory objects are independent of any particular SYCL backend, SYCL command groups can request access to memory objects allocated by any SYCL backend, and execute it on the backend associated with the queue. This requires the SYCL implementation to be able to transfer memory objects across SYCL backends.

USM allocations are subject to the limitations described in Section 4.8.

When a SYCL application runs on any number of SYCL backends without relying on any SYCL backend-specific behavior or interoperability, it is said to be a SYCL general application, and it is expected to run in any SYCL-conformant implementation that supports the required features for the application.

3.6.1. Platform mixed version support

The SYCL generic programming model exposes a number of platforms, each of them exposing a number of devices. Each platform is bound to a certain SYCL backend. SYCL devices associated with said platform are associated with that SYCL backend.

Although the APIs in the SYCL generic programming model are defined according to this specification and their version is indicated by the macro SYCL_LANGUAGE_VERSION, this does not apply to APIs exposed by the SYCL backends. Each SYCL backend provides its own document that defines its APIs, and that document tells how to query for the device and platform versions.

3.7. SYCL execution model

As described in Section 3.2, a SYCL application is comprised of three scopes: application scope, command group scope, and kernel scope. Code in the application scope and command group scope runs on the host and is governed by the SYCL application execution model. Code in the kernel scope runs on a device and is governed by the SYCL kernel execution model.

A SYCL device does not necessarily correspond to a physical accelerator. A SYCL implementation may choose to expose some or all of the host’s resources as a SYCL device; such an implementation would execute code in kernel scope on the host, but that code would still be governed by the SYCL kernel execution model.

3.7.1. SYCL application execution model

The SYCL application defines the execution order of the kernels by grouping each kernel with its requirements into a command group function object. Command group function objects are submitted for execution via a queue object, which defines the device where the kernel will run. This specification sometimes refers to this as “submitting the kernel to a device”. The same command group object can be submitted to different queues. When a command group is submitted to a SYCL queue, the requirements of the kernel execution are captured. The implementation can start executing a kernel as soon as its requirements have been satisfied.

3.7.1.1. SYCL backend resources managed by the SYCL application

The SYCL runtime integrated with the SYCL application will manage the resources required by the SYCL backend API to manage the heterogeneous devices it is providing access to. This includes, but is not limited to, resource handlers, memory pools, dispatch queues and other temporary handler objects.

The SYCL programming interface represents the lifetime of the resources managed by the SYCL application using RAII rules. Construction of a SYCL object will typically entail the creation of multiple SYCL backend objects, which will be properly released on destruction of said SYCL object. The overall rules for construction and destruction are detailed in Chapter 4. Those SYCL backends with a SYCL backend document will detail how the resource management from SYCL objects map down to the SYCL backend objects.

In SYCL, the minimum required object for submitting work to devices is the queue, which contains references to a platform, device and a context internally.

The resources managed by SYCL are:

  1. Platforms: all features of SYCL backend APIs are implemented by platforms. A platform can be viewed as a given vendor’s runtime and the devices accessible through it. Some devices will only be accessible to one vendor’s runtime and hence multiple platforms may be present. SYCL manages the different platforms for the user which are accessible through a sycl::platform object.

  2. Contexts: any SYCL backend resource that is acquired by the user is attached to a context. A context contains a collection of devices that the host can use and manages memory objects that can be shared between the devices. Devices belonging to the same context must be able to access each other’s global memory using some implementation-specific mechanism. A given context can only wrap devices owned by a single platform. A context is exposed to the user with a sycl::context object.

  3. Devices: platforms provide one or more devices for executing SYCL kernels. In SYCL, a device is accessible through a sycl::device object.

  4. Kernels: the SYCL functions that run on SYCL devices are defined as C++ function objects (a named function object type or a lambda function). A kernel can be introspected through a sycl::kernel object.

    Note that some SYCL backends may expose non-programmable functionality as pre-defined kernels.

  5. Kernel bundles: Kernels are stored internally in the SYCL application as device images, and these device images can be grouped into a sycl::kernel_bundle object. These objects provide a way for the application to control the online compilation of kernels for devices.

  6. Queues: SYCL kernels execute in command queues. The user must create a sycl::queue object, which references an associated context, platform and device. The context, platform and device may be chosen automatically, or specified by the user. SYCL queues execute kernels on a particular device of a particular context, but can have dependencies from any device on any available SYCL backend.

The SYCL implementation guarantees the correct initialization and destruction of any resource handled by the underlying SYCL backend API, except for those the user has obtained manually via the SYCL interoperability API.

3.7.1.2. SYCL command groups and execution order

By default, SYCL queues execute kernel functions in an out-of-order fashion based on dependency information. Developers only need to specify what data is required to execute a particular kernel. The SYCL runtime will guarantee that kernels are executed in an order that guarantees correctness. By specifying access modes and types of memory, a directed acyclic dependency graph (DAG) of kernels is built at runtime. This is achieved via the usage of command group objects. A SYCL command group object defines a set of requisites (R) and a kernel function (k). A command group is submitted to a queue when using the sycl::queue::submit member function.

A requisite (ri) is a requirement that must be fulfilled for a kernel-function (k) to be executed on a particular device. For example, a requirement may be that certain data is available on a device, or that another command group has finished execution. An implementation may evaluate the requirements of a command group at any point after it has been submitted. The processing of a command group is the process by which a SYCL runtime evaluates all the requirements in a given R. The SYCL runtime will execute k only when all ri are satisfied (i.e., when all requirements are satisfied). To simplify the notation, in the specification we refer to the set of requirements of a command group named foo as CGfoo = r1, …, rn.

The evaluation of a requisite (Satisfied(ri)) returns the status of the requisite, which can be True or False. A satisfied requisite implies the requirement is met. Satisfied(ri) never alters the requisite, only observes the current status. The implementation may not block to check the requisite, and the same check can be performed multiple times.

An action (ai) is a collection of implementation-defined operations that must be performed in order to satisfy a requisite. The set of actions for a given command group A is permitted to be empty if no operation is required to satisfy the requirement. The notation ai represents the action required to satisfy ri. Actions of different requisites can be satisfied in any order with respect to each other without side effects (i.e., given two requirements rj and rk, (rj, rk)(rk, rj)). The intersection of two actions is not necessarily empty. Actions can include (but are not limited to): memory copy operations, mapping operations, host side synchronization, or implementation-specific behavior.

Finally, Performing an action (Perform(ai)) executes the action operations required to satisfy the requisite rj. Note that, after Perform(ai), the evaluation Satisfied(rj) will return True until the kernel is executed. After the kernel execution, it is not defined whether a different command group with the same requirements needs to perform the action again, where actions of different requisites inside the same command group object can be satisfied in any order with respect to each other without side effects: Given two requirements rj and rk, Perform(aj) followed by Perform(ak) is equivalent to Perform(ak) followed by Perform(aj).

The requirements of different command groups submitted to the same or different queues are evaluated in the relative order of submission. command group objects whose intersection of requirement sets is not empty are said to depend on each other. They are executed in order of submission to the queue. If command groups are submitted to different queues or by multiple threads, the order of execution is determined by the SYCL runtime. Note that independent command group objects can be submitted simultaneously without affecting dependencies.

Table 1 illustrates the execution order of three command group objects (CGa,CGb,CGc) with certain requirements submitted to the same queue. Both CGa and CGb only have one requirement, r1 and r2 respectively. CGc requires both r1 and r2. This enables the SYCL runtime to potentially execute CGa and CGb simultaneously, whereas CGc cannot be executed until both CGa and CGb have been completed. The SYCL runtime evaluates the requisites and performs the actions required (if any) for the CGa and CGb. When evaluating the requisites of CGc, they will be satisfied once the CGa and CGb have finished.

Table 1. Execution order of three command groups submitted to the same queue
SYCL Application Enqueue Order SYCL Kernel Execution Order
sycl::queue syclQueue;
syclQueue.submit(CGa(r1));
syclQueue.submit(CGb(r2));
syclQueue.submit(CGc(r1,r2));
three cg one queue

Table 2 uses three separate SYCL queue objects to submit the same command group objects as before. Regardless of using three different queues, the execution order of the different command group objects is the same. When different threads enqueue to different queues, the execution order of the command group will be the order in which the submit member functions are executed. In this case, since the different command group objects execute on different devices, the actions required to satisfy the requirements may be different (e.g, the SYCL runtime may need to copy data to a different device in a separate context).

Table 2. Execution order of three command groups submitted to the different queues
SYCL Application Enqueue Order SYCL Kernel Execution Order
sycl::queue syclQueue1;
sycl::queue syclQueue2;
sycl::queue syclQueue3;
syclQueue1.submit(CGa(r1));
syclQueue2.submit(CGb(r2));
syclQueue3.submit(CGc(r1,r2));
three cg three queue
3.7.1.3. Controlling execution order with events

Submitting an action for execution returns an event object. Programmers may use these events to explicitly synchronize programs. Host code can wait for an event to complete, which will block execution on the host until the action represented by the event has completed. The event class is described in greater detail in Section 4.6.6.

Events may also be used to explicitly order the execution of kernels. Host code may wait for the completion of specific event, which blocks execution on the host until that event’s action has completed. Events may also define requisites between command groups. Using events in this manner informs the runtime that one or more command groups must complete before another command group may begin executing. See Section 4.9.4.1 for greater detail.

3.7.2. SYCL kernel execution model

When a kernel is submitted for execution, an index space is defined. An instance of the kernel body executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, which provides a global id for the work-item. Each work-item executes the same code but the specific execution pathway through the code and the data operated upon can vary by using the work-item global id to specialize the computation.

An index space of size zero is allowed. All aspects of kernel execution proceed as normal with the exception that the kernel function itself is not executed. Note this means the command queue will still schedule this kernel after satisfying the requirements and this satisfies requirements of any dependent enqueued kernels.

3.7.2.1. Basic kernels

SYCL allows a simple execution model in which a kernel is invoked over an N-dimensional index space defined by range<N>, where N is one, two or three. Each work-item in such a kernel executes independently.

Each work-item is identified by a value of type item<N>. The type item<N> encapsulates a work-item identifier of type id<N> and a range<N> representing the number of work-items executing the kernel.

3.7.2.2. ND-range kernels

Work-items can be organized into work-groups, providing a more coarse-grained decomposition of the index space. Each work-group is assigned a unique work-group id with the same dimensionality as the index space used for the work-items. Work-items are each assigned a local id, unique within the work-group, so that a single work-item can be uniquely identified by its global id or by a combination of its local id and work-group id. The work-items in a given work-group execute on the processing elements of a single compute unit.

When work-groups are used in SYCL, the index space is called an nd-range. An ND-range is an N-dimensional index space, where N is one, two or three. In SYCL, the ND-range is represented via the nd_range<N> class. An nd_range<N> is made up of a global range and a local range, each represented via values of type range<N>. Additionally, there can be a global offset, represented via a value of type id<N>; this is deprecated in SYCL 2020. The types range<N> and id<N> are each N-element arrays of integers. The iteration space defined via an nd_range<N> is an N-dimensional index space starting at the ND-range’s global offset whose size is its global range, split into work-groups of the size of its local range.

Each work-item in the ND-range is identified by a value of type nd_item<N>. The type nd_item<N> encapsulates a global id, local id and work-group id, all of type id<N> (the iteration space offset also of type id<N>, but this is deprecated in SYCL 2020), as well as global and local ranges and synchronization operations necessary to make work-groups useful. Work-groups are assigned ids using a similar approach to that used for work-item global ids. Work-items are assigned to a work-group and given a local id with components in the range from zero to the size of the work-group in that dimension minus one. Hence, the combination of a work-group id and the local id within a work-group uniquely defines a work-item.

3.7.2.3. Backend-specific kernels

SYCL allows a SYCL backend to expose fixed functionality as non-programmable built-in kernels. The availability and behavior of these built-in kernels are SYCL backend-specific, and are not required to follow the SYCL execution and memory models. Furthermore the interface exposed utilize these built-in kernels is also SYCL backend-specific. See the relevant backend specification for details.

3.8. Memory model

Since SYCL is a single-source programming model, the memory model affects both the application and the device kernel parts of a program. On the SYCL application, the SYCL runtime will make sure data is available for execution of the kernels. On the SYCL device kernel, the SYCL backend rules describing how the memory behaves on a specific device are mapped to SYCL C++ constructs. Thus it is possible to program kernels efficiently in pure C++.

3.8.1. SYCL application memory model

The application running on the host uses SYCL buffer objects using instances of the sycl::buffer class or USM allocation functions to allocate memory in the global address space, or can allocate specialized image memory using the sycl::unsampled_image and sycl::sampled_image classes.

In the SYCL application, memory objects are bound to all devices in which they are used, regardless of the SYCL context where they reside. SYCL memory objects (namely, buffer and image objects) can encapsulate multiple underlying SYCL backend memory objects together with multiple host memory allocations to enable the same object to be shared between devices in different contexts, platforms or backends. USM allocations uniquely identify a memory allocation and are bound to a SYCL context. They are only valid on the backend used by the context.

The order of execution of command group objects ensures a sequentially consistent access to the memory from the different devices to the memory objects. Accessing a USM allocation does not alter the order of execution. Users must explicitly inform the SYCL runtime of any requirements necessary for a legal execution.

To access a memory object, the user must create an accessor object which parameterizes the type of access to the memory object that a kernel or the host requires. The accessor object defines a requirement to access a memory object, and this requirement is defined by construction of an accessor, regardless of whether there are any uses in a kernel or by the host. An accessor object specifies whether the access is via global memory, constant memory or image samplers and their associated access functions. The accessor also specifies whether the access is read-only (RO), write-only (WO) or read-write (RW). An optional no_init property can be added to an accessor to tell the system to discard any previous contents of the data the accessor refers to, so there are two additional requirement types: no-init-write-only (NWO) and no-init-read-write (NRW). For simplicity, when a requisite represents an accessor object in a certain access mode, we represent it as MemoryObjectAccessMode. For example, an accessor that accesses memory object buf1 in RW mode is represented as buf1RW. A command group object that uses such an accessor is represented as CG(buf1RW). The action required to satisfy a requisite and the location of the latest copy of a memory object will vary depending on the implementation.

Table 3 illustrates an example where command group objects are enqueued to two separate SYCL queues executing in devices in different contexts. The requisites for the command group execution are the same, but the actions to satisfy them are different. For example, if the data is on the host before execution, A(b1RW) and A(b2RW) can potentially be implemented as copy operations from the host memory to context1 or context2 respectively. After CGa and CGb are executed, A'(b1RW) will likely be an empty operation, since the result of the kernel can stay on the device. On the other hand, the results of CGb are now on a different context than CGc is executing, therefore A'(b2RW) will need to copy data across two separate contexts using an implementation specific mechanism.

Table 3. Actions performed when three command groups are submitted to two distinct queues
SYCL Application Enqueue Order SYCL Kernel Execution Order
sycl::queue q1(context1);
sycl::queue q2(context2);
q1.submit(CGa(b1RW));
q2.submit(CGb(b2RW));
q1.submit(CGc(b1RW,b2RW));
device to device1

Possible implementation by a SYCL Runtime

device to device2

Table 3 shows actions performed when three command groups are submitted to two distinct queues, and potential implementation in an OpenCL SYCL backend by a SYCL runtime. Note that in this example, each SYCL buffer (b2,b2) is implemented as separate cl_mem objects per context.

Note that the order of the definition of the accessors within the command group is irrelevant to the requirements they define. All accessors always apply to the entire command group object where they are defined.

When multiple accessors in the same command group define different requisites to the same memory object these requisites must be resolved.

Firstly, any requisites with different access modes but the same access target are resolved into a single requisite with the union of the different access modes according to Table 4. The atomic access mode acts as if it was read-write (RW) when determining the combined requirement. The rules in Table 4 are commutative and associative.

Table 4. Combined requirement from two different accessor access modes within the same command group. The rules are commutative and associative
One access mode Other access mode Combined requirement

read (RO)

write (WO)

read-write (RW)

read (RO)

read-write (RW)

read-write (RW)

write (WO)

read-write (RW)

read-write (RW)

no-init-write (NWO)

no-init-read-write (NRW)

no-init-read-write (NRW)

no-init-write (NWO)

write (WO)

write (WO)

no-init-write (NWO)

read (RO)

read-write (RW)

no-init-write (NWO)

read-write (RW)

read-write (RW)

no-init-read-write (NRW)

write (WO)

read-write (RW)

no-init-read-write (NRW)

read (RO)

read-write (RW)

no-init-read-write (NRW)

read-write (RW)

read-write (RW)

The result of this should be that there should not be any requisites with the same access target.

Secondly, the remaining requisites must adhere to the following rule. Only one of the requisites may have write access (W or RW), otherwise the SYCL runtime must throw an exception. All requisites create a requirement for the data they represent to be made available in the specified access target, however only the requisite with write access determines the side effects of the command group, i.e. only the data which that requisite represents will be updated.

For example:

  • CG(b1GRW, b1HR) is permitted.

  • CG(b1GRW, b1HRW) is not permitted.

  • CG(b1GW, b1CRW) is not permitted.

Where G and C correspond to a target::device and target::constant_buffer accessor and H corresponds to a host accessor.

A buffer created from a range of an existing buffer is called a sub-buffer. A buffer may be overlaid with any number of sub-buffers. Accessors can be created to operate on these sub-buffers. Refer to Section 4.7.2 for details on sub-buffer creation and restrictions. A requirement to access a sub-buffer is represented by specifying its range, e.g. CG(b1RW,[0,5)) represents the requirement of accessing the range [0,5) buffer b1 in read write mode.

If two accessors are constructed to access the same buffer, but both are to non-overlapping sub-buffers of the buffer, then the two accessors are said to not overlap, otherwise the accessors do overlap. Overlapping is the test that is used to determine the scheduling order of command groups. Command-groups with non-overlapping requirements may execute concurrently.

Table 5. Requirements on overlapping vs non-overlapping sub-buffer
SYCL Application Enqueue Order SYCL Kernel Execution Order
sycl::queue q1(context1);
q1.submit(CGa(b1{RW,[0,10)}));
q1.submit(CGb(b1{RW,[10,20)));
q1.submit(CGc(b1RW,[5,15)));
overlap

It is permissible for command groups that only read data to not copy that data back to the host or other devices after reading and for the runtime to maintain multiple read-only copies of the data on multiple devices.

A special case of requirement is the one defined by a host accessor. Host accessors are represented with H(MemoryObjectAccessMode), e.g, H(b1RW) represents a host accessor to b1 in read-write mode. Host accessors are a special type of accessor constructed from a memory object outside a command group, and require that the data associated with the given memory object is available on the host in the given pointer. This causes the runtime to block on construction of this object until the requirement has been satisfied. Host accessor objects are effectively barriers on all accesses to a certain memory object. Table 6 shows an example of multiple command groups enqueued to the same queue. Once the host accessor H(b1RW) is reached, the execution cannot proceed until CGa is finished. However, CGb does not have any requirements on b1, therefore, it can execute concurrently with the barrier. Finally, CGc will be enqueued after H(b1RW) is finished, but still has to wait for CGb to conclude for all its requirements to be satisfied. See Section 3.9.8 for details on synchronization rules.

Table 6. Execution of command groups when using host accessors
SYCL Application Enqueue Order SYCL Kernel Execution Order
sycl::queue q1;
q1.submit(CGa(b1RW));
q1.submit(CGb(b2RW));

H(b1RW);

q1.submit(CGc(b1RW, b2RW));
host acc

3.8.2. SYCL device memory model

The memory model for SYCL devices is based on the OpenCL 1.2 memory model. Work-items executing in a kernel have access to three distinct address spaces (memory regions) and a virtual address space overlapping some concrete address spaces:

  • Global-memory is accessible to all work-items in all work-groups. Work-items can read from or write to any element of a global memory object. Reads and writes to global memory may be cached depending on the capabilities of the device. Global memory is persistent across kernel invocations. Concurrent access to a location in an USM allocation by two or more executing kernels where at least one kernel modifies that location is a data race; there is no guarantee of correct results unless mem-fence and atomic operations are used.

  • Local-memory is accessible to all work-items in a single work-group. Attempting to access local memory in one work-group from another work-group results in undefined behavior. This memory region can be used to allocate variables that are shared by all work-items in a work-group. Work-group-level visibility allows local memory to be implemented as dedicated regions of the device memory where this is appropriate.

  • Private-memory is a region of memory private to a work-item. Attempting to access private memory in one work-item from another work-item results in undefined behavior.

  • Generic-memory is a virtual address space which overlaps the global, local and private address spaces. Therefore, an object that resides in the global, local, or private address space can also be accessed through the generic address space.

3.8.2.1. Access to memory

Accessors in the device kernels provide access to the memory objects, acting as pointers to the corresponding address space.

Pointers can be passed directly as kernel arguments if an implementation supports USM. See Section 4.8 for information on when it is legal to dereference pointers passed from the host inside kernels.

To allocate local memory within a kernel, the user can either pass a sycl::local_accessor object as a argument to an ND-range kernel (that has a user-defined work-group size), or can define a variable in work-group scope inside sycl::parallel_for_work_group.

Any variable defined inside a sycl::parallel_for scope or sycl::parallel_for_work_item scope will be allocated in private memory. Any variable defined inside a sycl::parallel_for_work_group scope will be allocated in local memory.

Users can create accessors that reference sub-buffers as well as entire buffers.

Within kernels, the underlying C++ pointer types can be obtained from an accessor. The pointer types will contain a compile-time deduced address space. So, for example, if a C++ pointer is obtained from an accessor to global memory, the C++ pointer type will have a global address space attribute attached to it. The address space attribute will be compile-time propagated to other pointer values when one pointer is initialized to another pointer value using a defined algorithm.

When developers need to explicitly state the address space of a pointer value, one of the explicit pointer classes can be used. There is a different explicit pointer class for each address space: sycl::raw_local_ptr, sycl::raw_global_ptr, sycl::raw_private_ptr, sycl::raw_generic_ptr, sycl::decorated_local_ptr, sycl::decorated_global_ptr, sycl::decorated_private_ptr, or sycl::decorated_generic_ptr.

The classes with the decorated prefix expose pointers that use an implementation-defined address space decoration, while the classes with the raw prefix do not. Buffer accessors with an access target target::device or target::constant_buffer and local accessors can be converted into explicit pointer classes (multi_ptr).

For templates that need to adapt to different address spaces, a sycl::multi_ptr class is defined which is templated via a compile-time constant enumerator value to specify the address space.

3.8.3. SYCL memory consistency model

The SYCL memory consistency model is based upon the memory consistency model of the C++ core language. Where SYCL offers extensions to classes and functions that may affect memory consistency, the default behavior when these extensions are not used always matches the behavior of standard C++.

A SYCL implementation must guarantee that the same memory consistency model is used across host and device code. Every device compiler must support the memory model defined by the minimum version of C++ described in Section 3.9.1; SYCL implementations supporting additional versions of C++ must also support the corresponding memory models.

Within a work-item, operations are ordered according to the sequenced before relation defined by the C++ core language.

Ensuring memory consistency across different work-items requires careful usage of group barrier operations, mem-fence operations and atomic operations. The ordering of operations across different work-items is determined by the happens before relation defined by the C++ core language, with a single relation governing all address spaces (memory regions).

On any SYCL device, local and global memory may be made consistent across work-items in a single group through use of a group barrier operation. On SYCL devices supporting acquire-release or sequentially consistent memory orderings, all memory visible to a set of work-items may be made consistent across the work-items in that set through the use of mem-fence and atomic operations.

Memory consistency between the host and SYCL device(s), or different SYCL devices in the same context, can be guaranteed through synchronization in the host application as defined in Section 3.9.8. On SYCL devices supporting concurrent atomic accesses to USM allocations and acquire-release or sequentially consistent memory orderings, cross-device memory consistency can be enforced through the use of mem-fence and atomic operations.

3.8.3.1. Memory ordering
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
namespace sycl {

enum class memory_order : /* unspecified */ {
  relaxed,
  acquire,
  release,
  acq_rel,
  seq_cst
};

inline constexpr auto memory_order_relaxed = memory_order::relaxed;
inline constexpr auto memory_order_acquire = memory_order::acquire;
inline constexpr auto memory_order_release = memory_order::release;
inline constexpr auto memory_order_acq_rel = memory_order::acq_rel;
inline constexpr auto memory_order_seq_cst = memory_order::seq_cst;

} // namespace sycl

The memory synchronization order of a given atomic operation is controlled by a sycl::memory_order parameter, which can take one of the following values:

  • sycl::memory_order::relaxed;

  • sycl::memory_order::acquire;

  • sycl::memory_order::release;

  • sycl::memory_order::acq_rel;

  • sycl::memory_order::seq_cst.

The meanings of these values are identical to those defined in the C++ core language.

These memory orders are listed above from weakest (memory_order::relaxed) to strongest (memory_order::seq_cst).

The complete set of memory orders is not guaranteed to be supported by every device, nor across all combinations of devices within a platform. The set of supported memory orders can be queried via the information descriptors for the sycl::device and sycl::context classes.

SYCL implementations are not required to support a memory order equivalent to std::memory_order::consume, and using this ordering within a SYCL device kernel results in undefined behavior. Developers are encouraged to use sycl::memory_order::acquire instead.

3.8.3.2. Memory scope
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
namespace sycl {

enum class memory_scope : /* unspecified */ {
  work_item,
  sub_group,
  work_group,
  device,
  system
};

inline constexpr auto memory_scope_work_item = memory_scope::work_item;
inline constexpr auto memory_scope_sub_group = memory_scope::sub_group;
inline constexpr auto memory_scope_work_group = memory_scope::work_group;
inline constexpr auto memory_scope_device = memory_scope::device;
inline constexpr auto memory_scope_system = memory_scope::system;

} // namespace sycl

The set of work-items and devices to which the memory ordering constraints of a given atomic operation apply is controlled by a sycl::memory_scope parameter, which can take one of the following values:

  • sycl::memory_scope::work_item The ordering constraint applies only to the calling work-item;

  • sycl::memory_scope::sub_group The ordering constraint applies only to work-items in the same sub-group as the calling work-item;

  • sycl::memory_scope::work_group The ordering constraint applies only to work-items in the same work-group as the calling work-item;

  • sycl::memory_scope::device The ordering constraint applies only to work-items executing on the same device as the calling work-item;

  • sycl::memory_scope::system The ordering constraint applies to any work-item or host thread in the system that is currently permitted to access the memory allocation containing the referenced object, as defined by the capabilities of buffers and USM.

The memory scopes are listed above from narrowest (memory_scope::work_item) to widest (memory_scope::system).

The complete set of memory scopes is not guaranteed to be supported by every device. The set of supported memory scopes can be queried via the information descriptors for the sycl::device and sycl::context classes.

The widest scope that can be applied to an atomic operation corresponds to the set of work-items which can access the associated memory location. For example, the widest scope that can be applied to atomic operations in work-group local memory is sycl::memory_scope::work_group. If a wider scope is supplied, the behavior is as-if the narrowest scope containing all work-items which can access the associated memory location was supplied.

The addition of memory scopes to the C++ memory model modifies the definition of some concepts from the C++ core language. For example: data races, the synchronizes-with relationship and sequential consistency must be defined in a way that accounts for atomic operations with differing (but compatible) scopes, in a manner similar to the OpenCL 2.0 specification. Efforts to formalize the memory model of SYCL are ongoing, and a formal memory model will be included in a future version of the SYCL specification.

3.8.3.3. Atomic operations

Atomic operations can be performed on memory in buffers and USM. The sycl::atomic_ref class must be used to provide safe atomic access to the buffer or USM allocation from device code.

3.8.3.4. Forward progress

This section, and any subsequent section referring to progress guarantees, uses the following terms as defined in the C++ core language: thread of execution; weakly parallel forward progress guarantees; parallel forward progress guarantees; concurrent forward progress guarantees; and block with forward progress guarantee delegation.

Each work-item in SYCL is a separate thread of execution, providing at least weakly parallel forward progress guarantees. Whether work-items provide stronger forward progress guarantees is implementation-defined.

All implementations must additionally ensure that a work-item arriving at a group barrier does not prevent other work-items in the same group from making progress. When a work-item arrives at a group barrier acting on group G, implementations must eventually select and potentially strengthen another work-item in group G that has not yet arrived at the barrier.

When a host thread blocks on the completion of a command previously submitted to a SYCL queue (for example, via the sycl::queue::wait function), it blocks with forward progress guarantee delegation.

SYCL commands submitted to a queue are not guaranteed to begin executing until a host thread blocks on their completion. In the absence of multiple host threads, there is no guarantee that host and device code will execute concurrently.

3.9. The SYCL programming model

A SYCL program is written in standard C++. Host code and device code is written in the same C++ source file, enabling instantiation of templated kernels from host code and also enabling kernel source code to be shared between host and device. The device kernels are encapsulated C++ callable types (a function object with operator() or a lambda function), which have been designated to be compiled as SYCL kernels.

SYCL programs target heterogeneous systems. The kernels may be compiled and optimized for multiple different processor architectures with very different binary representations.

3.9.1. Minimum version of C++

The C++ features used in SYCL are based on a specific version of C++. Implementations of SYCL must support this minimum C++ version, which defines the C++ constructs that can consequently be used by SYCL feature definitions (for example, lambdas).

The minimum C++ version of this SYCL specification is determined by the normative C++ core language defined in Section 3.3. All implementations of this specification must support at least this core language, and features within this specification are defined using features of the core language. Note that not all core language constructs are supported within SYCL kernel functions or code invoked by a SYCL kernel function, as detailed by Section 5.4.

Implementations may support newer C++ versions than the minimum required by SYCL. Code written using newer features than the SYCL requirement, though, may not be portable to other implementations that don’t support the same C++ version.

3.9.2. Alignment with future versions of C++

Some features of SYCL are aligned with the next C++ specification, as defined in Section 3.3.

The following features are pre-adopted by SYCL 2020 and made available in the sycl:: namespace: std::span, std::dynamic_extent, std::bit_cast. The implementations of pre-adopted features are compliant with the next C++ specification, and are expected to forward directly to standard C++ features in a future version of SYCL.

The following features of SYCL 2020 use syntax based on the next C++ specification: sycl::atomic_ref. These features behave as described in the next C++ specification, barring modifications to ensure compatibility with other SYCL 2020 features and heterogeneous programming. Any such modifications are documented in the corresponding sections of this specification.

3.9.3. Basic data parallel kernels

Data-parallel kernels that execute as multiple work-items and where no local synchronization is required are enqueued with the sycl::parallel_for function parameterized by a sycl::range parameter. These kernels will execute the kernel function body once for each work-item in the specified range.

Functionality tied to groups of work-items, including group barriers and local memory, must not be used within these kernels.

Variables with reduction semantics can be added to basic data parallel kernels using the features described in Section 4.9.2.

3.9.4. Work-group data parallel kernels

Data parallel kernels can also execute in a mode where the set of work-items is divided into work-groups of user-defined dimensions. The user specifies the global range and local work-group size as parameters to the sycl::parallel_for function with a sycl::nd_range parameter. In this mode of execution, kernels execute over the nd-range in work-groups of the specified size. It is possible to share data among work-items within the same work-group in local or global memory and to synchronize between work-items in the same work-group by calling the group_barrier function. All work-groups in a given parallel_for will be the same size, and the global size defined in the nd-range must either be a multiple of the work-group size in each dimension, or the global size must be zero. When the global size is zero, the kernel function is not executed, the local size is ignored, and any dependencies are satisfied.

Work-groups may be further subdivided into sub-groups. The work-items that compose a sub-group are selected in an implementation-defined way, and therefore the size and number of sub-groups may differ for each kernel. Moreover, different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. The maximum number of work-items in any sub-group in a kernel is based on a combination of the kernel and its dispatch dimensions. The size of any sub-group in the dispatch is between 1 and this maximum sub-group size, and the size of an individual sub-group is invariant for the duration of a kernel’s execution. Similarly to work-groups, the work-items within the same sub-group can be synchronized by calling the group_barrier function.

Portable device code must not assume that work-items within a sub-group execute in any particular order, that work-groups are subdivided into sub-groups in a specific way, nor that the work-items within a sub-group provide specific forward progress guarantees.

Variables with reduction semantics can be added to work-group data parallel kernels using the features described in Section 4.9.2.

3.9.5. Hierarchical data parallel kernels

Based on developer and implementation feedback, the hierarchical data parallel kernel feature described next is undergoing improvements to better align with the frameworks and patterns prevalent in modern programming. As this is a key part of the SYCL API and we expect to make changes to it, we temporarily recommend that new codes refrain from using this feature until the new API is finished in a near-future version of the SYCL specification, when full use of the updated feature will be recommended for use in new code. Existing codes using this feature will of course be supported by conformant implementations of this specification.

The SYCL compiler provides a way of specifying data parallel kernels that execute within work-groups via a different syntax which highlights the hierarchical nature of the parallelism. This mode is purely a compiler feature and does not change the execution model of the kernel. Instead of calling sycl::parallel_for the user calls sycl::parallel_for_work_group with a sycl::range value representing the number of work-groups to launch and optionally a second sycl::range representing the size of each work-group for performance tuning. All code within the parallel_for_work_group scope effectively executes once per work-group. Within the parallel_for_work_group scope, it is possible to call parallel_for_work_item which creates a new scope in which all work-items within the current work-group execute. This enables a programmer to write code that looks like there is an inner work-item loop inside an outer work-group loop, which closely matches the effect of the execution model. All variables declared inside the parallel_for_work_group scope are allocated in work-group local memory, whereas all variables declared inside the parallel_for_work_item scope are declared in private memory. All parallel_for_work_item calls within a given parallel_for_work_group execution must have the same dimensions.

3.9.6. Kernels that are not launched over parallel instances

Simple kernels for which only a single instance of the kernel function will be executed are enqueued with the sycl::single_task function. The kernel enqueued takes no “work-item id” parameter and will only execute once. The behavior is logically equivalent to executing a kernel on a single compute unit with a single work-group comprising only one work-item. Such kernels may be enqueued on multiple queues and devices and as a result may be executed in task-parallel fashion.

3.9.7. Pre-defined kernels

Some SYCL backends may expose pre-defined functionality to users as kernels. These kernels are not programmable, hence they are not bound by the SYCL C++ programming model restrictions, and how they are written is implementation-defined.

3.9.8. Synchronization

Synchronization of processing elements executing inside a device is handled by the SYCL device kernel following the SYCL kernel execution model. The synchronization of the different SYCL device kernels executing with the host memory is handled by the SYCL application via the SYCL runtime.

3.9.8.1. Synchronization in the SYCL application

Synchronization points between host and device(s) are exposed through the following operations:

  • Buffer destruction: The destructors for sycl::buffer, sycl::unsampled_image and sycl::sampled_image objects wait for all submitted work on those objects to complete and to copy the data back to host memory before returning. These destructors only wait if the object was constructed with attached host memory and if data needs to be copied back to the host.

    More complex forms of synchronization on buffer destruction can be specified by the user by constructing buffers with other kinds of references to memory, such as shared_ptr and unique_ptr.

  • Host Accessors: The constructor for a host accessor waits for all kernels that modify the same buffer (or image) in any queues to complete and then copies data back to host memory before the constructor returns. Any command groups with requirements to the same memory object cannot execute until the host accessor is destroyed as shown on Table 6.

  • Command group enqueue: The SYCL runtime internally ensures that any command groups added to queues have the correct event dependencies added to those queues to ensure correct operation. Adding command groups to queues never blocks. Instead any required synchronization is added to the queue and events of type sycl::event are returned by the queue’s submit function that contain event information related to the specific command group.

  • Queue operations: The user can manually use queue operations, such as sycl::queue::wait() to block execution of the calling thread until all the command groups submitted to the queue have finished execution. Note that this will also affect the dependencies of those command groups in other queues.

  • SYCL event objects: SYCL provides sycl::event objects which can be used for synchronization. If synchronization is required across SYCL contexts from different SYCL backends, then the SYCL runtime ensures that extra host-based synchronization is added to enable the SYCL event objects to operate between contexts correctly.

Note that the destructors of other SYCL objects (sycl::queue, sycl::context,…) do not block. Only a sycl::buffer, sycl::sampled_image or sycl::unsampled_image destructor might block. The rationale is that an object without any side effect on the host does not need to block on destruction as it would impact the performance. So it is up to the programmer to use a member function to wait for completion in some cases if this does not fit the goal. See Section 3.9.12 for more information on object life time.

3.9.8.2. Synchronization in SYCL kernels

In SYCL, synchronization can be either global or local within a group of work-items. Synchronization between work-items in a single group is achieved using a group barrier.

All the work-items of a group must execute the barrier before any are allowed to continue execution beyond the barrier. Note that the group barrier must be encountered by all work-items of a group executing the kernel or by none at all. In SYCL, work-group barrier and sub-group barrier functionality is exposed via the group_barrier function.

Synchronization between work-items in different work-groups via atomic operations is possible only on SYCL devices with certain capabilities, as described in Section 3.8.3.

3.9.9. Error handling

In SYCL, there are two types of errors: synchronous errors that can be detected immediately when an API call is made, and asynchronous errors that can only be detected later after an API call has returned. Synchronous errors, such as failure to construct an object, are reported immediately by the runtime throwing an exception. Asynchronous errors, such as an error occurring during execution of a kernel on a device, are reported via an asynchronous error-handler mechanism.

Asynchronous errors are not reported immediately as they occur. The asynchronous error handler for a context or queue is called with a sycl::exception_list object, which contains a list of asynchronously-generated exception objects, on the conditions described by Section 4.13.1.1 and Section 4.13.1.2.

Asynchronous errors may be generated regardless of whether the user has specified any asynchronous error handler(s), as described in Section 4.13.1.2.

Some SYCL backends can report errors that are specific to the platform they are targeting, or that are more concrete than the errors provided by the SYCL API. Any error reported by a SYCL backend must derive from the base sycl::exception. When a user wishes to capture specifically an error thrown by a SYCL backend, she must include the SYCL backend-specific headers for said SYCL backend.

3.9.10. Fallback mechanism

A command group function object can be submitted either to a single queue to be executed on, or to a secondary queue. If a command group function object fails to be enqueued to the primary queue, then the system will attempt to enqueue it to the secondary queue, if given as a parameter to the submit function. If the command group function object fails to be queued to both of these queues, then a synchronous SYCL exception will be thrown.

It is possible that a command group may be successfully enqueued, but then asynchronously fail to run, for some reason. In this case, it may be possible for the runtime system to execute the command group function object on the secondary queue, instead of the primary queue. The situations where a SYCL runtime may be able to achieve this asynchronous fall-back is implementation-defined.

3.9.11. Scheduling of kernels and data movement

A command group function object takes a reference to a command group handler as a parameter and anything within that scope is immediately executed and takes the handler object as a parameter. The intention is that a user will perform calls to SYCL functions, member functions, destructors and constructors inside that scope. These calls will be non-blocking on the host, but enqueue operations to the queue that the command group is submitted to. All user functions within the command group scope will be called on the host as the command group function object is executed, but any commands it invokes will be added to the SYCL queue. All commands added to the queue will be executed out-of-order from each other, according to their data dependencies.

3.9.12. Managing object lifetimes

A SYCL application does not initialize any SYCL backend features until a sycl::context object is created. A user does not need to explicitly create a sycl::context object, but they do need to explicitly create a sycl::queue object, for which a sycl::context object will be implicitly created if not provided by the user.

All SYCL backend objects encapsulated in SYCL objects are reference-counted and will be destroyed once all references have been released. This means that a user needs only create a SYCL queue (which will automatically create an SYCL context) for the lifetime of their application to initialize and release any SYCL backend objects safely.

There is no global state specified to be required in SYCL implementations. This means, for example, that if the user creates two queues without explicitly constructing a common context, then a SYCL implementation does not have to create a shared context for the two queues. Implementations are free to share or cache state globally for performance, but it is not required.

Memory objects can be constructed with or without attached host memory. If no host memory is attached at the point of construction, then destruction of that memory object is non-blocking. The user may use C++ standard pointer classes for sharing the host data with the user application and for defining blocking, or non-blocking behavior of the buffers and images. If host memory is attached by using a raw pointer, then the default behavior is followed, which is that the destructor will block until any command groups operating on the memory object have completed, then, if the contents of the memory object is modified on a device those contents are copied back to host and only then does the destructor return.

In the case where host memory is shared between the user application and the SYCL runtime with a std::shared_ptr, then the reference counter of the std::shared_ptr determines whether the buffer needs to copy data back on destruction, and in that case the blocking or non-blocking behavior depends on the user application.

Instead of a std::shared_ptr, a std::unique_ptr may be provided, which uses move semantics for initializing and using the associated host memory. In this case, the behavior of the buffer in relation to the user application will be non-blocking on destruction.

As said in Section 3.9.8, the only blocking operations in SYCL (apart from explicit wait operations) are:

  • host accessor constructor, which waits for any kernels enqueued before its creation that write to the corresponding object to finish and be copied back to host memory before it starts processing. The host accessor does not necessarily copy back to the same host memory as initially given by the user;

  • memory object destruction, in the case where copies back to host memory have to be done or when the host memory is used as a backing-store.

3.9.13. Device discovery and selection

A user specifies which queue to submit a command group function object and each queue is targeted to run on a specific device (and context). A user can specify the actual device on queue creation, or they can specify a device selector which causes the SYCL runtime to choose a device based on the user’s provided preferences. Specifying a device selector causes the SYCL runtime to perform device discovery. No device discovery is performed until a SYCL device selector is passed to a queue constructor. Device topology may be cached by the SYCL runtime, but this is not required.

Device discovery will return all devices from all platforms exposed by all the supported SYCL backends.

3.9.14. Interfacing with the SYCL backend API

There are two styles of developing a SYCL application:

  1. writing a pure SYCL generic application;

  2. writing a SYCL application that relies on some SYCL backend specific behavior.

When users follow 1., there is no assumption about what SYCL backend will be used during compilation or execution of the SYCL application. Therefore, the SYCL backend API is not assumed to be available to the developer. Only standard C++ types and interfaces are assumed to be available, as described in Section 3.9. Users only need to include the <sycl/sycl.hpp> header to write a SYCL generic application.

On the other hand, when users follow 2., they must know what SYCL backend APIs they are using. In this case, any header required for the normal programmability of the SYCL backend API is assumed to be available to the user. In addition to the <sycl/sycl.hpp> header, users must also include the SYCL backend-specific header as defined in Section 4.3. The SYCL backend-specific header provides the interoperability interface for the SYCL API to interact with native backend objects.

The interoperability API is defined in Section 4.5.1.

3.10. Memory objects

SYCL memory objects represent data that is handled by the SYCL runtime and can represent allocations in one or multiple devices at any time. Memory objects, both buffers and images, may have one or more underlying native backend objects to ensure that queues objects can use data in any device. A SYCL implementation may have multiple native backend objects for the same device. The SYCL runtime is responsible for ensuring the different copies are up-to-date whenever necessary, using whatever mechanism is available in the system to update the copies of the underlying native backend objects.

Implementation note

A valid mechanism for this update is to transfer the data from one SYCL backend into the system memory using the SYCL backend-specific mechanism available, and then transfer it to a different device using the mechanism exposed by the new SYCL backend.

Memory objects in SYCL fall into one of two categories: buffer objects and image objects. A buffer object stores a one-, two- or three-dimensional collection of elements that are stored linearly directly back to back in the same way C or C++ stores arrays. An image object is used to store a one-, two- or three-dimensional texture, frame-buffer or image data that may be stored in an optimized and device-specific format in memory and must be accessed through specialized operations.

Elements of a buffer object can be a scalar data type (such as an int or float), vector data type, or a user-defined structure. In SYCL, a buffer object is a templated type (sycl::buffer), parameterized by the element type and number of dimensions. An image object is stored in one of a limited number of formats. The elements of an image object are selected from a list of predefined image formats which are provided by an underlying SYCL backend implementation. Images are encapsulated in the sycl::unsampled_image or sycl::sampled_image types, which are templated by the number of dimensions in the image. The minimum number of elements in an image object is one. The minimum number of elements in a buffer object is zero.

The fundamental differences between a buffer and an image object are:

  • elements in a buffer are stored in an array of 1, 2 or 3 dimensions and can be accessed using an accessor by a kernel executing on a device. The accessors for kernels provide a member function to get C++ pointer types, or the sycl::global_ptr class;

  • elements of an image are stored in a format that is opaque to the user and cannot be directly accessed using a pointer. SYCL provides image accessors and samplers to allow a kernel to read from or write to an image;

  • for a buffer object the data is accessed within a kernel in the same format as it is stored in memory, but in the case of an image object the data is not necessarily accessed within a kernel in the same format as it is stored in memory;

  • image elements are always a 4-component vector (each component can be a float or signed/unsigned integer) in a kernel. Accessors that read an image convert image elements from their storage format into a 4-component vector.

    Similarly, the SYCL accessor member functions provided to write to an image convert the image element from a 4-component vector to the appropriate image format specified such as four 8-bit elements, for example.

Users may want fine-grained control of the synchronization, memory management and storage semantics of SYCL image or buffer objects. For example, a user may wish to specify the host memory for a memory object to use, but may not want the memory object to block on destruction.

Depending on the control and the use cases of the SYCL applications, well established C++ classes and patterns can be used for reference counting and sharing data between user applications and the SYCL runtime. For control over memory allocation on the host and mapping between host and device memory, pre-defined or user-defined C++ std::allocator classes are used. For better control of synchronization between a SYCL and a non SYCL application that share data, std::shared_ptr and std::mutex classes are used.

3.11. Multi-dimensional objects and linearization

SYCL defines a number of multi-dimensional objects such as buffers and accessors. The iteration space of work-items in a kernel may also be multi-dimensional. The size of each dimension is defined by a range object of one, two or three dimensions, and an element in the multi-dimensional space can be identified using an id object with the same number of dimensions as the corresponding range.

If the size of any dimension is zero, there are zero elements in the multi-dimensional range.

3.11.1. Linearization

Some multi-dimensional objects can be viewed in a linear form. When this happens, the right-most term in the object’s range varies fastest in the linearization.

A three-dimensional element id{id0, id1, id2} within a three-dimensional object of range range{r0, r1, r2} has a linear position defined by:

A two-dimensional element id{id0, id1} within a two-dimensional range{r0, r1} follows a similar equation:

A one-dimensional element id{id0} within a one-dimensional range range{r0} is equivalent to its linear form.

3.11.2. Multi-dimensional subscript operators

Some multi-dimensional objects can be indexed using the subscript operator where consecutive subscript operators correspond to each dimension. The right-most operator varies fastest, as with standard C++ arrays. Formally, a three-dimensional subscript access a[id0][id1][id2] references the element at id{id0, id1, id2}. A two-dimensional subscript access a[id0][id1] references the element at id{id0, id1}. A one-dimensional subscript access a[id0] references the element at id{id0}.

3.12. Implementation options

The SYCL language is designed to allow several different possible implementations. The contents of this section are non-normative, so implementations need not follow the guidelines listed here. However, this section is intended to help readers understand the possible strategies that can be used to implement SYCL.

3.12.1. Single source multiple compiler passes

With this technique, known as SMCP, there are separate host and device compilers. Each SYCL source file is compiled two times: once by the host compiler and once by the device compiler. An implementation could support more than one device compiler, in which case each SYCL source file is compiled more than two times. The host compiler in this technique could be an off-the-shelf compiler with no special knowledge of SYCL, but the device compiler must be SYCL aware. The device compiler parses the source file to identify each SYCL kernel function and any device functions it calls. SYCL is designed so that this analysis can be done statically. The device compiler then generates code only for the SYCL kernel functions and the device functions.

Typically, the device compilers generate header files which interface between the host compiler and the SYCL runtime. Therefore, the device compiler runs first, and then the host compiler consumes these header files when generating the host code.

The device compilers in this technique generate one or more device images for the SYCL kernel functions, which can be read by the SYCL runtime. Each device image could either contain native ISA for a device or it could contain an intermediate language such as SPIR-V. In the later case, the SYCL runtime must translate the intermediate language into native device ISA when the SYCL kernel function is submitted to a device.

Since this technique has separate host and device compilers, there needs to be some way to associate a SYCL kernel function (which is compiled by the device compiler) with the code that invokes it (which is compiled by the host compiler). Implementations conformant to the reduced feature set (Section B.2) can do this by using the C++ type of the SYCL kernel function. This type is specified via the kernel name template parameter if the SYCL kernel function is a lambda function, or it is obtained from the class type if the SYCL kernel function is an object. Implementations conformant to the full feature set (Section B.1) do not require a kernel name at the invocation site, so they must implement some other way to make the association.

3.12.2. Single source single compiler pass

With this technique, known as SSCP, the vendor implements a custom compiler that reads each SYCL source file only once, and that compiler generates the host code as well as the device images for the SYCL kernel functions. As in the SMCP case, each device image could either contain native device ISA or an intermediate language.

3.12.3. Library-only implementation

It is also possible to implement SYCL purely as a library, using an off-the-shelf host compiler with no special support for SYCL. In such an implementation, each kernel may run on the host system.

3.13. Language restrictions in kernels

The SYCL kernels are executed on SYCL devices and all of the functions called from a SYCL kernel are going to be compiled for the device by a SYCL device compiler. Due to restrictions of the heterogeneous devices where the SYCL kernel will execute, there are certain restrictions on the base C++ language features that can be used inside kernel code. For details on language restrictions please refer to Section 5.4.

SYCL kernels use arguments that are captured by value in the command group scope or are passed from the host to the device using accessors. Sharing data structures between host and device code imposes certain restrictions, such as using only objects that are device copyable, and in general, no pointers initialized for the host can be used on the device. SYCL memory objects, such as sycl::buffer, sycl::unsampled_image, and sycl::sampled_image, cannot be passed to a kernel. Instead, a kernel must interact with these objects through accessors. No hierarchical structures of these memory object classes are supported and any other data containers need to be converted to the SYCL data management classes using the SYCL interface. For more details on the rules for kernel parameter passing, please refer to Section 4.12.4.

Pointers to USM allocations may be passed to a kernel either directly as arguments or indirectly inside of other objects. Pointers to USM allocations that are passed as kernel arguments are treated as being in the global address space.

3.13.1. Device copyable

The SYCL implementation may need to copy data between the host and a device or between two devices. For example, this may occur when a command group has a requirement for the contents of a buffer or when the application passes certain arguments to a SYCL kernel function (as described in Section 4.12.4). Such data must have a type that is device copyable as defined below.

Any type that is trivially copyable (as defined by the C++ core language) is implicitly device copyable.

Although implementations are not required to support device code that calls library functions from the C++ core language, some implementations may provide device support for some of these functions. If the implementation provides device support for one of the following classes, that type is also implicitly device copyable:

  • std::array<T, 0>;

  • std::array<T, N> if T is device copyable;

  • std::optional<T> if T is device copyable;

  • std::pair<T1, T2> if T1 and T2 are device copyable;

  • std::tuple<>;

  • std::tuple<Types...> if all the types in the parameter pack Types are device copyable;

  • std::variant<>;

  • std::variant<Types...> if all the types in the parameter pack Types are device copyable;

  • std::basic_string_view<CharT, Traits>;

  • std::span<ElementType, Extent> (the std::span type has been introduced in C++20);

  • sycl::span<ElementType, Extent>.

If the implementation provides device support for one of the classes listed above, arrays of that class and cv-qualified versions of that class are also device copyable.

The types std::basic_string_view<CharT, Traits> and std::span<ElementType, Extent> are both view types, which reference underlying data that is not contained within their type. Although these view types are device copyable, the implementation copies just the view and not the contained data when doing an inter-device copy. In order to reference the contained data after such a copy, the application must allocate the contained data in unified shared memory (USM) that is accessible on both the host and device (or on both devices in the case of a device-to-device copy).

In addition, the implementation may allow the application to explicitly declare certain class types as device copyable. If the implementation has this support, it must predefine the preprocessor macro SYCL_DEVICE_COPYABLE to 1, and it must not predefine this preprocessor macro if it does not have this support. When the implementation has this support, a class type T is device copyable if all of the following statements are true:

  • The application defines the trait is_device_copyable_v<T> to true;

  • Type T has at least one eligible copy constructor, move constructor, copy assignment operator, or move assignment operator;

  • Each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is public;

  • When doing an inter-device transfer of an object of type T, the effect of each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is the same as a bitwise copy of the object;

  • Type T has a public non-deleted destructor;

  • The destructor has no effect when executed on the device.

When the application explicitly declares a class type to be device copyable, arrays of that type and cv-qualified versions of that type are also device copyable, and the implementation sets the is_device_copyable_v trait to true for these array and cv-qualified types.

It is unspecified whether the implementation actually calls the copy constructor, move constructor, copy assignment operator, or move assignment operator of a class declared as is_device_copyable_v when doing an inter-device copy. Since these operations must all be the same as a bitwise copy, the implementation may simply copy the memory where the object resides. Likewise, it is unspecified whether the implementation actually calls the destructor for such a class on the device since the destructor must have no effect on the device.

3.14. Endianness support

SYCL does not mandate any particular byte order, but the byte order of the host always matches the byte order of the devices. This allows data to be copied between the host and the devices without any byte swapping.

3.15. Example SYCL application

Below is a more complex example application, combining some of the features described above.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
#include <iostream>
#include <sycl/sycl.hpp>
using namespace sycl; // (optional) avoids need for "sycl::" before SYCL names

// Size of the matrices
constexpr size_t N = 2000;
constexpr size_t M = 3000;

int main() {
  // Create a queue to work on
  queue myQueue;

  // Create some 2D buffers of float for our matrices
  buffer<float, 2> a { range<2> { N, M } };
  buffer<float, 2> b { range<2> { N, M } };
  buffer<float, 2> c { range<2> { N, M } };

  // Launch an asynchronous kernel to initialize a
  myQueue.submit([&](handler& cgh) {
    // The kernel writes a, so get a write accessor on it
    accessor A { a, cgh, write_only };

    // Enqueue a parallel kernel iterating on a N*M 2D iteration space
    cgh.parallel_for(range<2> { N, M },
                     [=](id<2> index) { A[index] = index[0] * 2 + index[1]; });
  });

  // Launch an asynchronous kernel to initialize b
  myQueue.submit([&](handler& cgh) {
    // The kernel writes b, so get a write accessor on it
    accessor B { b, cgh, write_only };

    // From the access pattern above, the SYCL runtime detects that this
    // command_group is independent from the first one and can be
    // scheduled independently

    // Enqueue a parallel kernel iterating on a N*M 2D iteration space
    cgh.parallel_for(range<2> { N, M }, [=](id<2> index) {
      B[index] = index[0] * 2014 + index[1] * 42;
    });
  });

  // Launch an asynchronous kernel to compute matrix addition c = a + b
  myQueue.submit([&](handler& cgh) {
    // In the kernel a and b are read, but c is written
    accessor A { a, cgh, read_only };
    accessor B { b, cgh, read_only };
    accessor C { c, cgh, write_only };

    // From these accessors, the SYCL runtime will ensure that when
    // this kernel is run, the kernels computing a and b have completed

    // Enqueue a parallel kernel iterating on a N*M 2D iteration space
    cgh.parallel_for(range<2> { N, M },
                     [=](id<2> index) { C[index] = A[index] + B[index]; });
  });

  // Ask for an accessor to read c from application scope.  The SYCL runtime
  // waits for c to be ready before returning from the constructor
  host_accessor C { c, read_only };
  std::cout << std::endl << "Result:" << std::endl;
  for (size_t i = 0; i < N; i++) {
    for (size_t j = 0; j < M; j++) {
      // Compare the result to the analytic value
      if (C[i][j] != i * (2 + 2014) + j * (1 + 42)) {
        std::cout << "Wrong value " << C[i][j] << " on element " << i << " "
                  << j << std::endl;
        exit(-1);
      }
    }
  }

  std::cout << "Good computation!" << std::endl;
  return 0;
}

4. SYCL programming interface

The SYCL programming interface provides a common abstracted feature set to one or more SYCL backend APIs. This section describes the C++ library interface to the SYCL runtime which executes across those SYCL backends.

The entirety of the SYCL interface defined in this section is required to be available for any SYCL backends, with the exception of the interoperability interface, which is described in general terms in this document, not pertaining to any particular SYCL backend.

SYCL guarantees that all the member functions and special member functions of the SYCL classes described are thread safe.

The underlying types for all enumerations defined in this specification are implementation-defined. In addition, all enumerators within an enumeration have some implementation-defined unique value unless the specification specifically indicates a values for the enumerator.

4.1. Backends

The SYCL backends that can be supported by a SYCL implementation are identified using the enum class backend.

1
2
3
4
5
namespace sycl {
enum class backend : /* unspecified */ {
  /* see below */
};
} // namespace sycl

The enum class backend is implementation-defined and must be populated with a unique identifier for each SYCL backend that the SYCL implementation can support. Note that the SYCL backends listed in the enum class backend are not guaranteed to be available in a given installation.

Each named SYCL backend enumerated in the enum class backend must be associated with a SYCL backend specification. Many sections of this specification will refer to the associated SYCL backend specification.

4.1.1. Backend macros

As the identifiers defined in enum class backend are implementation-defined, and the associated backends not guaranteed to be available, a SYCL implementation must also define a preprocessor macro for each of these identifiers. If the SYCL backend is defined by the Khronos SYCL group, the name of the macro has the form SYCL_BACKEND_<backend_name>, where backend_name is the associated identifier from backend in all upper-case. See Chapter 6 for the name of the macro if the vendor defines the SYCL backend outside of the Khronos SYCL group.

If a backend listed in the enum class backend is not available, the associated macro must be left undefined.

4.2. Generic vs non-generic SYCL

The SYCL programming API is split into two categories; generic SYCL and non-generic SYCL. Almost everything in the SYCL programming API is considered generic SYCL. However any usage of the enum class backend is considered non-generic SYCL and should only be used for SYCL backend specialized code paths, as the identifiers defined in backend are implementation-defined.

In any non-generic SYCL application code where the backend enum class is used, the expression must be guarded with a preprocessor #ifdef guard using the associated preprocessor macro to ensure that the SYCL application will compile even if the SYCL implementation does not support that SYCL backend being specialized for.

4.3. Header files and namespaces

SYCL provides one standard header file: <sycl/sycl.hpp>, which needs to be included in every translation unit that uses the SYCL programming API.

All SYCL classes, constants, types and functions defined by this specification should exist within the ::sycl namespace.

For compatibility with SYCL 1.2.1, SYCL provides another standard header file: <CL/sycl.hpp>, which can be included in place of <sycl/sycl.hpp>. In that case, all SYCL classes, constants, types and functions defined by this specification should exist within the ::cl::sycl C++ namespace.

For consistency, the programming API will only refer to the <sycl/sycl.hpp> header and the ::sycl namespace, but this should be considered synonymous with the SYCL 1.2.1 header and namespace.

Include paths starting with "sycl/ext/" and "sycl/backend/" are reserved for extensions to SYCL and for backend interop headers respectively. Other include paths starting with "sycl/" and the sycl::detail namespace are reserved for implementation details.

When a SYCL backend is defined by the Khronos SYCL group, functionality for that SYCL backend is available via the header "sycl/backend/<backend_name>.hpp", and all SYCL backend-specific functionality is made available in the namespace sycl::<backend_name> where <backend_name> is the name of the SYCL backend as defined in the SYCL backend specification.

Chapter 6 defines the allowable header files and namespaces for any extensions that a vendor may provide, including any SYCL backend that the vendor may define outside of the Khronos SYCL group.

Unless otherwise specified, the behavior of a SYCL program is undefined if it adds any entity to namespace sycl or to a namespace within namespace sycl.

4.4. Class availability

In SYCL some SYCL runtime classes are available to the SYCL application, some are available within a SYCL kernel function and some are available on both and can be passed as arguments to a SYCL kernel function.

Each of the following SYCL runtime classes: buffer, buffer_allocator, context, device, device_image, event, exception, handler, host_accessor, host_sampled_image_accessor, host_unsampled_image_accessor, id, image_allocator, kernel, kernel_id, marray, kernel_bundle, nd_range, platform, queue, range, sampled_image, image_sampler, stream, unsampled_image and vec must be available to the host application.

Each of the following SYCL runtime classes: accessor, atomic_ref, device_event, group, h_item, id, item, local_accessor, marray, multi_ptr, nd_item, range, reducer, sampled_image_accessor, stream, sub_group, unsampled_image_accessor and vec must be available within a SYCL kernel function.

4.5. Common interface

When a dimension template parameter is used in SYCL classes, it is defaulted as 1 in most cases.

4.5.1. Backend interoperability

Many of the SYCL runtime classes may be implemented such that they encapsulate an object unique to the SYCL backend that underpins the functionality of that class. Where appropriate, these classes may provide an interface for interoperating between the SYCL runtime object and the native backend object in order to support interoperability within an application between SYCL and the associated SYCL backend API.

There are three forms of interoperability with SYCL runtime classes: interoperability on the SYCL application with the SYCL backend API, interoperability within a SYCL kernel function with the equivalent kernel language types of the SYCL backend, and interoperability within a host task with the interop_handle.

SYCL application interoperability, SYCL kernel function interoperability and host task interoperability are provided via different interfaces and may have different behavior for the same SYCL object.

SYCL application interoperability may be provided for buffer, context, device, device_image, event, kernel, kernel_bundle, platform, queue, sampled_image, and unsampled_image.

SYCL kernel function interoperability may be provided for accessor, device_event, local_accessor, sampled_image_accessor, stream and unsampled_image_accessor inside kernel scope only and is not available outside of that scope.

host task interoperability may be provided for accessor, sampled_image_accessor, unsampled_image_accessor, queue, device, context inside the scope of a host task only, see Section 4.10.

Support for SYCL backend interoperability is optional and therefore not required to be provided by a SYCL implementation. A SYCL application using SYCL backend interoperability is considered to be non-generic SYCL.

Details on the interoperability for a given SYCL backend are available on the SYCL backend specification document for that SYCL backend.

4.5.1.1. Type traits backend_traits
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
namespace sycl {

template <backend Backend> class backend_traits {
 public:
  template <class T> using input_type = /* see below */;

  template <class T> using return_type = /* see below */;
};

template <backend Backend, typename SyclType>
using backend_input_t =
    typename backend_traits<Backend>::template input_type<SyclType>;

template <backend Backend, typename SyclType>
using backend_return_t =
    typename backend_traits<Backend>::template return_type<SyclType>;

} // namespace sycl

A series of type traits are provided for SYCL backend interoperability, defined in the backend_traits class.

A specialization of backend_traits must be provided for each named SYCL backend enumerated in the enum class backend that is available at compile time.

The type alias backend_input_t is provided to enable less verbose access to the input_type type within backend_traits for a specific SYCL object of type T. The type alias backend_return_t is provided to enable less verbose access to the return_type type within backend_traits for a specific SYCL object of type T.

4.5.1.2. Template function get_native
1
2
3
4
5
6
namespace sycl {

template <backend Backend, class T>
backend_return_t<Backend, T> get_native(const T& syclObject);

} // namespace sycl

For each SYCL runtime class T which supports SYCL application interoperability, a specialization of get_native must be defined, which takes an instance of T and returns a SYCL application interoperability native backend object associated with syclObject which can be used for SYCL application interoperability. The lifetime of the object returned are backend-defined and specified in the backend specification.

For each SYCL runtime class T which supports kernel function interoperability, a specialization of get_native must be defined, which takes an instance of T and returns the kernel function interoperability native backend object associated with syclObject which can be used for kernel function interoperability. The availability and behavior of these template functions is defined by the SYCL backend specification document.

The get_native function must throw an exception with the errc::backend_mismatch error code if the backend of the SYCL object doesn’t match the target backend.

4.5.1.3. Template functions make_*
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
namespace sycl {

template <backend Backend>
platform make_platform(const backend_input_t<Backend, platform>& backendObject);

template <backend Backend>
device make_device(const backend_input_t<Backend, device>& backendObject);

template <backend Backend>
context make_context(const backend_input_t<Backend, context>& backendObject,
                     const async_handler asyncHandler = {});

template <backend Backend>
queue make_queue(const backend_input_t<Backend, queue>& backendObject,
                 const context& targetContext,
                 const async_handler asyncHandler = {});

template <backend Backend>
event make_event(const backend_input_t<Backend, event>& backendObject,
                 const context& targetContext);

template <backend Backend, typename T, int Dimensions = 1,
          typename AllocatorT = buffer_allocator<std::remove_const_t<T>>>
buffer<T, Dimensions, AllocatorT>
make_buffer(const backend_input_t<Backend, buffer<T, Dimensions, AllocatorT>>&
                backendObject,
            const context& targetContext, event availableEvent);

template <backend Backend, typename T, int Dimensions = 1,
          typename AllocatorT = buffer_allocator<std::remove_const_t<T>>>
buffer<T, Dimensions, AllocatorT>
make_buffer(const backend_input_t<Backend, buffer<T, Dimensions, AllocatorT>>&
                backendObject,
            const context& targetContext);

template <backend Backend, int Dimensions = 1,
          typename AllocatorT = sycl::image_allocator>
sampled_image<Dimensions, AllocatorT> make_sampled_image(
    const backend_input_t<Backend, sampled_image<Dimensions, AllocatorT>>&
        backendObject,
    const context& targetContext, image_sampler imageSampler,
    event availableEvent);

template <backend Backend, int Dimensions = 1,
          typename AllocatorT = sycl::image_allocator>
sampled_image<Dimensions, AllocatorT> make_sampled_image(
    const backend_input_t<Backend, sampled_image<Dimensions, AllocatorT>>&
        backendObject,
    const context& targetContext, image_sampler imageSampler);

template <backend Backend, int Dimensions = 1,
          typename AllocatorT = sycl::image_allocator>
unsampled_image<Dimensions, AllocatorT> make_unsampled_image(
    const backend_input_t<Backend, unsampled_image<Dimensions, AllocatorT>>&
        backendObject,
    const context& targetContext, event availableEvent);

template <backend Backend, int Dimensions = 1,
          typename AllocatorT = sycl::image_allocator>
unsampled_image<Dimensions, AllocatorT> make_unsampled_image(
    const backend_input_t<Backend, unsampled_image<Dimensions, AllocatorT>>&
        backendObject,
    const context& targetContext);

template <backend Backend, bundle_state State>
kernel_bundle<State> make_kernel_bundle(
    const backend_input_t<Backend, kernel_bundle<State>>& backendObject,
    const context& targetContext);

template <backend Backend>
kernel make_kernel(const backend_input_t<Backend, kernel>& backendObject,
                   const context& targetContext);

} // namespace sycl

For each SYCL runtime class T which supports SYCL application interoperability, a specialization of the appropriate template function make_{sycl_class} where {sycl_class} is the class name of T, must be defined, which takes a SYCL application interoperability native backend object and constructs and returns an instance of T. The availability and behavior of these template functions is defined by the SYCL backend specification document.

Overloads of the make_{sycl_class} function which take a SYCL context object as an argument must throw an exception with the errc::backend_mismatch error code if the backend of the provided SYCL context doesn’t match the target backend.

4.5.2. Common reference semantics

Each of the following SYCL runtime classes: accessor, buffer, context, device, device_image, event, host_accessor, host_sampled_image_accessor, host_unsampled_image_accessor, kernel, kernel_id, kernel_bundle, local_accessor, platform, queue, sampled_image, sampled_image_accessor, stream, unsampled_image and unsampled_image_accessor must obey the following statements, where T is the runtime class type:

  • T must be copy constructible and copy assignable on the host application and within SYCL kernel functions in the case that T is a valid kernel argument. Any instance of T that is constructed as a copy of another instance, via either the copy constructor or copy assignment operator, must behave as-if it were the original instance and as-if any action performed on it were also performed on the original instance and must represent the same underlying native backend object as the original instance where applicable.

  • T must be destructible on the host application and within SYCL kernel functions in the case that T is a valid kernel argument. When any instance of T is destroyed, including as a result of the copy assignment operator, any behavior specific to T that is specified as performed on destruction is only performed if this instance is the last remaining host copy, in accordance with the above definition of a copy.

  • T must be move constructible and move assignable on the host application and within SYCL kernel functions in the case that T is a valid kernel argument. Any instance of T that is constructed as a move of another instance, via either the move constructor or move assignment operator, must replace the original instance rendering said instance invalid and must represent the same underlying native backend object as the original instance where applicable.

  • T must be equality comparable on the host application. Equality between two instances of T (i.e. a == b) must be true if one instance is a copy of the other and non-equality between two instances of T (i.e. a != b) must be true if neither instance is a copy of the other, in accordance with the above definition of a copy, unless either instance has become invalidated by a move operation. By extension of the requirements above, equality on T must guarantee to be reflexive (i.e. a == a), symmetric (i.e. a == b implies b == a and a != b implies b != a) and transitive (i.e. a == b && b == c implies c == a).

  • A specialization of std::hash for T must exist on the host application that returns a unique value such that if two instances of T are equal, in accordance with the above definition, then their resulting hash values are also equal and subsequently if two hash values are not equal, then their corresponding instances are also not equal, in accordance with the above definition.

Some SYCL runtime classes will have additional behavior associated with copy, movement, assignment or destruction semantics. If these are specified they are in addition to those specified above unless stated otherwise.

Each of the runtime classes mentioned above must provide a common interface of special member functions in order to fulfill the copy, move, destruction requirements and hidden friend functions in order to fulfill the equality requirements.

A hidden friend function is a function first declared via a friend declaration with no additional out of class or namespace scope declarations. Hidden friend functions are only visible to ADL (Argument Dependent Lookup) and are hidden from qualified and unqualified lookup. Hidden friend functions have the benefits of avoiding accidental implicit conversions and faster compilation.

These common special member functions and hidden friend functions are described in Table 7 and Table 8 respectively.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
namespace sycl {

class T {
  ...

      public : T(const T& rhs);

  T(T&& rhs);

  T& operator=(const T& rhs);

  T& operator=(T&& rhs);

  ~T();

  ...

      friend bool
      operator==(const T& lhs, const T& rhs) { /* ... */
  }

  friend bool operator!=(const T& lhs, const T& rhs) { /* ... */ }

  ...
};
} // namespace sycl
Table 7. Common special member functions for reference semantics
Special member function Description
T(const T& rhs)

Constructs a T instance as a copy of the RHS SYCL T in accordance with the requirements set out above.

T(T&& rhs)

Constructs a SYCL T instance as a move of the RHS SYCL T in accordance with the requirements set out above.

T& operator=(const T& rhs)

Assigns this SYCL T instance with a copy of the RHS SYCL T in accordance with the requirements set out above.

T& operator=(T&& rhs)

Assigns this SYCL T instance with a move of the RHS SYCL T in accordance with the requirements set out above.

~T()

Destroys this SYCL T instance in accordance with the requirements set out in Section 4.5.2. On destruction of the last copy, may perform additional lifetime related operations required for the underlying native backend object specified in the SYCL backend specification document, if this SYCL T instance was originally constructed using one of the backend interoperability make_* functions specified in Section 4.5.1.3. See the relevant backend specification for details.

Table 8. Common hidden friend functions for reference semantics
Hidden friend function Description
bool operator==(const T& lhs, const T& rhs)

Returns true if this LHS SYCL T is equal to the RHS SYCL T in accordance with the requirements set out above, otherwise returns false.

bool operator!=(const T& lhs, const T& rhs)

Returns true if this LHS SYCL T is not equal to the RHS SYCL T in accordance with the requirements set out above, otherwise returns false.

4.5.3. Common by-value semantics

Each of the following SYCL runtime classes: id, range, item, nd_item, h_item, group, sub_group and nd_range must follow the following statements, where T is the runtime class type:

  • T must be default copy constructible and copy assignable on the host application (in the case where T is available on the host) and within SYCL kernel functions.

  • T must be default destructible on the host application (in the case where T is available on the host) and within SYCL kernel functions.

  • T must be default move constructible and default move assignable on the host application (in the case where T is available on the host) and within SYCL kernel functions.

  • T must be equality comparable on the host application (in the case where T is available on the host) and within SYCL kernel functions. Equality between two instances of T (i.e. a == b) must be true if the value of all members are equal and non-equality between two instances of T (i.e. a != b) must be true if the value of any members are not equal, unless either instance has become invalidated by a move operation. By extension of the requirements above, equality on T must guarantee to be reflexive (i.e. a == a), symmetric (i.e. a == b implies b == a and a != b implies b != a) and transitive (i.e. a == b && b == c implies c == a).

Some SYCL runtime classes will have additional behavior associated with copy, movement, assignment or destruction semantics. If these are specified they are in addition to those specified above unless stated otherwise.

Each of the runtime classes mentioned above must provide a common interface of special member functions and member functions in order to fulfill the copy, move, destruction and equality requirements, following the rule of five and the rule of zero.

These common special member functions and hidden friend functions are described in Table 9 and Table 10 respectively.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
namespace sycl {

class T {
  ...

      public
      :
      // If any of the following five special member functions are not
      // public, inline or defaulted, then all five of them should be
      // explicitly declared (see rule of five).
      // Otherwise, none of them should be explicitly declared
      // (see rule of zero).

      // T(const T &rhs);

      // T(T &&rhs);

      // T &operator=(const T &rhs);

      // T &operator=(T &&rhs);

      // ~T();

      ...

      friend bool
      operator==(const T& lhs, const T& rhs) { /* ... */
  }

  friend bool operator!=(const T& lhs, const T& rhs) { /* ... */ }

  ...
};
} // namespace sycl
Table 9. Common special member functions for by-value semantics
Special member function (see rule of five and rule of zero) Description
T(const T& rhs);

Copy constructor.

T(T&& rhs);

Move constructor.

T& operator=(const T& rhs);

Copy assignment operator.

T& operator=(T&& rhs);

Move assignment operator.

~T();

Destructor.

Table 10. Common hidden friend functions for by-value semantics
Hidden friend function Description
bool operator==(const T& lhs, const T& rhs)

Returns true if this LHS SYCL T is equal to the RHS SYCL T in accordance with the requirements set out above, otherwise returns false.

bool operator!=(const T& lhs, const T& rhs)

Returns true if this LHS SYCL T is not equal to the RHS SYCL T in accordance with the requirements set out above, otherwise returns false.

4.5.4. Properties

Each of the following SYCL runtime classes: accessor, buffer, host_accessor, host_sampled_image_accessor, host_unsampled_image_accessor, context, local_accessor, queue, sampled_image, sampled_image_accessor, stream, unsampled_image, unsampled_image_accessor and usm_allocator provide an optional parameter in each of their constructors to provide a property_list which contains zero or more properties. Each of those properties augments the semantics of the class with a particular feature. Each of those classes must also provide has_property and get_property member functions for querying for a particular property.

The listing below illustrates the usage of various buffer properties, described in Section 4.7.2.2.

The example illustrates how using properties does not affect the type of the object, thus, does not prevent the usage of SYCL objects in containers.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
  context myContext;

  std::vector<buffer<int, 1>> bufferList {
    buffer<int, 1> { ptr, rng },
    buffer<int, 1> { ptr, rng, property::use_host_ptr {} },
    buffer<int, 1> { ptr, rng, property::context_bound { myContext } }
  };

  for (auto& buf : bufferList) {
    if (buf.has_property<property::context_bound>()) {
      auto prop = buf.get_property<property::context_bound>();
      assert(myContext == prop.get_context());
    }
  }
}

Each property is represented by a unique class and an instance of a property is an instance of that type. Some properties can be default constructed while others will require an argument on construction. A property may be applicable to more than one class, however some properties may not be compatible with each other. See the requirements for the properties of the SYCL buffer class, SYCL unsampled_image class and SYCL sampled_image class in Table 41 and Table 48 respectively.

Properties can be passed to a SYCL runtime class via an instance of property_list. These properties get tied to the SYCL runtime class instance and copies of the object will contain the same properties.

A SYCL implementation or a SYCL backend may provide additional properties other than those defined here, provided they are defined in accordance with the requirements described in Section 4.3.

4.5.4.1. Properties interface

Each of the runtime classes mentioned above must provide a common interface of member functions in order to fulfill the property interface requirements.

A synopsis of the common properties interface, the SYCL property_list class and the SYCL property classes is provided below. The member functions of the common properties interface are listed in Table 12. The constructors of the SYCL property_list class are listed in Table 13.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
namespace sycl {

template <typename Property> struct is_property;

template <typename Property>
inline constexpr bool is_property_v = is_property<Property>::value;

template <typename Property, typename SyclObject> struct is_property_of;

template <typename Property, typename SyclObject>
inline constexpr bool is_property_of_v =
    is_property_of<Property, SyclObject>::value;

class T {
  ...

      template <typename Property>
      bool has_property() const noexcept;

  template <typename Property> Property get_property() const;

  ...
};

class property_list {
 public:
  template <typename... Properties> property_list(Properties... props);
};
} // namespace sycl
Table 11. Traits for properties
Traits Description
template <typename Property> struct is_property

An explicit specialization of is_property that inherits from std::true_type must be provided for each property, where Property is the class defining the property. This includes both standard properties described in this specification and any additional non-standard properties defined by an implementation. All other specializations of is_property must inherit from std::false_type.

template <typename Property>
inline constexpr bool is_property_v;

Variable containing value of is_property<Property>.

template <typename Property, SyclObject> struct is_property_of

An explicit specialization of is_property_of that inherits from std::true_type must be provided for each property that can be used in constructing a given SYCL class, where Property is the class defining the property and SyclObject is the SYCL class. This includes both standard properties described in this specification and any additional non-standard properties defined by an implementation. All other specializations of is_property_of must inherit from std::false_type.

template <typename Property, SyclObject>
inline constexpr bool is_property_of_v;

Variable containing value of is_property_of<Property, SyclObject>.

Table 12. Common member functions of the SYCL property interface
Member function Description
template <typename Property> bool has_property() const noexcept

Returns true if T was constructed with the property specified by Property. Returns false if it was not.

template <typename Property> Property get_property() const

Returns a copy of the property of type Property that T was constructed with. Must throw an exception with the errc::invalid error code if T was not constructed with the Property property.

Table 13. Constructors of the SYCL property_list class
Constructor Description
template <typename... PropertyN> property_list(PropertyN... props)

Available only when: is_property<property>::value evaluates to true where property is each property in PropertyN.

Construct a SYCL property_list with zero or more properties.

4.6. SYCL runtime classes

4.6.1. Device selection

Since a system can have several SYCL-compatible devices attached, it is useful to have a way to select a specific device or a set of devices to construct a specific object such as a device (see Section 4.6.4) or a queue (see Section 4.6.5), or perform some operations on a device subset.

Device selection is done either by already having a specific instance of a device (see Section 4.6.4) or by providing a device selector which is a ranking function that will give an integer ranking value to all the devices on the system.

4.6.1.1. Device selector

The interface for a device selector is any object that meets the C++ named requirement Callable, taking a parameter of type const device & and returning a value that is implicitly convertible to int.

At any point where the SYCL runtime needs to select a SYCL device using a device selector, the system queries all root devices from all SYCL backends in the system, calls the device selector on each device and selects the one which returns the highest score. If the highest value is strictly negative no device is selected.

In places where only one device has to be picked and the high score is obtained by more than one device, then one of the tied devices will be returned, but which one is not defined and may depend on enumeration order, for example, outside the control of the SYCL runtime.

Some predefined device selectors are provided by the system as described on Table 14 in a header file with some definition similar to the following:

Table 14. Standard device selectors included with all SYCL implementations
SYCL device selectors Description
default_selector_v

Select a SYCL device from any supported SYCL backend based on an implementation-defined heuristic. Since all implementations must support at least one device, this selector must always return a device.

Implementations may choose to return an emulated device (with aspect::emulated) as a fallback if there is no physical device available on the system.

gpu_selector_v

Select a SYCL device from any supported SYCL backend for which the device type is info::device_type::gpu. The SYCL class constructor using it must throw an exception with the errc::runtime error code if no device matching this requirement can be found.

accelerator_selector_v

Select a SYCL device from any supported SYCL backend for which the device type is info::device_type::accelerator. The SYCL class constructor using it must throw an exception with the errc::runtime error code if no device matching this requirement can be found.

cpu_selector_v

Select a SYCL device from any supported SYCL backend for which the device type is info::device_type::cpu. The SYCL class constructor using it must throw an exception with the errc::runtime error code if no device matching this requirement can be found.

__unspecified_callable__
aspect_selector(const std::vector<aspect>& aspectList,
                const std::vector<aspect>& denyList = {});

template <typename... AspectList>
__unspecified_callable__ aspect_selector(AspectList... aspectList);

template <aspect... AspectList> __unspecified_callable__ aspect_selector();

The free function aspect_selector has several overloads, each of which returns a selector object that selects a SYCL device from any supported SYCL backend which contains all the requested aspects, i.e. for the specific device dev and each aspect devAspect from aspectList dev.has(devAspect) equals true. If no aspects are passed in, the generated selector behaves like default_selector.

Required aspects can be passed in as a vector, as function arguments, or as template parameters, depending on the function overload. The function overload that takes aspectList as a vector takes another vector argument denyList where the user can specify all the aspects that have to be avoided, i.e. for the specific device dev and each aspect devAspect from denyList dev.has(devAspect) equals false.

The SYCL class constructor using the generated selector must throw an exception with the errc::runtime error code if no device matching this requirement can be found. There are multiple overloads of this function, please refer to [header:device-selector] for full definitions and to [example:aspect-selector] for examples.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
namespace sycl {

// Predefined device selectors
__unspecified__ default_selector_v;
__unspecified__ cpu_selector_v;
__unspecified__ gpu_selector_v;
__unspecified__ accelerator_selector_v;

// Predefined types for compatibility with old SYCL 1.2.1 device selectors
using default_selector = __unspecified__;
using cpu_selector = __unspecified__;
using gpu_selector = __unspecified__;
using accelerator_selector = __unspecified__;

// Returns a selector that selects a device based on desired aspects
__unspecified_callable__
aspect_selector(const std::vector<aspect>& aspectList,
                const std::vector<aspect>& denyList = {});
template <class... AspectList>
__unspecified_callable__ aspect_selector(AspectList... aspectList);
template <aspect... AspectList> __unspecified_callable__ aspect_selector();

} // namespace sycl

Typical examples of default and user-provided device selectors could be:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
sycl::device my_gpu { sycl::gpu_selector_v };

sycl::queue my_accelerator { sycl::accelerator_selector_v };

int prefer_my_vendor(const sycl::device& d) {
  // Return 1 if the vendor name is "MyVendor" or 0 else.
  // 0 does not prevent another device to be picked as a second choice
  return d.get_info<info::device::vendor>() == "MyVendor";
}

// Get the preferred device or another one if not available
sycl::device preferred_device { prefer_my_vendor };

// This throws if there is no such device in the system
sycl::queue half_precision_controller {
  // Can use a lambda as a device ranking function.
  // Returns a negative number to fail in the case there is no such device
  [] (auto& d) { return d.has(sycl::aspect::fp16) ? 1 : -1; }
};

// To ease porting SYCL 1.2.1 code, there are types whose
// construction leads to the equivalent predefined device selector
sycl::queue my_old_style_gpu { sycl::gpu_selector {} };

Examples of using aspect_selector:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
using namespace sycl; // (optional) avoids need for "sycl::" before SYCL names

// Unrestrained selection, equivalent to default_selector
auto dev0 = device{aspect_selector()};

// Pass aspects in a vector
// Only accept CPUs that support half
auto dev1 = device{aspect_selector(std::vector{aspect::cpu, aspect::fp16})};

// Pass aspects without a vector
// Only accept GPUs that support half
auto dev2 = device{aspect_selector(aspect::gpu, aspect::fp16)};

// Pass aspects as compile-time parameters
// Only accept devices that can be debugged on host and support half
auto dev3 = device{aspect_selector<aspect::host_debuggable, aspect::fp16>()};

// Pass aspects in an allowlist and a denylist
// Only accept devices that support half and double floating point precision,
// but exclude emulated devices and devices of type "custom"
auto dev4 = device{aspect_selector(
   std::vector{aspect::fp16, aspect::fp64},
   std::vector{aspect::emulated, aspect::custom}
)};

In SYCL 1.2.1 the predefined device selectors were actually types that had to be instantiated to be used. Now they are just instances. To simplify porting code using the old type instantiations, a backward-compatible API is still provided, such as sycl::default_selector. The new predefined device selectors have their new names appended with "_v" to avoid conflicts, thus following the naming style used by traits in the C++ standard library. There is no requirement for the implementation to have for example sycl::gpu_selector_v being an instance of sycl::gpu_selector.

Implementation note: the SYCL API might rely on SFINAE or C++20 concepts to resolve some ambiguity in constructors with default parameters.

4.6.2. Platform class

The SYCL platform class encapsulates a single SYCL platform on which SYCL kernel functions may be executed. A SYCL platform must be associated with a single SYCL backend.

A SYCL platform is also associated with one or more SYCL devices associated with the same SYCL backend.

All member functions of the platform class are synchronous and errors are handled by throwing synchronous SYCL exceptions.

The execution environment for a SYCL application has a fixed number of platforms which does not vary as the application executes. The application can get a list of all these platforms via platform::get_platforms(), and the order of the platform objects is the same each time the application calls that function. The platform class also provides constructors, but constructing a new platform instance merely creates a new object that is a copy of one of the objects returned by platform::get_platforms().

The SYCL platform class provides the common reference semantics (see Section 4.5.2).

4.6.2.1. Platform interface

A synopsis of the SYCL platform class is provided below. The constructors, member functions and static member functions of the SYCL platform class are listed in Table 15, Table 16 and Table 17 respectively. The additional common special member functions and common member functions are listed in Section 4.5.2 in Table 7 and Table 8 respectively.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
namespace sycl {
class platform {
 public:
  platform();

  template <typename DeviceSelector>
  explicit platform(const DeviceSelector& deviceSelector);

  /* -- common interface members -- */

  backend get_backend() const noexcept;

  std::vector<device>
      get_devices(info::device_type = info::device_type::all) const;

  template <typename Param> typename Param::return_type get_info() const;

  template <typename Param>
  typename Param::return_type get_backend_info() const;

  bool has(aspect asp) const;

  bool has_extension(const std::string& extension) const; // Deprecated

  static std::vector<platform> get_platforms();
};
} // namespace sycl
Table 15. Constructors of the SYCL platform class
Constructor Description
platform()

Constructs a SYCL platform instance that is a copy of the platform which contains the device returned by default_selector_v.

template <typename DeviceSelector> explicit platform(const DeviceSelector&)

Constructs a SYCL platform instance that is a copy of the platform which contains the device returned by the device selector parameter.

Table 16. Member functions of the SYCL platform class
Member function Description
backend get_backend() const noexcept

Returns a backend identifying the SYCL backend associated with this platform.

template <typename Param> typename Param::return_type get_info() const

Queries this SYCL platform for information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the info parameters in Table 18 to facilitate returning the type associated with the Param parameter.

template <typename Param> typename Param::return_type get_backend_info() const

Queries this SYCL platform for SYCL backend-specific information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the SYCL backend specification. Must throw an exception with the errc::backend_mismatch error code if the SYCL backend that corresponds with Param is different from the SYCL backend that is associated with this platform.

bool has(aspect asp) const

Returns true if all of the SYCL devices associated with this SYCL platform have the given aspect.

bool has_extension(const std::string& extension) const

Deprecated, use has() instead.

Returns true if this SYCL platform supports the extension queried by the extension parameter. A SYCL platform can only support an extension if all associated SYCL devices support that extension.

std::vector<device>
get_devices(info::device_type deviceType = info::device_type::all) const

Returns a std::vector containing all the root devices associated with this SYCL platform which have the device type encapsulated by deviceType.

Table 17. Static member functions of the SYCL platform class
Static member function Description
static std::vector<platform> get_platforms()

Returns a std::vector containing all SYCL platforms from all SYCL backends available in the system.

4.6.2.2. Platform information descriptors

A platform can be queried for information using the get_info member function of the platform class, specifying one of the info parameters in info::platform. The possible values for each info parameter and any restrictions are defined in the specification of the SYCL backend associated with the platform. All info parameters in info::platform are specified in Table 18 and the synopsis for info::platform is described in Section A.1.

Table 18. Platform information descriptors
Platform descriptors Return type Description
info::platform::version

std::string

Returns a backend-defined platform version.

info::platform::name

std::string

Returns the name of the platform.

info::platform::vendor

std::string

Returns the name of the vendor providing the platform.

info::platform::extensions

std::vector<std::string>

Deprecated, use device::get_info() with info::device::aspects instead.

Returns the extensions supported by the platform.

4.6.3. Context class

The context class represents a SYCL context. A context represents the runtime data structures and state required by a SYCL backend API to interact with a group of devices associated with a platform.

The SYCL context class provides the common reference semantics (see Section 4.5.2).

4.6.3.1. Context interface

The constructors and member functions of the SYCL context class are listed in Table 19 and Table 20, respectively. The additional common special member functions and common member functions are listed in Section 4.5.2 in Table 7 and Table 8, respectively.

All member functions of the context class are synchronous and errors are handled by throwing synchronous SYCL exceptions.

All constructors of the SYCL context class will construct an instance associated with a particular SYCL backend, determined by the constructor parameters or, in the case of the default constructor, the SYCL device produced by the default_selector_v.

A SYCL context can optionally be constructed with an async_handler parameter. In this case the async_handler is used to report asynchronous SYCL exceptions, as described in Section 4.13.

Information about a SYCL context may be queried through the get_info() member function.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
namespace sycl {
class context {
 public:
  explicit context(const property_list& propList = {});

  explicit context(async_handler asyncHandler,
                   const property_list& propList = {});

  explicit context(const device& dev, const property_list& propList = {});

  explicit context(const device& dev, async_handler asyncHandler,
                   const property_list& propList = {});

  explicit context(const std::vector<device>& deviceList,
                   const property_list& propList = {});

  explicit context(const std::vector<device>& deviceList,
                   async_handler asyncHandler,
                   const property_list& propList = {});

  /* -- property interface members -- */

  /* -- common interface members -- */

  backend get_backend() const noexcept;

  platform get_platform() const;

  std::vector<device> get_devices() const;

  template <typename Param> typename Param::return_type get_info() const;

  template <typename Param>
  typename Param::return_type get_backend_info() const;
};
} // namespace sycl
Table 19. Constructors of the SYCL context class
Constructor Description
explicit context(async_handler asyncHandler = {})

Constructs a SYCL context instance using an instance of default_selector_v to select the associated SYCL platform and device(s). The devices that are associated with the constructed context are implementation-defined but must contain the device chosen by the device selector. The constructed SYCL context will use the asyncHandler parameter to handle exceptions.

explicit context(const device& dev, async_handler asyncHandler = {})

Constructs a SYCL context instance using the dev parameter as the associated SYCL device and the SYCL platform associated with the dev parameter as the associated SYCL platform. The constructed SYCL context will use the asyncHandler parameter to handle exceptions.

explicit context(const std::vector<device>& deviceList,
                 async_handler asyncHandler = {})

Constructs a SYCL context instance using the SYCL device(s) in the deviceList parameter as the associated SYCL device(s) and the SYCL platform associated with each SYCL device in the deviceList parameter as the associated SYCL platform. This requires that all SYCL devices in the deviceList parameter have the same associated SYCL platform. The constructed SYCL context will use the asyncHandler parameter to handle exceptions.

Table 20. Member functions of the context class
Member function Description
backend get_backend() const noexcept

Returns a backend identifying the SYCL backend associated with this context.

template <typename Param> typename Param::return_type get_info() const

Queries this SYCL context for information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the info parameters in Table 21 to facilitate returning the type associated with the Param parameter.

template <typename Param> typename Param::return_type get_backend_info() const

Queries this SYCL context for SYCL backend-specific information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the SYCL backend specification. Must throw an exception with the errc::backend_mismatch error code if the SYCL backend that corresponds with Param is different from the SYCL backend that is associated with this context.

platform get_platform() const

Returns the SYCL platform that is associated with this SYCL context. The value returned must be equal to that returned by get_info<info::context::platform>().

std::vector<device> get_devices() const

Returns a std::vector containing all SYCL devices that are associated with this SYCL context. The value returned must be equal to that returned by get_info<info::context::devices>().

4.6.3.2. Context information descriptors

A context can be queried for information using the get_info member function of the context class, specifying one of the info parameters in info::context. The possible values for each info parameter and any restrictions are defined in the specification of the SYCL backend associated with the context. All info parameters in info::context are specified in Table 21 and the synopsis for info::context is described in Section A.2.

Table 21. Context information descriptors
Context Descriptors Return type Description
info::context::platform

platform

Returns the platform associated with the context.

info::context::devices

std::vector<device>

Returns all of the devices associated with the context.

info::context::atomic_memory_order_capabilities

std::vector<memory_order>

This query applies only to the capabilities of atomic operations that are applied to memory that can be concurrently accessed by multiple devices in the context. If these capabilities are not uniform across all devices in the context, the query reports only the capabilities that are common for all devices.

Returns the set of memory orders supported by these atomic operations. When a context returns a "stronger" memory order in this set, it must also return all "weaker" memory orders. (See Section 3.8.3.1 for a definition of "stronger" and "weaker" memory orders.) The memory orders memory_order::acquire, memory_order::release, and memory_order::acq_rel are all the same strength. If a context returns one of these, it must return them all.

At a minimum, each context must support memory_order::relaxed.

info::context::atomic_fence_order_capabilities

std::vector<memory_order>

This query applies only to the capabilities of atomic_fence when applied to memory that can be concurrently accessed by multiple devices in the context. If these capabilities are not uniform across all devices in the context, the query reports only the capabilities that are common for all devices.

Returns the set of memory orders supported by these atomic_fence operations. When a context returns a "stronger" memory order in this set, it must also return all "weaker" memory orders. (See Section 3.8.3.1 for a definition of "stronger" and "weaker" memory orders.)

At a minimum, each context must support memory_order::relaxed, memory_order::acquire, memory_order::release, and memory_order::acq_rel.

info::context::atomic_memory_scope_capabilities

std::vector<memory_scope>

Returns the set of memory scopes supported by atomic operations on all devices in the context. When a context returns a "wider" memory scope in this set, it must also return all "narrower" memory scopes. (See Section 3.8.3.2 for a definition of "wider" and "narrower" scopes.) At a minimum, each context must support memory_scope::work_item, memory_scope::sub_group, and memory_scope::work_group.

info::context::atomic_fence_scope_capabilities

std::vector<memory_scope>

Returns the set of memory orderings supported by atomic_fence on all devices in the context. When a context returns a "wider" memory scope in this set, it must also return all "narrower" memory scopes. (See Section 3.8.3.2 for a definition of "wider" and "narrower" scopes.) At a minimum, each context must support memory_scope::work_item, memory_scope::sub_group, and memory_scope::work_group.

4.6.3.3. Context properties

The property_list constructor parameters are present for extensibility.

4.6.4. Device class

The SYCL device class encapsulates a single SYCL device on which kernels can be executed.

All member functions of the device class are synchronous and errors are handled by throwing synchronous SYCL exceptions.

The execution environment for a SYCL application has a fixed number of root devices which does not vary as the application executes. The application can get a list of all these devices via device::get_devices(), and the order of the device objects is the same each time the application calls that function (assuming the parameter to that function is the same for each call). The device class also provides constructors, but constructing a new device instance merely creates a new object that is a copy of one of the objects returned by device::get_devices().

A SYCL device can be partitioned into multiple SYCL devices, by calling the create_sub_devices() member function template. The resulting SYCL devices are considered sub devices, and it is valid to partition these sub devices further. The range of support for this feature is SYCL backend and device specific and can be queried for through get_info().

The SYCL device class provides the common reference semantics (see Section 4.5.2).

4.6.4.1. Device interface

A synopsis of the SYCL device class is provided below. The constructors, member functions and static member functions of the SYCL device class are listed in Table 22, Table 23 and Table 24 respectively. The additional common special member functions and common member functions are listed in Section 4.5.2 in Table 7 and Table 8, respectively.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
namespace sycl {

class device {
 public:
  device();

  template <typename DeviceSelector>
  explicit device(const DeviceSelector& deviceSelector);

  /* -- common interface members -- */

  backend get_backend() const noexcept;

  bool is_cpu() const;

  bool is_gpu() const;

  bool is_accelerator() const;

  platform get_platform() const;

  template <typename Param> typename Param::return_type get_info() const;

  template <typename Param>
  typename Param::return_type get_backend_info() const;

  bool has(aspect asp) const;

  bool has_extension(const std::string& extension) const; // Deprecated

  // Available only when Prop == info::partition_property::partition_equally
  template <info::partition_property Prop>
  std::vector<device> create_sub_devices(size_t count) const;

  // Available only when Prop == info::partition_property::partition_by_counts
  template <info::partition_property Prop>
  std::vector<device>
  create_sub_devices(const std::vector<size_t>& counts) const;

  // Available only when Prop ==
  // info::partition_property::partition_by_affinity_domain
  template <info::partition_property Prop>
  std::vector<device>
  create_sub_devices(info::partition_affinity_domain affinityDomain) const;

  static std::vector<device>
  get_devices(info::device_type deviceType = info::device_type::all);
};
} // namespace sycl
Table 22. Constructors of the SYCL device class
Constructor Description
device()

Constructs a SYCL device instance that is a copy of the device returned by default_selector_v.

template <typename DeviceSelector> explicit device(const DeviceSelector&)

Constructs a SYCL device instance that is a copy of the device returned by the device selector parameter.

Table 23. Member functions of the SYCL device class
Member function Description
backend get_backend() const noexcept

Returns a backend identifying the SYCL backend associated with this device.

platform get_platform() const

Returns the associated SYCL platform. The value returned must be equal to that returned by get_info<info::device::platform>().

bool is_cpu() const

Returns the same value as has(aspect::cpu). See Table 26.

bool is_gpu() const

Returns the same value as has(aspect::gpu). See Table 26.

bool is_accelerator() const

Returns the same value as has(aspect::accelerator). See Table 26.

template <typename Param> typename Param::return_type get_info() const

Queries this SYCL device for information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the info parameters in Table 25 to facilitate returning the type associated with the Param parameter.

template <typename Param> typename Param::return_type get_backend_info() const

Queries this SYCL device for SYCL backend-specific information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the SYCL backend specification. Must throw an exception with the errc::backend_mismatch error code if the SYCL backend that corresponds with Param is different from the SYCL backend that is associated with this device.

bool has(aspect asp) const

Returns true if this SYCL device has the given aspect. SYCL applications can use this member function to determine which optional features this device supports (if any).

bool has_extension(const std::string& extension) const

Deprecated, use has() instead.

Returns true if this SYCL device supports the extension queried by the extension parameter.

template <info::partition_property Prop>
std::vector<device> create_sub_devices(size_t count) const

Available only when Prop is info::partition_property::partition_equally. Returns a std::vector of sub devices partitioned from this SYCL device based on the count parameter. The returned vector contains as many sub devices as can be created such that each sub device contains count compute units. If the device’s total number of compute units (as returned by info::device::max_compute_units) is not evenly divided by count, then the remaining compute units are not included in any of the sub devices.

If this SYCL device does not support info::partition_property::partition_equally an exception with the errc::feature_not_supported error code must be thrown. If count exceeds the total number of compute units in the device, an exception with the errc::invalid error code must be thrown.

template <info::partition_property Prop>
std::vector<device> create_sub_devices(const std::vector<size_t>& counts) const

Available only when Prop is info::partition_property::partition_by_counts. Returns a std::vector of sub devices partitioned from this SYCL device based on the counts parameter. For each non-zero value M in the counts vector, a sub device with M compute units is created.

If the SYCL device does not support info::partition_property::partition_by_counts an exception with the errc::feature_not_supported error code must be thrown. If the number of non-zero values in counts exceeds the device’s maximum number of sub devices (as returned by info::device::partition_max_sub_devices) or if the total of all the values in the counts vector exceeds the total number of compute units in the device (as returned by info::device::max_compute_units), an exception with the errc::invalid error code must be thrown.

template <info::partition_property Prop>
std::vector<device>
create_sub_devices(info::partition_affinity_domain domain) const

Available only when Prop is info::partition_property::partition_by_affinity_domain. Returns a std::vector of sub devices partitioned from this SYCL device by affinity domain based on the domain parameter, which must be one of the following values:

  • info::partition_affinity_domain::numa: Split the device into sub devices comprised of compute units that share a NUMA node.

  • info::partition_affinity_domain::L4_cache: Split the device into sub devices comprised of compute units that share a level 4 data cache.

  • info::partition_affinity_domain::L3_cache: Split the device into sub devices comprised of compute units that share a level 3 data cache.

  • info::partition_affinity_domain::L2_cache: Split the device into sub devices comprised of compute units that share a level 2 data cache.

  • info::partition_affinity_domain::L1_cache: Split the device into sub devices comprised of compute units that share a level 1 data cache.

  • info::partition_affinity_domain::next_partitionable: Split the device along the next partitionable affinity domain. The implementation shall find the first level along which the device or sub device may be further subdivided in the order numa, L4_cache, L3_cache, L2_cache, L1_cache, and partition the device into sub devices comprised of compute units that share memory subsystems at this level. The user may determine what happened via info::device::partition_type_affinity_domain.

If the SYCL device does not support info::partition_property::partition_by_affinity_domain or the SYCL device does not support the info::partition_affinity_domain provided, an exception with the errc::feature_not_supported error code must be thrown.

Table 24. Static member functions of the SYCL device class
Static member function Description
static std::vector<device>
get_devices(info::device_type deviceType = info::device_type::all)

Returns a std::vector containing all the root devices from all SYCL backends available in the system which have the device type encapsulated by deviceType.

4.6.4.2. Device information descriptors

A device can be queried for information using the get_info member function of the device class, specifying one of the info parameters in info::device. The possible values for each info parameter and any restriction are defined in the specification of the SYCL backend associated with the device. All info parameters in info::device are specified in Table 25 and the synopsis for info::device is described in Section A.3.

Table 25. Device information descriptors
Device descriptors Return type Description
info::device::device_type

info::device_type

Returns the device type associated with the device. May not return info::device_type::all.

info::device::vendor_id

uint32_t

Returns a unique vendor device identifier.

info::device::max_compute_units

uint32_t

Returns the number of parallel compute units available to the device. The minimum value is 1.

info::device::max_work_item_dimensions

uint32_t

Returns the maximum dimensions that specify the global and local work-item IDs used by the data parallel execution model. The minimum value is 3 if this SYCL device is not of device type info::device_type::custom.

info::device::max_work_item_sizes<1>

range<1>

Returns the maximum number of work-items that are permitted in a work-group for a kernel running in a one-dimensional index space. The minimum value is for devices that are not of device type info::device_type::custom.

info::device::max_work_item_sizes<2>

range<2>

Returns the maximum number of work-items that are permitted in each dimension of a work-group for a kernel running in a two-dimensional index space. The minimum value is for devices that are not of device type info::device_type::custom.

info::device::max_work_item_sizes<3>

range<3>

Returns the maximum number of work-items that are permitted in each dimension of a work-group for a kernel running in a three-dimensional index space. The minimum value is for devices that are not of device type info::device_type::custom.

info::device::max_work_group_size

size_t

Returns the maximum number of work-items that are permitted in a work-group executing a kernel on a single compute unit. The minimum value is 1.

info::device::max_num_sub_groups

uint32_t

Returns the maximum number of sub-groups in a work-group for any kernel executed on the device. The minimum value is 1.

info::device::sub_group_sizes

std::vector<size_t>

Returns a std::vector of size_t containing the set of sub-group sizes supported by the device.

info::device::preferred_vector_width_char
info::device::preferred_vector_width_short
info::device::preferred_vector_width_int
info::device::preferred_vector_width_long
info::device::preferred_vector_width_float
info::device::preferred_vector_width_double
info::device::preferred_vector_width_half

uint32_t

Returns the preferred native vector width size for built-in scalar types that can be put into vectors. The vector width is defined as the number of scalar elements that can be stored in the vector. Must return 0 for info::device::preferred_vector_width_double if the device does not have aspect::fp64 and must return 0 for info::device::preferred_vector_width_half if the device does not have aspect::fp16.

info::device::native_vector_width_char
info::device::native_vector_width_short
info::device::native_vector_width_int
info::device::native_vector_width_long
info::device::native_vector_width_float
info::device::native_vector_width_double
info::device::native_vector_width_half

uint32_t

Returns the native ISA vector width. The vector width is defined as the number of scalar elements that can be stored in the vector. Must return 0 for info::device::native_vector_width_double if the device does not have aspect::fp64 and must return 0 for info::device::native_vector_width_half if the device does not have aspect::fp16.

info::device::max_clock_frequency

uint32_t

Returns the maximum configured clock frequency of this SYCL device in MHz.

info::device::address_bits

uint32_t

Returns the default compute device address space size specified as an unsigned integer value in bits. Must return either 32 or 64.

info::device::max_mem_alloc_size

uint64_t

Returns the maximum size of memory object allocation in bytes. The minimum value is max (1/4th of info::device::global_mem_size,128*1024*1024) if this SYCL device is not of device type info::device_type::custom.

info::device::image_support

bool

Deprecated.

Returns the same value as device::has(aspect::image).

info::device::max_read_image_args

uint32_t

Returns the maximum number of simultaneous image objects that can be read from by a kernel. The minimum value is 128 if the SYCL device has aspect::image.

info::device::max_write_image_args

uint32_t

Returns the maximum number of simultaneous image objects that can be written to by a kernel. The minimum value is 8 if the SYCL device has aspect::image.

info::device::image2d_max_width

size_t

Returns the maximum width of a 2D image or 1D image in pixels. The minimum value is 8192 if the SYCL device has aspect::image.

info::device::image2d_max_height

size_t

Returns the maximum height of a 2D image in pixels. The minimum value is 8192 if the SYCL device has aspect::image.

info::device::image3d_max_width

size_t

Returns the maximum width of a 3D image in pixels. The minimum value is 2048 if the SYCL device has aspect::image.

info::device::image3d_max_height

size_t

Returns the maximum height of a 3D image in pixels. The minimum value is 2048 if the SYCL device has aspect::image.

info::device::image3d_max_depth

size_t

Returns the maximum depth of a 3D image in pixels. The minimum value is 2048 if the SYCL device has aspect::image.

info::device::image_max_buffer_size

size_t

Returns the number of pixels for a 1D image created from a buffer object. The minimum value is 65536 if the SYCL device has aspect::image. Note that this information is intended for OpenCL interoperability only as this feature is not supported in SYCL.

info::device::max_samplers

uint32_t

Returns the maximum number of samplers that can be used in a kernel. The minimum value is 16 if the SYCL device has aspect::image.

info::device::max_parameter_size

size_t

Returns the maximum size in bytes of the arguments that can be passed to a kernel. The minimum value is 1024 if this SYCL device is not of device type info::device_type::custom. For this minimum value, only a maximum of 128 arguments can be passed to a kernel.

info::device::mem_base_addr_align

uint32_t

Returns the minimum value in bits of the largest supported SYCL built-in data type if this SYCL device is not of device type info::device_type::custom.

info::device::half_fp_config

std::vector<info::fp_config>

Returns a std::vector of info::fp_config describing the half precision floating-point capability of this SYCL device. The std::vector may contain zero or more of the following values:

  • info::fp_config::denorm: denorms are supported.

  • info::fp_config::inf_nan: INF and quiet NaNs are supported.

  • info::fp_config::round_to_nearest: round to nearest even rounding mode is supported.

  • info::fp_config::round_to_zero: round to zero rounding mode is supported.

  • info::fp_config::round_to_inf: round to positive and negative infinity rounding modes are supported.

  • info::fp_config::fma: IEEE754-2008 fused multiply add is supported.

  • info::fp_config::correctly_rounded_divide_sqrt: divide and sqrt are correctly rounded as defined by the IEEE754 specification. This property is deprecated.

  • info::fp_config::soft_float: basic floating-point operations (such as addition, subtraction, multiplication) are implemented in software.

If half precision is supported by this SYCL device (i.e. the device has aspect::fp16 there is no minimum floating-point capability. If half support is not supported the returned std::vector must be empty.

info::device::single_fp_config

std::vector<info::fp_config>

Returns a std::vector of info::fp_config describing the single precision floating-point capability of this SYCL device. The std::vector must contain one or more of the following values:

  • info::fp_config::denorm: denorms are supported.

  • info::fp_config::inf_nan: INF and quiet NaNs are supported.

  • info::fp_config::round_to_nearest: round to nearest even rounding mode is supported.

  • info::fp_config::round_to_zero: round to zero rounding mode is supported.

  • info::fp_config::round_to_inf: round to positive and negative infinity rounding modes are supported.

  • info::fp_config::fma: IEEE754-2008 fused multiply add is supported.

  • info::fp_config::correctly_rounded_divide_sqrt: divide and sqrt are correctly rounded as defined by the IEEE754 specification. This property is deprecated.

  • info::fp_config::soft_float: basic floating-point operations (such as addition, subtraction, multiplication) are implemented in software.

If this SYCL device is not of type info::device_type::custom then the minimum floating-point capability must be: info::fp_config::round_to_nearest and info::fp_config::inf_nan.

info::device::double_fp_config

std::vector<info::fp_config>

Returns a std::vector of info::fp_config describing the double precision floating-point capability of this SYCL device. The std::vector may contain zero or more of the following values:

  • info::fp_config::denorm: denorms are supported.

  • info::fp_config::inf_nan: INF and NaNs are supported.

  • info::fp_config::round_to_nearest: round to nearest even rounding mode is supported.

  • info::fp_config::round_to_zero: round to zero rounding mode is supported.

  • info::fp_config::round_to_inf: round to positive and negative infinity rounding modes are supported.

  • info::fp_config::fma: IEEE754-2008 fused multiply-add is supported.

  • info::fp_config::soft_float: basic floating-point operations (such as addition, subtraction, multiplication) are implemented in software.

If double precision is supported by this SYCL device (i.e. the device has aspect::fp64 and this SYCL device is not of type info::device_type::custom then the minimum floating-point capability must be: info::fp_config::fma, info::fp_config::round_to_nearest, info::fp_config::round_to_zero, info::fp_config::round_to_inf, info::fp_config::inf_nan and info::fp_config::denorm. If double support is not supported the returned std::vector must be empty.

info::device::global_mem_cache_type

info::global_mem_cache_type

Returns the type of global memory cache supported.

info::device::global_mem_cache_line_size

uint32_t

Returns the size of global memory cache line in bytes.

info::device::global_mem_cache_size

uint64_t

Returns the size of global memory cache in bytes.

info::device::global_mem_size

uint64_t

Returns the size of global device memory in bytes.

info::device::max_constant_buffer_size

uint64_t

Deprecated in SYCL 2020. Returns the maximum size in bytes of a constant buffer allocation. The minimum value is 64 KB if this SYCL device is not of type info::device_type::custom.

info::device::max_constant_args

uint32_t

Deprecated in SYCL 2020. Returns the maximum number of constant arguments that can be declared in a kernel. The minimum value is 8 if this SYCL device is not of type info::device_type::custom.

info::device::local_mem_type

info::local_mem_type

Returns the type of local memory supported. This can be info::local_mem_type::local implying dedicated local memory storage such as SRAM, or info::local_mem_type::global. If this SYCL device is of type info::device_type::custom this can also be info::local_mem_type::none, indicating local memory is not supported.

info::device::local_mem_size

uint64_t

Returns the size of local memory arena in bytes. The minimum value is 32 KB if this SYCL device is not of type info::device_type::custom.

info::device::error_correction_support

bool

Returns true if the device implements error correction for all accesses to compute device memory (global and constant). Returns false if the device does not implement such error correction.

info::device::host_unified_memory

bool

Deprecated, use device::has() with one of the aspect::usm_* aspects instead.

Returns true if the device and the host have a unified memory subsystem and returns false otherwise.

info::device::atomic_memory_order_capabilities

std::vector<memory_order>

Returns the set of memory orders supported by atomic operations on the device. When a device returns a "stronger" memory order in this set, it must also return all "weaker" memory orders. (See Section 3.8.3.1 for a definition of "stronger" and "weaker" memory orders.) The memory orders memory_order::acquire, memory_order::release, and memory_order::acq_rel are all the same strength. If a device returns one of these, it must return them all.

At a minimum, each device must support memory_order::relaxed.

info::device::atomic_fence_order_capabilities

std::vector<memory_order>

Returns the set of memory orders supported by atomic_fence on the device. When a device returns a "stronger" memory order in this set, it must also return all "weaker" memory orders. (See Section 3.8.3.1 for a definition of "stronger" and "weaker" memory orders.) At a minimum, each device must support memory_order::relaxed, memory_order::acquire, memory_order::release, and memory_order::acq_rel.

info::device::atomic_memory_scope_capabilities

std::vector<memory_scope>

Returns the set of memory scopes supported by atomic operations on the device. When a device returns a "wider" memory scope in this set, it must also return all "narrower" memory scopes. (See Section 3.8.3.2 for a definition of "wider" and "narrower" scopes.) At a minimum, each device must support memory_scope::work_item, memory_scope::sub_group, and memory_scope::work_group.

info::device::atomic_fence_scope_capabilities

std::vector<memory_scope>

Returns the set of memory scopes supported by atomic_fence on the device. When a device returns a "wider" memory scope in this set, it must also return all "narrower" memory scopes. (See Section 3.8.3.2 for a definition of "wider" and "narrower" scopes.) At a minimum, each device must support memory_scope::work_item, memory_scope::sub_group, and memory_scope::work_group.

info::device::profiling_timer_resolution

size_t

Returns the resolution of device timer in nanoseconds.

info::device::is_endian_little

bool

Deprecated. Check the byte order of the host system instead. The host and device are required to have the same byte order.

Returns true if this SYCL device is a little endian device and returns false otherwise.

info::device::is_available

bool

Returns true if the SYCL device is available and returns false if the device is not available.

info::device::is_compiler_available

bool

Deprecated.

Returns the same value as device::has(aspect::online_compiler).

info::device::is_linker_available

bool

Deprecated.

Returns the same value as device::has(aspect::online_linker).

info::device::execution_capabilities

std::vector<info::execution_capability>

Returns a std::vector of the info::execution_capability describing the supported execution capabilities. Note that this information is intended for OpenCL interoperability only as SYCL only supports info::execution_capability::exec_kernel.

info::device::queue_profiling

bool

Deprecated.

Returns the same value as device::has(aspect::queue_profiling).

info::device::built_in_kernel_ids

std::vector<kernel_id>

Returns a std::vector of identifiers for the built-in kernels supported by this SYCL device.

info::device::built_in_kernels

std::vector<std::string>

Deprecated. Use info::device::built_in_kernel_ids instead.

Returns a std::vector of built-in OpenCL kernels supported by this SYCL device.

info::device::platform

platform

Returns the SYCL platform associated with this SYCL device.

info::device::name

std::string

Returns the device name of this SYCL device.

info::device::vendor

std::string

Returns the vendor of this SYCL device.

info::device::driver_version

std::string

Returns a vendor-defined string describing the version of the underlying backend software driver.

info::device::profile

std::string

Deprecated in SYCL 2020. Only supported when using the OpenCL backend (see Appendix C). Throws an exception with the errc::invalid error code if used with a device whose backend is not OpenCL.

The value returned can be one of the following strings:

  • FULL_PROFILE - if the device supports the OpenCL specification (functionality defined as part of the core specification and does not require any extensions to be supported).

  • EMBEDDED_PROFILE - if the device supports the OpenCL embedded profile.

info::device::version

std::string

Returns a backend-defined device version.

info::device::backend_version

std::string

Returns a string describing the version of the SYCL backend associated with the device. The possible values are specified in the SYCL backend specification of the SYCL backend associated with the device.

info::device::aspects

std::vector<aspect>

Returns a std::vector of aspect values supported by this SYCL device.

info::device::extensions

std::vector<std::string>

Deprecated, use info::device::aspects instead.

Returns a std::vector of extension names (the extension names do not contain any spaces) supported by this SYCL device. The extension names returned can be vendor supported extension names and one or more of the following Khronos approved extension names:

  • cl_khr_int64_base_atomics

  • cl_khr_int64_extended_atomics

  • cl_khr_3d_image_writes

  • cl_khr_fp16

  • cl_khr_gl_sharing

  • cl_khr_gl_event

  • cl_khr_d3d10_sharing

  • cl_khr_dx9_media_sharing

  • cl_khr_d3d11_sharing

  • cl_khr_depth_images

  • cl_khr_gl_depth_images

  • cl_khr_gl_msaa_sharing

  • cl_khr_image2d_from_buffer

  • cl_khr_initialize_memory

  • cl_khr_context_abort

  • cl_khr_spir

If this SYCL device is an OpenCL device then following approved Khronos extension names must be returned by all device that support OpenCL C 1.2:

  • cl_khr_global_int32_base_atomics

  • cl_khr_global_int32_extended_atomics

  • cl_khr_local_int32_base_atomics

  • cl_khr_local_int32_extended_atomics

  • cl_khr_byte_addressable_store

  • cl_khr_fp64 (for backward compatibility if double precision is supported)

Please refer to the OpenCL 1.2 Extension Specification for a detailed description of these extensions.

info::device::printf_buffer_size

size_t

Deprecated in SYCL 2020.

Returns the maximum size of the internal buffer that holds the output of printf calls from a kernel. The minimum value is 1 MB if info::device::profile returns true for this SYCL device.

info::device::preferred_interop_user_sync

bool

Deprecated in SYCL 2020. Only supported when using the OpenCL backend (see Appendix C). Throws an exception with the errc::invalid error code if used with a device whose backend is not OpenCL.

Returns true if the preference for this SYCL device is for the user to be responsible for synchronization, when sharing memory objects between OpenCL and other APIs such as DirectX, false if the device/implementation has a performant path for performing synchronization of memory object shared between OpenCL and other APIs such as DirectX.

info::device::parent_device

device

Returns the parent SYCL device to which this sub-device is a child if this is a sub-device. Must throw an exception with the errc::invalid error code if this SYCL device is not a sub device.

info::device::partition_max_sub_devices

uint32_t

Returns the maximum number of sub-devices that can be created when this SYCL device is partitioned. The value returned cannot exceed the value returned by info::device::device_max_compute_units.

info::device::partition_properties

std::vector<info::partition_property>

Returns the partition properties supported by this SYCL device; a vector of info::partition_property. An element is returned in this vector only if the device can be partitioned into at least two sub devices along that partition property.

info::device::partition_affinity_domains

std::vector<info::partition_affinity_domain>

Returns a std::vector of the partition affinity domains supported by this SYCL device when partitioning with info::partition_property::partition_by_affinity_domain. An element is returned in this vector only if the device can be partitioned into at least two sub devices along that affinity domain.

info::device::partition_type_property

info::partition_property

Returns the partition property of this SYCL device. If this SYCL device is not a sub device then the return value must be info::partition_property::no_partition, otherwise it must be one of the following values:

  • info::partition_property::partition_equally

  • info::partition_property::partition_by_counts

  • info::partition_property::partition_by_affinity_domain

info::device::partition_type_affinity_domain

info::partition_affinity_domain

Returns the partition affinity domain of this SYCL device. If this SYCL device is not a sub device or the sub device was not partitioned with info::partition_type::partition_by_affinity_domain then the return value must be info::partition_affinity_domain::not_applicable, otherwise it must be one of the following values:

  • info::partition_affinity_domain::numa

  • info::partition_affinity_domain::L4_cache

  • info::partition_affinity_domain::L3_cache

  • info::partition_affinity_domain::L2_cache

  • info::partition_affinity_domain::L1_cache

4.6.4.3. Device aspects

Every SYCL device has an associated set of aspects which identify characteristics of the device. Aspects are defined via the enum class aspect enumeration:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
namespace sycl {

enum class aspect : /* unspecified */ {
  cpu,
  gpu,
  accelerator,
  custom,
  emulated,
  host_debuggable,
  fp16,
  fp64,
  atomic64,
  image,
  online_compiler,
  online_linker,
  queue_profiling,
  usm_device_allocations,
  usm_host_allocations,
  usm_atomic_host_allocations,
  usm_shared_allocations,
  usm_atomic_shared_allocations,
  usm_system_allocations
};

} // namespace sycl

SYCL applications can query the aspects for a device via device::has() in order to determine whether the device supports any optional features. Table 26 lists the aspects that are defined in the core SYCL specification and tells which optional features correspond to each. Backends and extensions may provide additional aspects and additional optional device features. If so, the SYCL backend specification document or the extension document describes them.

Table 26. Device aspects defined by the core SYCL specification
Aspect Description
aspect::cpu

A device that runs on a CPU. Devices with this aspect have device type info::device_type::cpu.

aspect::gpu

A device that can also be used to accelerate a 3D graphics API. Devices with this aspect have device type info::device_type::gpu.

aspect::accelerator

A dedicated accelerator device, usually using a peripheral interconnect for communication. Devices with this aspect have device type info::device_type::accelerator.

aspect::custom

A dedicated accelerator that can use the SYCL API, but programmable kernels cannot be dispatched to the device, only fixed functionality is available. See Section 3.9.7. Devices with this aspect have device type info::device_type::custom.

aspect::emulated

Indicates that the device is somehow emulated. A device with this aspect is not intended for performance, and instead will generally have another purpose such as emulation or profiling. The precise definition of this aspect is left open to the SYCL implementation.

As an example, a vendor might support both a hardware FPGA device and a software emulated FPGA, where the emulated FPGA has all the same features as the hardware one but runs more slowly and can provide additional profiling or diagnostic information. In such a case, an application’s device selector can use aspect::emulated to distinguish the two.

aspect::host_debuggable

Indicates that kernels running on this device can be debugged using standard debuggers that are normally available on the host system where the SYCL implementation resides. The precise definition of this aspect is left open to the SYCL implementation.

aspect::fp16

Indicates that kernels submitted to the device may use the sycl::half data type.

aspect::fp64

Indicates that kernels submitted to the device may use the double data type.

aspect::atomic64

Indicates that kernels submitted to the device may perform 64-bit atomic operations.

aspect::image

Indicates that the device supports images.

aspect::online_compiler

Indicates that the device supports online compilation of device code. Devices that have this aspect support the build() and compile() functions defined in Section 4.11.11.

aspect::online_linker

Indicates that the device supports online linking of device code. Devices that have this aspect support the link() functions defined in Section 4.11.11. All devices that have this aspect also have aspect::online_compiler.

aspect::queue_profiling

Indicates that the device supports queue profiling via property::queue::enable_profiling.

aspect::usm_device_allocations

Indicates that the device supports explicit USM allocations as described in Section 4.8.

aspect::usm_host_allocations

Indicates that the device can access USM memory allocated via usm::alloc::host. The device only supports atomic modification of a host allocation if aspect::usm_atomic_host_allocations is also supported. (See Section 4.8.)

aspect::usm_atomic_host_allocations

Indicates that the device supports USM memory allocated via usm::alloc::host. The host and this device may concurrently access and atomically modify host allocations. (See Section 4.8.)

aspect::usm_shared_allocations

Indicates that the device supports USM memory allocated via usm::alloc::shared on the same device. Concurrent access and atomic modification of a shared allocation is only supported if aspect::usm_atomic_shared_allocations is also supported. (See Section 4.8.)

aspect::usm_atomic_shared_allocations

Indicates that the device supports USM memory allocated via usm::alloc::shared. The host and other devices in the same context that also support this capability may concurrently access and atomically modify shared allocations. The allocation is free to migrate between the host and the appropriate devices. (See Section 4.8.)

aspect::usm_system_allocations

Indicates that the system allocator may be used instead of SYCL USM allocation mechanisms for usm::alloc::shared allocations on this device. (See Section 4.8.)

The implementation also provides two traits that the application can use to query aspects at compilation time. The traits any_device_has<aspect> and all_devices_have<aspect> are set according to the collection of devices D that can possibly execute device code, as determined by the compilation environment. The trait any_device_has<aspect> inherits from std::true_type only if at least one device in D has the specified aspect. The trait all_devices_have<aspect> inherits from std::true_type only if all devices in D have the specified aspect.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
namespace sycl {

template <aspect Aspect> struct any_device_has;
template <aspect Aspect> struct all_devices_have;

template <aspect A>
inline constexpr bool any_device_has_v = any_device_has<A>::value;
template <aspect A>
inline constexpr bool all_devices_have_v = all_devices_have<A>::value;

} // namespace sycl

Applications can use these traits to reduce their code size. The following example demonstrates one way to use these traits to avoid instantiating a templated kernel for device features that are not supported by any device.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
#include <sycl/sycl.hpp>
using namespace sycl; // (optional) avoids need for "sycl::" before SYCL names

constexpr int N = 512;

template <bool HasFp16> class MyKernel {
 public:
  void operator()(id<1> i) {
    if constexpr (HasFp16) {
      // Algorithm using sycl::half type
    } else {
      // Fall back code for devices that don't support sycl::half
    }
  }
};

int main() {
  queue myQueue;
  myQueue.submit([&](handler& cgh) {
    device dev = myQueue.get_device();
    if (dev.has(aspect::fp16)) {
      cgh.parallel_for(range { N },
                       MyKernel<any_device_has_v<aspect::fp16>> {});
    } else {
      cgh.parallel_for(range { N },
                       MyKernel<all_devices_have_v<aspect::fp16>> {});
    }
  });

  myQueue.wait();
}

The kernel function MyKernel is templated to use a different algorithm depending on whether the device has the aspect aspect::fp16, and the call to dev.has() chooses the kernel function instantiation that matches the device’s capabilities. However, the use of any_device_has_v and all_devices_have_v entirely avoid useless instantiations of the kernel function. For example, when the compilation environment does not support any devices with aspect::fp16, any_device_has_v<aspect::fp16> is false, and the kernel function is never instantiated with support for the sycl::half type.

Like any trait, the definitions of any_device_has and all_devices_have are uniform across all parts of a SYCL application. If an implementation uses SMCP, all compiler passes define a particular aspect’s specialization of the traits the same way, regardless of whether that compiler pass' device supports the aspect. Thus, any_device_has and all_devices_have cannot be used to determine whether any particular device supports an aspect. Instead, applications must use device::has() or platform::has() for this.

An implementation could choose to provide command line options which affect the set of devices that it supports. If so, those command line options would also affect these traits. For example, if an implementation provides a command line option that disables aspect::accelerator devices, the trait any_device_has<aspect::accelerator> would inherit from std::false_type when that command line option was specified.

These traits only reflect the supported devices at the time the SYCL application is compiled. It’s possible that unsupported devices are still visible to the application when it runs. However, if a device D is not supported when the application is compiled, the application will not be able to submit kernels to that device D.

4.6.5. Queue class

The SYCL queue class encapsulates a single SYCL queue which schedules kernels on a SYCL device.

A SYCL queue can be used to submit command groups to be executed by the SYCL runtime using the submit member function.

All member functions of the queue class are synchronous and errors are handled by throwing synchronous SYCL exceptions. The submit member function synchronously invokes the provided command group function object (as described in Section 3.7.1.2) in the calling thread, thereby scheduling a command group for asynchronous execution. Any error in the submission of a command group is handled by throwing a synchronous SYCL exception. Any errors from the command group after it has been submitted are handled by passing asynchronous errors at specific times to an async_handler, as described in Section 4.13.

A SYCL queue can wait for all command groups that it has submitted by calling wait or wait_and_throw.

The default constructor of the SYCL queue class will construct a queue based on the SYCL device returned from the default_selector_v (see Section 4.6.1.1). All other constructors construct a queue as determined by the parameters provided. All constructors will implicitly construct a SYCL platform, device and context in order to facilitate the construction of the queue.

Each constructor takes as the last parameter an optional SYCL property_list to provide properties to the SYCL queue.

A SYCL queue may be destroyed even when there are uncompleted commands that have been submitted to the queue. Doing so does not block. Instead, any commands that have been submitted to the queue begin execution when their requisites are satisfied, just as they would had the queue not been destroyed. Any event objects for those commands are signaled in the normal manner when the command completes. Resources associated with the queue will be freed by the time the last command completes.

The SYCL queue class provides the common reference semantics (see Section 4.5.2).

4.6.5.1. Queue interface

A synopsis of the SYCL queue class is provided below. The constructors and member functions of the SYCL queue class are listed in Table 27 and Table 28 respectively. The additional common special member functions and common member functions are listed in Section 4.5.2 in Table 7 and Table 8, respectively.

Some queue member functions are shortcuts to member functions of the handler class. These are listed in Section 4.6.5.2.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
namespace sycl {
class queue {
 public:
  explicit queue(const property_list& propList = {});

  explicit queue(const async_handler& asyncHandler,
                 const property_list& propList = {});

  template <typename DeviceSelector>
  explicit queue(const DeviceSelector& deviceSelector,
                 const property_list& propList = {});

  template <typename DeviceSelector>
  explicit queue(const DeviceSelector& deviceSelector,
                 const async_handler& asyncHandler,
                 const property_list& propList = {});

  explicit queue(const device& syclDevice, const property_list& propList = {});

  explicit queue(const device& syclDevice, const async_handler& asyncHandler,
                 const property_list& propList = {});

  template <typename DeviceSelector>
  explicit queue(const context& syclContext,
                 const DeviceSelector& deviceSelector,
                 const property_list& propList = {});

  template <typename DeviceSelector>
  explicit queue(const context& syclContext,
                 const DeviceSelector& deviceSelector,
                 const async_handler& asyncHandler,
                 const property_list& propList = {});

  explicit queue(const context& syclContext, const device& syclDevice,
                 const property_list& propList = {});

  explicit queue(const context& syclContext, const device& syclDevice,
                 const async_handler& asyncHandler,
                 const property_list& propList = {});

  /* -- common interface members -- */

  /* -- property interface members -- */

  backend get_backend() const noexcept;

  context get_context() const;

  device get_device() const;

  bool is_in_order() const;

  template <typename Param> typename Param::return_type get_info() const;

  template <typename Param>
  typename Param::return_type get_backend_info() const;

  template <typename T> event submit(T cgf);

  template <typename T> event submit(T cgf, const queue& secondaryQueue);

  void wait();

  void wait_and_throw();

  void throw_asynchronous();

  /* -- convenience shortcuts -- */

  template <typename KernelName, typename KernelType>
  event single_task(const KernelType& kernelFunc);

  template <typename KernelName, typename KernelType>
  event single_task(event depEvent, const KernelType& kernelFunc);

  template <typename KernelName, typename KernelType>
  event single_task(const std::vector<event>& depEvents,
                    const KernelType& kernelFunc);

  // Parameter pack acts as-if: Reductions&&... reductions, const KernelType
  // &kernelFunc
  template <typename KernelName, int Dims, typename... Rest>
  event parallel_for(range<Dims> numWorkItems, Rest&&... rest);

  // Parameter pack acts as-if: Reductions&&... reductions, const KernelType
  // &kernelFunc
  template <typename KernelName, int Dims, typename... Rest>
  event parallel_for(range<Dims> numWorkItems, event depEvent, Rest&&... rest);

  // Parameter pack acts as-if: Reductions&&... reductions, const KernelType
  // &kernelFunc
  template <typename KernelName, int Dims, typename... Rest>
  event parallel_for(range<Dims> numWorkItems,
                     const std::vector<event>& depEvents, Rest&&... rest);

  // Parameter pack acts as-if: Reductions&&... reductions, const KernelType
  // &kernelFunc
  template <typename KernelName, int Dims, typename... Rest>
  event parallel_for(nd_range<Dims> executionRange, Rest&&... rest);

  // Parameter pack acts as-if: Reductions&&... reductions, const KernelType
  // &kernelFunc
  template <typename KernelName, int Dims, typename... Rest>
  event parallel_for(nd_range<Dims> executionRange, event depEvent,
                     Rest&&... rest);

  // Parameter pack acts as-if: Reductions&&... reductions, const KernelType
  // &kernelFunc
  template <typename KernelName, int Dims, typename... Rest>
  event parallel_for(nd_range<Dims> executionRange,
                     const std::vector<event>& depEvents, Rest&&... rest);

  /* -- USM functions -- */

  event memcpy(void* dest, const void* src, size_t numBytes);
  event memcpy(void* dest, const void* src, size_t numBytes, event depEvent);
  event memcpy(void* dest, const void* src, size_t numBytes,
               const std::vector<event>& depEvents);

  template <typename T> event copy(const T* src, T* dest, size_t count);
  template <typename T>
  event copy(const T* src, T* dest, size_t count, event depEvent);
  template <typename T>
  event copy(const T* src, T* dest, size_t count,
             const std::vector<event>& depEvents);

  event memset(void* ptr, int value, size_t numBytes);
  event memset(void* ptr, int value, size_t numBytes, event depEvent);
  event memset(void* ptr, int value, size_t numBytes,
               const std::vector<event>& depEvents);

  template <typename T> event fill(void* ptr, const T& pattern, size_t count);
  template <typename T>
  event fill(void* ptr, const T& pattern, size_t count, event depEvent);
  template <typename T>
  event fill(void* ptr, const T& pattern, size_t count,
             const std::vector<event>& depEvents);

  event prefetch(void* ptr, size_t numBytes);
  event prefetch(void* ptr, size_t numBytes, event depEvent);
  event prefetch(void* ptr, size_t numBytes,
                 const std::vector<event>& depEvents);

  event mem_advise(void* ptr, size_t numBytes, int advice);
  event mem_advise(void* ptr, size_t numBytes, int advice, event depEvent);
  event mem_advise(void* ptr, size_t numBytes, int advice,
                   const std::vector<event>& depEvents);

  /// Placeholder accessor shortcuts

  // Explicit copy functions

  template <typename SrcT, int SrcDims, access_mode SrcMode, target SrcTgt,
            access::placeholder IsPlaceholder, typename DestT>
  event copy(accessor<SrcT, SrcDims, SrcMode, SrcTgt, IsPlaceholder> src,
             std::shared_ptr<DestT> dest);

  template <typename SrcT, typename DestT, int DestDims, access_mode DestMode,
            target DestTgt, access::placeholder IsPlaceholder>
  event copy(std::shared_ptr<SrcT> src,
             accessor<DestT, DestDims, DestMode, DestTgt, IsPlaceholder> dest);

  template <typename SrcT, int SrcDims, access_mode SrcMode, target SrcTgt,
            access::placeholder IsPlaceholder, typename DestT>
  event copy(accessor<SrcT, SrcDims, SrcMode, SrcTgt, IsPlaceholder> src,
             DestT* dest);

  template <typename SrcT, typename DestT, int DestDims, access_mode DestMode,
            target DestTgt, access::placeholder IsPlaceholder>
  event copy(const SrcT* src,
             accessor<DestT, DestDims, DestMode, DestTgt, IsPlaceholder> dest);

  template <typename SrcT, int SrcDims, access_mode SrcMode, target SrcTgt,
            access::placeholder IsSrcPlaceholder, typename DestT, int DestDims,
            access_mode DestMode, target DestTgt,
            access::placeholder IsDestPlaceholder>
  event
  copy(accessor<SrcT, SrcDims, SrcMode, SrcTgt, IsSrcPlaceholder> src,
       accessor<DestT, DestDims, DestMode, DestTgt, IsDestPlaceholder> dest);

  template <typename T, int Dims, access_mode Mode, target Tgt,
            access::placeholder IsPlaceholder>
  event update_host(accessor<T, Dim, Mode, Tgt, IsPlaceholder> acc);

  template <typename T, int Dims, access_mode Mode, target Tgt,
            access::placeholder IsPlaceholder>
  event fill(accessor<T, Dims, Mode, Tgt, IsPlaceholder> dest, const T& src);
};
} // namespace sycl
Table 27. Constructors of the queue class
Constructor Description
explicit queue(const property_list& propList = {})

Constructs a SYCL queue instance using the device constructed from the default_selector_v. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

explicit queue(const async_handler& asyncHandler,
               const property_list& propList = {})

Constructs a SYCL queue instance with an async_handler using the device constructed from the default_selector_v. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

template <typename DeviceSelector>
explicit queue(const DeviceSelector& deviceSelector,
               const property_list& propList = {})

Constructs a SYCL queue instance using the device returned by the device selector provided. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

template <typename DeviceSelector>
explicit queue(const DeviceSelector& deviceSelector,
               const async_handler& asyncHandler,
               const property_list& propList = {})

Constructs a SYCL queue instance with an async_handler using the device returned by the device selector provided. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

explicit queue(const device& syclDevice, const property_list& propList = {})

Constructs a SYCL queue instance using the syclDevice provided. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

explicit queue(const device& syclDevice, const async_handler& asyncHandler,
               const property_list& propList = {})

Constructs a SYCL queue instance with an async_handler using the syclDevice provided. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

template <typename DeviceSelector>
explicit queue(const context& syclContext, const DeviceSelector& deviceSelector,
               const property_list& propList = {})

Constructs a SYCL queue instance that is associated with the syclContext provided, using the device returned by the device selector provided. Must throw an exception with the errc::invalid error code if syclContext does not encapsulate the SYCL device returned by deviceSelector. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

template <typename DeviceSelector>
explicit queue(const context& syclContext, const DeviceSelector& deviceSelector,
               const async_handler& asyncHandler,
               const property_list& propList = {})

Constructs a SYCL queue instance with an async_handler that is associated with the syclContext provided, using the device returned by the device selector provided. Must throw an exception with the errc::invalid error code if syclContext does not encapsulate the SYCL device returned by deviceSelector. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

explicit queue(const context& syclContext, const device& syclDevice,
               const property_list& propList = {})

Constructs a SYCL queue instance using the syclDevice provided. This device must either be contained by syclContext or it must be a descendent device of some device that is contained by that context, otherwise this function throws a synchronous exception with the errc::invalid error code. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

explicit queue(const context& syclContext, const device& syclDevice,
               const async_handler& asyncHandler,
               const property_list& propList = {})

Constructs a SYCL queue instance with an async_handler using the syclDevice provided. This device must either be contained by syclContext or it must be a descendent device of some device that is contained by that context, otherwise this function throws a synchronous exception with the errc::invalid error code. Zero or more properties can be provided to the constructed SYCL queue via an instance of property_list.

Table 28. Member functions for queue class
Member function Description
backend get_backend() const noexcept

Returns a backend identifying the SYCL backend associated with this queue.

context get_context() const

Returns the SYCL queue’s context. Reports errors using SYCL exception classes. The value returned must be equal to that returned by get_info<info::queue::context>().

device get_device() const

Returns the SYCL device the queue is associated with. Reports errors using SYCL exception classes. The value returned must be equal to that returned by get_info<info::queue::device>().

bool is_in_order() const

Returns true if the SYCL queue was created with the in_order property. Equivalent to has_property<property::queue::in_order>().

void wait()

Performs a blocking wait for the completion of all enqueued tasks in the queue. Synchronous errors will be reported through SYCL exceptions.

void wait_and_throw()

Performs a blocking wait for the completion of all enqueued tasks in the queue. Synchronous errors will be reported through SYCL exceptions. Any unconsumed asynchronous errors will be passed to the async_handler associated with the queue or enclosing context. If no user defined async_handler is associated with the queue or enclosing context, then an implementation-defined default async_handler is called to handle any errors, as described in Section 4.13.1.2.

void throw_asynchronous()

Checks to see if any unconsumed asynchronous errors have been produced by the queue and if so reports them by passing them to the async_handler associated with the queue or enclosing context. If no user defined async_handler is associated with the queue or enclosing context, then an implementation-defined default async_handler is called to handle any errors, as described in Section 4.13.1.2.

template <typename Param> typename Param::return_type get_info() const

Queries this SYCL queue for information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the info parameters in Table 30 to facilitate returning the type associated with the Param parameter.

template <typename T> event submit(T cgf)

Submit a command group function object to the queue, in order to be scheduled for execution on the device.

template <typename T> event submit(T cgf, queue& secondaryQueue)

Submit a command group function object to the queue, in order to be scheduled for execution on the device. On a kernel error, this command group function object is then scheduled for execution on the secondary queue. Returns an event, which corresponds to the queue the command group function object is being enqueued on.

template <typename Param> typename Param::return_type get_backend_info() const

Queries this SYCL queue for SYCL backend-specific information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the SYCL backend specification. Must throw an exception with the errc::backend_mismatch error code if the SYCL backend that corresponds with Param is different from the SYCL backend that is associated with this queue.

4.6.5.2. Queue shortcut functions

Queue shortcut functions are member functions of the queue class that implicitly create a command group with an implicit command group handler consisting of a single command, a call to the member function of the handler object with the same signature (e.g. queue::single_task will call handler::single_task with the same arguments), and submit the command group. The main signature difference comes from the return type: member functions of the handler return void, whereas corresponding queue shortcut functions return an event object that represents the submitted command group. Queue shortcuts can additionally take a list of events to wait on, as if passing the event list to handler::depends_on for the implicit command group.

The full list of queue shortcuts is defined in Table 29. The list of handler member functions is defined in Table 129.

It is not allowed to capture accessors into the implicitly created command group. If a queue shortcut function launches a kernel (via single_task or parallel_for), only USM pointers are allowed inside such kernels. However, queue shortcuts that perform non-kernel operations can be provided with a valid placeholder accessor as an argument. In that case there is an additional step performed: the implicit command group handler calls handler::require on each accessor passed in as a function argument.

An example of using queue shortcuts is shown below.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
class MyKernel;

queue myQueue;
auto usmPtr = malloc_device<int>(1024, myQueue); // USM pointer

int* data = /* pointer to some data */;
buffer buf { data, 1024 };
accessor acc { buf }; // Placeholder accessor

// Queue shortcut for a kernel invocation
myQueue.single_task<MyKernel>([=] {
  // Allowed to use USM pointers,
  // not allowed to use accessors
  usmPtr[0] = 0;
});

// Placeholder accessor will automatically be registered
myQueue.copy(data, acc);
Table 29. Queue shortcut functions
Function Definition Function Type Description
template <typename KernelName, typename KernelType>
event single_task(const KernelType& kernelFunc)

Equivalent to submitting a command-group containing handler::single_task(kernelFunc).

template <typename KernelName, typename KernelType>
event single_task(event depEvent, const KernelType& kernelFunc)

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::single_task(kernelFunc).

template <typename KernelName, typename KernelType>
event single_task(const std::vector<event>& depEvents,
                  const KernelType& kernelFunc)

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::single_task(kernelFunc).

template <typename KernelName, int Dimensions, typename... Rest>
event parallel_for(range<Dimensions> numWorkItems, Rest&&... rest)

Equivalent to submitting a command-group containing handler::parallel_for(numWorkItems, rest).

template <typename KernelName, int Dimensions, typename... Rest>
event parallel_for(range<Dimensions> numWorkItems, event depEvent,
                   Rest&&... rest)

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::parallel_for(numWorkItems, rest).

template <typename KernelName, int Dimensions, typename... Rest>
event parallel_for(range<Dimensions> numWorkItems,
                   const std::vector<event>& depEvents, Rest&&... rest)

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::parallel_for(numWorkItems, rest).

template <typename KernelName, int Dimensions, typename... Rest>
event parallel_for(nd_range<Dimensions> executionRange, Rest&&... rest)

Equivalent to submitting a command-group containing handler::parallel_for(executionRange, rest).

template <typename KernelName, int Dimensions, typename... Rest>
event parallel_for(nd_range<Dimensions> executionRange, event depEvent,
                   Rest&&... rest)

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::parallel_for(executionRange, rest).

template <typename KernelName, int Dimensions, typename... Rest>
event parallel_for(nd_range<Dimensions> executionRange,
                   const std::vector<event>& depEvents, Rest&&... rest)

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::parallel_for(executionRange, rest).

event memcpy(void* dest, const void* src, size_t numBytes)

USM

Equivalent to submitting a command-group containing handler::memcpy(dest, src, numBytes).

event memcpy(void* dest, const void* src, size_t numBytes, event depEvent)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::memcpy(dest, src, numBytes).

event memcpy(void* dest, const void* src, size_t numBytes,
             const std::vector<event>& depEvents)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::memcpy(dest, src, numBytes).

template <typename T> event copy(const T* src, T* dest, size_t count)

USM

Equivalent to submitting a command-group containing handler::copy(src, dest, count).

template <typename T>
event copy(const T* src, T* dest, size_t count, event depEvent)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::copy(src, dest, count).

template <typename T>
event copy(const T* srct, T* dest, size_t count,
           const std::vector<event>& depEvents)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::copy(src, dest, count).

event memset(void* ptr, int value, size_t numBytes)

USM

Equivalent to submitting a command-group containing handler::memset(ptr, value, numBytes).

event memset(void* ptr, int value, size_t numBytes, event depEvent)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::memset(ptr, value, numBytes).

event memset(void* ptr, int value, size_t numBytes,
             const std::vector<event>& depEvents)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::memset(ptr, value, numBytes).

template <typename T> event fill(void* ptr, const T& pattern, size_t count)

USM

Equivalent to submitting a command-group containing handler::fill(ptr, pattern, count).

template <typename T>
event fill(void* ptr, const T& pattern, size_t count, event depEvent)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::fill(ptr, pattern, count).

template <typename T>
event fill(void* ptr, const T& pattern, size_t count,
           const std::vector<event>& depEvents)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::fill(ptr, pattern, count).

event prefetch(void* ptr, size_t numBytes)

USM

Equivalent to submitting a command-group containing handler::prefetch(ptr, numBytes).

event prefetch(void* ptr, size_t numBytes, event depEvent)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::prefetch(ptr, numBytes).

event prefetch(void* ptr, size_t numBytes, const std::vector<event>& depEvents)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::prefetch(ptr, numBytes).

event mem_advise(void* ptr, size_t numBytes, int advice)

USM

Equivalent to submitting a command-group containing handler::mem_advise(ptr, numBytes, advice).

event mem_advise(void* ptr, size_t numBytes, int advice, event depEvent)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvent) and handler::mem_advise(ptr, numBytes, advice).

event mem_advise(void* ptr, size_t numBytes, int advice,
                 const std::vector<event>& depEvents)

USM

Equivalent to submitting a command-group containing handler::depends_on(depEvents) and handler::mem_advise(ptr, numBytes, advice).

template <typename SrcT, int SrcDims, access_mode SrcMode, target SrcTgt,
          access::placeholder IsPlaceholder, typename DestT>
event copy(accessor<SrcT, SrcDims, SrcMode, SrcTgt, IsPlaceholder> src,
           std::shared_ptr<DestT> dest);

Equivalent to submitting a command-group containing handler::require(src) and handler::copy(src, dest).

template <typename SrcT, typename DestT, int DestDims, access_mode DestMode,
          target DestTgt, access::placeholder IsPlaceholder>
event copy(std::shared_ptr<SrcT> src,
           accessor<DestT, DestDims, DestMode, DestTgt, IsPlaceholder> dest);

Equivalent to submitting a command-group containing handler::require(dest) and handler::copy(src, dest).

template <typename SrcT, int SrcDims, access_mode SrcMode, target SrcTgt,
          access::placeholder IsPlaceholder, typename DestT>
event copy(accessor<SrcT, SrcDims, SrcMode, SrcTgt, IsPlaceholder> src,
           DestT* dest);

Equivalent to submitting a command-group containing handler::require(src) and handler::copy(src, dest).

template <typename SrcT, typename DestT, int DestDims, access_mode DestMode,
          target DestTgt, access::placeholder IsPlaceholder>
event copy(const SrcT* src,
           accessor<DestT, DestDims, DestMode, DestTgt, IsPlaceholder> dest);

Equivalent to submitting a command-group containing handler::require(dest) and handler::copy(src, dest).

template <typename SrcT, int SrcDims, access_mode SrcMode, target SrcTgt,
          access::placeholder IsSrcPlaceholder, typename DestT, int DestDims,
          access_mode DestMode, target DestTgt,
          access::placeholder IsDestPlaceholder>
event copy(
    accessor<SrcT, SrcDims, SrcMode, SrcTgt, IsSrcPlaceholder> src,
    accessor<DestT, DestDims, DestMode, DestTgt, IsDestPlaceholder> dest);

Equivalent to submitting a command-group containing handler::require(src), handler::require(dest) and handler::copy(src, dest).

template <typename T, int Dims, access_mode Mode, target Tgt,
          access::placeholder IsPlaceholder>
event update_host(accessor<T, Dims, Mode, Tgt, IsPlaceholder> acc);

Equivalent to submitting a command-group containing handler::require(acc) and handler::update_host(acc).

template <typename T, int Dims, access_mode Mode, target Tgt,
          access::placeholder IsPlaceholder>
event fill(accessor<T, Dims, Mode, Tgt, IsPlaceholder> dest, const T& src);

Equivalent to submitting a command-group containing handler::require(dest) and handler::fill(dest, src).

4.6.5.3. Queue information descriptors

A queue can be queried for information using the get_info member function of the queue class, specifying one of the info parameters in info::queue. The possible values for each info parameter and any restriction are defined in the specification of the SYCL backend associated with the queue. All info parameters in info::queue are specified in Table 30 and the synopsis for info::queue is described in Section A.4.

Table 30. Queue information descriptors
Queue Descriptors Return type Description
info::queue::context

context

Returns the SYCL context associated with this SYCL queue.

info::queue::device

device

Returns the SYCL device associated with this SYCL queue.

4.6.5.4. Queue properties

The properties that can be provided when constructing the SYCL queue class are describe in Table 31.

Table 31. Properties supported by the SYCL queue class
Property Description
property::queue::enable_profiling

The enable_profiling property adds the requirement that the SYCL runtime must capture profiling information for the command groups that are submitted from this SYCL queue and provide said information via the SYCL event class get_profiling_info member function. If the queue’s associated device does not have aspect::queue_profiling, passing this property to the queue’s constructor causes the constructor to throw a synchronous exception with the errc::feature_not_supported error code.

property::queue::in_order

The in_order property adds the requirement that a SYCL queue provides in-order semantics whereby commands submitted to said queue are executed in the order in which they are submitted. Commands submitted in this fashion can be viewed as-if having an implicit dependence on the previous command submitted to that queue. Using the in_order property makes no guarantees about the ordering of commands submitted to different queues with respect to each other.

The constructors of the queue property classes are listed in Table 32.

Table 32. Constructors of the queue property classes
Constructor Description
property::queue::enable_profiling::enable_profiling()

Constructs a SYCL enable_profiling property instance.

property::queue::in_order::in_order()

Constructs a SYCL in_order property instance.

4.6.5.5. Queue error handling

Queue errors come in two forms:

  • Synchronous Errors are those that we would expect to be reported directly at the point of waiting on an event, and hence waiting for a queue to complete, as well as any immediate errors reported by enqueuing work onto a queue. Such errors are reported through C++ exceptions.

  • Asynchronous errors are those that are produced or detected after associated host API calls have returned (so can’t be thrown as exceptions by the API call), and that are handled by an async_handler through which the errors are reported. Handling of asynchronous errors from a queue occurs at specific times, as described by Section 4.13.

Note that if there are asynchronous errors to be processed when a queue is destructed, the handler is called and this might delay or block the destruction, according to the behavior of the handler.

4.6.6. Event class

An event in SYCL is an object that represents the status of an operation that is being executed by the SYCL runtime.

Typically in SYCL, data dependency and execution order is handled implicitly by the SYCL runtime. However, in some circumstances developers want fine grain control of the execution, or want to retrieve properties of a command that is running.

Note that, although an event represents the status of a particular operation, the dependencies of a certain event can be used to keep track of multiple steps required to synchronize said operation.

A SYCL event is returned by the submission of a command group. The dependencies of the event returned via the submission of the command group are the implementation-defined commands associated with the command group execution.

The SYCL event class provides the common reference semantics (see Section 4.5.2).

The constructors and member functions of the SYCL event class are listed in Table 33 and Table 34, respectively. The additional common special member functions and common member functions are listed in Table 7 and Table 8, respectively.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
namespace sycl {

class event {
 public:
  event();

  /* -- common interface members -- */

  backend get_backend() const noexcept;

  std::vector<event> get_wait_list();

  void wait();

  static void wait(const std::vector<event>& eventList);

  void wait_and_throw();

  static void wait_and_throw(const std::vector<event>& eventList);

  template <typename Param> typename Param::return_type get_info() const;

  template <typename Param>
  typename Param::return_type get_backend_info() const;

  template <typename Param>
  typename Param::return_type get_profiling_info() const;
};

} // namespace sycl
Table 33. Constructors of the event class
Constructor Description
event()

Constructs an event that is immediately ready. The event has no dependencies and no associated commands. Waiting on this event will return immediately and querying its status will return info::event_command_status::complete.

The event is constructed as though it was created from a default-constructed queue. Therefore, its backend is the same as the backend from the default device.

Table 34. Member functions for the event class
Member function Description
backend get_backend() const noexcept

Returns a backend identifying the SYCL backend associated with this event.

std::vector<event> get_wait_list()

Return the list of events that this event waits for in the dependence graph. Only direct dependencies are returned, and not transitive dependencies that direct dependencies wait on. Whether already completed events are included in the returned list is implementation-defined.

void wait()

Wait for the event and the command associated with it to complete.

void wait_and_throw()

Wait for the event and the command associated with it to complete.

Any unconsumed asynchronous errors from any context that the event was waiting on executions from will be passed to the async_handler associated with the context. If no user defined async_handler is associated with the context, then an implementation-defined default async_handler is called to handle any errors, as described in Section 4.13.1.2.

static void wait(const std::vector<event>& eventList)

Synchronously wait on a list of events.

static void wait_and_throw(const std::vector<event>& eventList)

Synchronously wait on a list of events.

Any unconsumed asynchronous errors from any context that the event was waiting on executions from will be passed to the async_handler associated with the context. If no user defined async_handler is associated with the context, then an implementation-defined default async_handler is called to handle any errors, as described in Section 4.13.1.2.

template <typename Param> typename Param::return_type get_info() const

Queries this SYCL event for information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the info parameters in Table 35 to facilitate returning the type associated with the Param parameter.

template <typename Param> typename Param::return_type get_backend_info() const

Queries this SYCL event for SYCL backend-specific information requested by the template parameter Param. The type alias Param::return_type must be defined in accordance with the SYCL backend specification. Must throw an exception with the errc::backend_mismatch error code if the SYCL backend that corresponds with Param is different from the SYCL backend that is associated with this event.

template <typename Param> typename Param::return_type get_profiling_info() const

Queries this SYCL event for profiling information requested by the parameter Param. If the requested profiling information is unavailable when get_profiling_info is called due to incompletion of command groups associated with the event, then the call to get_profiling_info will block until the requested profiling information is available. An example is asking for info::event_profiling::command_end when the associated command group action has yet to finish execution. Calls to get_profiling_info must throw an exception with the errc::invalid error code if the SYCL queue which submitted the command group this SYCL event is associated with was not constructed with the property::queue::enable_profiling property. The type alias Param::return_type must be defined in accordance with the info parameters in Table 37 to facilitate returning the type associated with the Param parameter.

4.6.6.1. Event information and profiling descriptors

An event can be queried for information using the get_info member function of the event class, specifying one of the info parameters in info::event. The possible values for each info parameter and any restrictions are defined in the specification of the SYCL backend associated with the event. All info parameters in info::event are specified in Table 35 and the synopsis for info::event is described in Section A.6.

Table 35. Event class information descriptors
Event Descriptors Return type Description
info::event::command_execution_status

info::event_command_status

Returns the event status of the command group and contained action (e.g. kernel invocation) associated with this SYCL event.

The info::event::command_execution_status query returns one of the values defined in Table 36.

Table 36. Event command status
Status Description
info::event_command_status::submitted

Indicates that the command has been submitted to the SYCL queue but has not yet started running on the device.

info::event_command_status::running

Indicates that the command has started running on the device but has not yet completed.

info::event_command_status::complete

Indicates that the command has finished running on the device. Attempting to wait on such an event will not block.

An event can be queried for profiling information using the get_profiling_info member function of the event class, specifying one of the profiling info parameters enumerated in info::event_profiling. The possible values for each info parameter and any restrictions are defined in the specification of the SYCL backend associated with the event. All info parameters in info::event_profiling are specified in Table 37 and the synopsis for info::event_profiling is described in Section A.6.

Each profiling descriptor returns a 64-bit timestamp that represents the number of nanoseconds that have elapsed since some implementation-defined timebase. All events that share the same backend are guaranteed to share the same timebase, therefore the difference between two timestamps from the same backend yields the number of nanoseconds that have elapsed between those events.

Table 37. Profiling information descriptors for the SYCL event class
Event information profiling descriptor Return type Description
info::event_profiling::command_submit

uint64_t

Returns a timestamp telling when the associated command group was submitted to the queue. This is always some time after the command group function object returns and before the associated call to queue::submit returns.

info::event_profiling::command_start

uint64_t

Querying this profiling descriptor blocks until the event’s state becomes either info::event_command_status::running or info::event_command_status::complete. The returned timestamp tells when the action associated with the command group (e.g. kernel invocation) started executing on the device. For any given event, this timestamp is always greater than or equal to the info::event_profiling::command_submit timestamp. Implementations are encouraged to return a timestamp that is as close as possible to the point when the action starts running on the device, but there is no specific accuracy that is guaranteed.

info::event_profiling::command_end

uint64_t

Querying this profiling descriptor blocks until the event’s state becomes info::event_command_status::complete. The returned timestamp tells when the action associated with the command group (e.g. kernel invocation) finished executing on the device. For any given event, this timestamp is always greater than or equal to the info::event_profiling::command_start timestamp.

4.7. Data access and storage in SYCL

In SYCL, when using buffers and images, data storage and access are handled by separate classes. Buffers and images handle storage and ownership of the data, whereas accessors handle access to the data. Buffers and images in SYCL can be bound to more than one device or context, including across different SYCL backends. They also handle ownership of the data, while allowing exception handling for blocking and non-blocking data transfers. Accessors manage data transfers between the host and all of the devices in the system, as well as tracking of data dependencies.

Zero-sized buffers and accessors are permitted, but attempting to access data within them produces undefined behavior, similar to dereferencing a null pointer in C++. Note that zero-sized accessors can be created in several ways: by creating an accessor from a zero-sized buffer, by creating an accessor with a zero-sized buffer sub-range, or by creating an accessor with its default constructor.

When using USM allocations, data storage is managed by USM allocation functions, and data access is via pointers. See Section 4.8 for greater detail.

4.7.1. Host allocation

A SYCL runtime may need to allocate temporary objects on the host to handle some operations (such as copying data from one context to another). Allocation on the host is managed using an allocator object, following the standard C++ allocator class definition. The default allocator for memory objects is implementation-defined, but the user can supply their own allocator class.

1
2
3
{
    buffer<int, 1, UserDefinedAllocator<int>> b(d);
}

When an allocator returns a nullptr, the runtime cannot allocate data on the host. Note that in this case the runtime will raise an error if it requires host memory but it is not available (e.g when moving data across SYCL backend contexts).

In some cases, the implementation may retain a copy of the allocator object even after the buffer is destroyed. For example, this can happen when the buffer object is destroyed before commands using accessors to the buffer have completed. Therefore, the application must be prepared for calls to the allocator even after the buffer is destroyed.

If the application needs to know when the implementation has destroyed all copies of the allocator, it can maintain a reference count within the allocator.

The definition of allocators extends the current functionality of SYCL, ensuring that users can define allocator functions for specific hardware or certain complex shared memory mechanisms (e.g. NUMA), and improves interoperability with STL-based libraries (e.g, Intel’s TBB provides an allocator).

4.7.1.1. Default allocators

A default allocator is always defined by the implementation. For allocations greater than size zero, it is guaranteed to return non-nullptr and new memory positions every call. The default allocator for const buffers will remove the const-ness of the type (therefore, the default allocator for a buffer of type const int will be an Allocator<int>). This implies that host accessors will not synchronize with the pointer given by the user in the buffer/image constructor, but will use the memory returned by the Allocator itself for that purpose. The user can implement an allocator that returns the same address as the one passed in the buffer constructor, but it is the responsibility of the user to handle the potential race conditions.

Table 38. SYCL Default Allocators
Allocators Description
template <class T> buffer_allocator

It is the default buffer allocator used by the runtime, when no allocator is defined by the user. Meets the C++ named requirement Allocator. A buffer of data type const T uses buffer_allocator<T> by default.

image_allocator

It is the default allocator used by the runtime for the SYCL unsampled_image and sampled_image classes when no allocator is provided by the user. The image_allocator is required to allocate in elements of std::byte.

See Section 4.7.5 for details on manual host-device synchronization.

4.7.2. Buffers

The buffer class defines a shared array of one, two or three dimensions that can be used by the SYCL kernel and has to be accessed using accessor classes. Buffers are templated on both the type of their data, and the number of dimensions that the data is stored and accessed through.

A buffer does not map to only one underlying backend object, and all SYCL backend memory objects may be temporary for use within a command group on a specific device.

The underlying data type of a buffer T must be device copyable as defined in Section 3.13.1. Some overloads of the buffer constructor initialize the buffer contents by copying objects from host memory while other overloads construct the buffer without copying objects from the host. For the overloads that do not copy host objects, the initial state of the objects in the buffer depends on whether T is an implicit-lifetime type (as defined in the C++ core language). If T is an implicit-lifetime type, objects of that type are implicitly created in the buffer with indeterminate values. For other types, these constructor overloads merely allocate uninitialized memory, and the application is responsible for constructing objects by calling placement-new and for destroying them later by manually calling the object’s destructor.

For the overloads that do copy objects from host memory, the hostData pointer must point to at least N bytes of memory where N is sizeof(T) * bufferRange.size(). If N is zero, hostData is permitted to be a null pointer.

A SYCL buffer can construct an instance of a SYCL buffer that reinterprets the original SYCL buffer with a different type, dimensionality and range using the member function reinterpret. The reinterpreted SYCL buffer that is constructed must behave as though it were a copy of the SYCL buffer that constructed it (see Section 4.5.2) with the exception that the type, dimensionality and range of the reinterpreted SYCL buffer must reflect the type, dimensionality and range specified when calling the reinterpret member function. By extension of this, the class member types value_type, reference and const_reference, and the member functions get_range() and size() of the reinterpreted SYCL buffer must reflect the new type, dimensionality and range. The data that the original SYCL buffer and the reinterpreted SYCL buffer manage remains unaffected, though the representation of the data when accessed through the reinterpreted SYCL buffer may alter to reflect the new type, dimensionality and range. It is important to note that a reinterpreted SYCL buffer is a copy of the original SYCL buffer only, and not a new SYCL buffer. Constructing more than one SYCL buffer managing the same host pointer is still undefined behavior.

The SYCL buffer class template provides the common reference semantics (see Section 4.5.2).

4.7.2.1. Buffer interface

The constructors and member functions of the SYCL buffer class template are listed in Table 39 and Table 40, respectively. The additional common special member functions and common member functions are listed in Table 7 and Table 8, respectively.

Each constructor takes as the last parameter an optional SYCL property_list to provide properties to the SYCL buffer.

The SYCL buffer class template takes a template parameter AllocatorT for specifying an allocator which is used by the SYCL runtime when allocating temporary memory on the host. If no template argument is provided, then the default allocator for the SYCL buffer class buffer_allocator<T> will be used (see Section 4.7.1.1).

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
namespace sycl {
namespace property {
namespace buffer {
class use_host_ptr {
 public:
  use_host_ptr() = default;
};

class use_mutex {
 public:
  use_mutex(std::mutex& mutexRef);

  std::mutex* get_mutex_ptr() const;
};

class context_bound {
 public:
  context_bound(context boundContext);

  context get_context() const;
};
} // namespace buffer
} // namespace property

template <typename T, int Dimensions = 1,
          typename AllocatorT = buffer_allocator<std::remove_const_t<T>>>
class buffer {
 public:
  using value_type = T;
  using reference = value_type&;
  using const_reference = const value_type&;
  using allocator_type = AllocatorT;

  buffer(const range<Dimensions>& bufferRange,
         const property_list& propList = {});

  buffer(const range<Dimensions>& bufferRange, AllocatorT allocator,
         const property_list& propList = {});

  buffer(T* hostData, const range<Dimensions>& bufferRange,
         const property_list& propList = {});

  buffer(T* hostData, const range<Dimensions>& bufferRange,
         AllocatorT allocator, const property_list& propList = {});

  buffer(const T* hostData, const range<Dimensions>& bufferRange,
         const property_list& propList = {});

  buffer(const T* hostData, const range<Dimensions>& bufferRange,
         AllocatorT allocator, const property_list& propList = {});

  /* Available only if Container is a contiguous container:
       - std::data(container) and std::size(container) are well formed
       - return type of std::data(container) is convertible to T*
     and Dimensions == 1 */
  template <typename Container>
  buffer(Container& container, AllocatorT allocator,
         const property_list& propList = {});

  /* Available only if Container is a contiguous container:
       - std::data(container) and std::size(container) are well formed
       - return type of std::data(container) is convertible to T*
     and Dimensions == 1 */
  template <typename Container>
  buffer(Container& container, const property_list& propList = {});

  buffer(const std::shared_ptr<T>& hostData,
         const range<Dimensions>& bufferRange, AllocatorT allocator,
         const property_list& propList = {});

  buffer(const std::shared_ptr<T>& hostData,
         const range<Dimensions>& bufferRange,
         const property_list& propList = {});

  buffer(const std::shared_ptr<T[]>& hostData,
         const range<Dimensions>& bufferRange, AllocatorT allocator,
         const property_list& propList = {});

  buffer(const std::shared_ptr<T[]>& hostData,
         const range<Dimensions>& bufferRange,
         const property_list& propList = {});

  template <class InputIterator>
  buffer<T, 1>(InputIterator first, InputIterator last, AllocatorT allocator,
               const property_list& propList = {});

  template <class InputIterator>
  buffer<T, 1>(InputIterator first, InputIterator last,
               const property_list& propList = {});

  buffer(buffer& b, const id<Dimensions>& baseIndex,
         const range<Dimensions>& subRange);

  /* -- common interface members -- */

  /* -- property interface members -- */

  range<Dimensions> get_range() const;

  size_t byte_size() const noexcept;

  size_t size() const noexcept;

  // Deprecated
  size_t get_count() const;

  // Deprecated
  size_t get_size() const;

  AllocatorT get_allocator() const;

  template <access_mode Mode = access_mode::read_write,
            target Targ = target::device>
  accessor<T, Dimensions, Mode, Targ> get_access(handler& commandGroupHandler);

  // Deprecated
  template <access_mode Mode>
  accessor<T, Dimensions, Mode, target::host_buffer> get_access();

  template <access_mode Mode = access_mode::read_write,
            target Targ = target::device>
  accessor<T, Dimensions, Mode, Targ>
  get_access(handler& commandGroupHandler, range<Dimensions> accessRange,
             id<Dimensions> accessOffset = {});

  // Deprecated
  template <access_mode Mode>
  accessor<T, Dimensions, Mode, target::host_buffer>
  get_access(range<Dimensions> accessRange, id<Dimensions> accessOffset = {});

  template <typename... Ts> auto get_access(Ts...);

  template <typename... Ts> auto get_host_access(Ts...);

  template <typename Destination = std::nullptr_t>
  void set_final_data(Destination finalData = nullptr);

  void set_write_back(bool flag = true);

  bool is_sub_buffer() const;

  template <typename ReinterpretT, int ReinterpretDim>
  buffer<ReinterpretT, ReinterpretDim,
         typename std::allocator_traits<AllocatorT>::template rebind_alloc<
             ReinterpretT>>
  reinterpret(range<ReinterpretDim> reinterpretRange) const;

  // Only available when ReinterpretDim == 1
  // or when (ReinterpretDim == Dimensions) &&
  //         (sizeof(ReinterpretT) == sizeof(T))
  template <typename ReinterpretT, int ReinterpretDim = Dimensions>
  buffer<ReinterpretT, ReinterpretDim,
         typename std::allocator_traits<AllocatorT>::template rebind_alloc<
             ReinterpretT>>
  reinterpret() const;
};

// Deduction guides
template <class InputIterator, class AllocatorT>
buffer(InputIterator, InputIterator, AllocatorT, const property_list& = {})
    -> buffer<typename std::iterator_traits<InputIterator>::value_type, 1,
              AllocatorT>;

template <class InputIterator>
buffer(InputIterator, InputIterator, const property_list& = {})
    -> buffer<typename std::iterator_traits<InputIterator>::value_type, 1>;

template <class T, int Dimensions, class AllocatorT>
buffer(const T*, const range<Dimensions>&, AllocatorT,
       const property_list& = {}) -> buffer<T, Dimensions, AllocatorT>;

template <class T, int Dimensions>
buffer(const T*, const range<Dimensions>&, const property_list& = {})
    -> buffer<T, Dimensions>;

template <class Container, class AllocatorT>
buffer(Container&, AllocatorT, const property_list& = {})
    -> buffer<typename Container::value_type, 1, AllocatorT>;

template <class Container>
buffer(Container&, const property_list& = {})
    -> buffer<typename Container::value_type, 1>;

} // namespace sycl
Table 39. Constructors of the buffer class
Constructor Description
buffer(const range<Dimensions>& bufferRange,
       const property_list& propList = {})

Construct a SYCL buffer instance with uninitialized memory. The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Data is not written back to the host on destruction of the buffer unless the buffer has a valid non-null pointer specified via the member function set_final_data(). Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(const range<Dimensions>& bufferRange,
       AllocatorT allocator,
       const property_list& propList = {})

Construct a SYCL buffer instance with uninitialized memory. The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Data is not written back to the host on destruction of the buffer unless the buffer has a valid non-null pointer specified via the member function set_final_data(). Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(T* hostData, const range<Dimensions>& bufferRange,
       const property_list& propList = {})

Construct a SYCL buffer instance with the hostData parameter provided. The buffer is initialized with the memory specified by hostData, and the buffer assumes exclusive access to this memory for the duration of its lifetime. The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(T* hostData, const range<Dimensions>& bufferRange,
       AllocatorT allocator,
       const property_list& propList = {})

Construct a SYCL buffer instance with the hostData parameter provided. The buffer is initialized with the memory specified by hostData, and the buffer assumes exclusive access to this memory for the duration of its lifetime. The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(const T* hostData,
       const range<Dimensions>& bufferRange,
       const property_list& propList = {})

Construct a SYCL buffer instance with the hostData parameter provided. The buffer assumes exclusive access to this memory for the duration of its lifetime.

The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host.

The host address is const T, so the host accesses can be read-only. However, the typename T is not const so the device accesses can be both read and write accesses. Since the hostData is const, this buffer is only initialized with this memory and there is no write back after its destruction, unless the buffer has another valid non-null final data address specified via the member function set_final_data() after construction of the buffer.

The range of the constructed SYCL buffer is specified by the bufferRange parameter provided.

Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(const T* hostData,
       const range<Dimensions>& bufferRange,
       AllocatorT allocator,
       const property_list& propList = {})

Construct a SYCL buffer instance with the hostData parameter provided. The buffer assumes exclusive access to this memory for the duration of its lifetime.

The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host.

The host address is const T, so the host accesses can be read-only. However, the typename T is not const so the device accesses can be both read and write accesses. Since, the hostData is const, this buffer is only initialized with this memory and there is no write back after its destruction, unless the buffer has another valid non-null final data address specified via the member function set_final_data() after construction of the buffer.

The range of the constructed SYCL buffer is specified by the bufferRange parameter provided.

Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

template <typename Container>
buffer(Container& container,
       const property_list& propList = {})

Construct a one dimensional SYCL buffer instance from the elements starting at std::data(container) and containing std::size(container) number of elements. The buffer is initialized with the contents of container, and the buffer assumes exclusive access to container for the duration of its lifetime.

Data is written back to container before the completion of buffer destruction if the return type of std::data(container) is not const.

The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host.

Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

This constructor is only defined for a buffer parameterized with Dimensions == 1, and when std::data(container) is convertible to T*.

template <typename Container>
buffer(Container& container, AllocatorT allocator,
       const property_list& propList = {})

Construct a one dimensional SYCL buffer instance from the elements starting at std::data(container) and containing std::size(container) number of elements. The buffer is initialized with the contents of container, and the buffer assumes exclusive access to container for the duration of its lifetime.

Data is written back to container before the completion of buffer destruction if the return type of std::data(container) is not const.

The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host.

Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

This constructor is only defined for a buffer parameterized with Dimensions == 1, and when std::data(container) is convertible to T*.

buffer(const std::shared_ptr<T>& hostData,
       const range<Dimensions>& bufferRange,
       const property_list& propList = {})

When hostData is not empty, construct a SYCL buffer with the contents of its stored pointer. The buffer assumes exclusive access to this memory for the duration of its lifetime. The buffer also creates its own internal copy of the shared_ptr that shares ownership of the hostData memory, which means the application can safely release ownership of this shared_ptr when the constructor returns.

When hostData is empty, construct a SYCL buffer with uninitialized memory.

The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(const std::shared_ptr<T>& hostData,
       const range<Dimensions>& bufferRange,
       AllocatorT allocator,
       const property_list& propList = {})

When hostData is not empty, construct a SYCL buffer with the contents of its stored pointer. The buffer assumes exclusive access to this memory for the duration of its lifetime. The buffer also creates its own internal copy of the shared_ptr that shares ownership of the hostData memory, which means the application can safely release ownership of this shared_ptr when the constructor returns.

When hostData is empty, construct a SYCL buffer with uninitialized memory.

The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(const std::shared_ptr<T[]>& hostData,
       const range<Dimensions>&  bufferRange,
       const property_list& propList = {})

When hostData is not empty, construct a SYCL buffer with the contents of its stored pointer. The buffer assumes exclusive access to this memory for the duration of its lifetime. The buffer also creates its own internal copy of the shared_ptr that shares ownership of the hostData memory, which means the application can safely release ownership of this shared_ptr when the constructor returns.

When hostData is empty, construct a SYCL buffer with uninitialized memory.

The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(const std::shared_ptr<T[]>& hostData,
       const range<Dimensions>& bufferRange,
       AllocatorT allocator,
       const property_list& propList = {})

When hostData is not empty, construct a SYCL buffer with the contents of its stored pointer. The buffer assumes exclusive access to this memory for the duration of its lifetime. The buffer also creates its own internal copy of the shared_ptr that shares ownership of the hostData memory, which means the application can safely release ownership of this shared_ptr when the constructor returns.

When hostData is empty, construct a SYCL buffer with uninitialized memory.

The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host. The range of the constructed SYCL buffer is specified by the bufferRange parameter provided. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

template <typename InputIterator>
buffer(InputIterator first, InputIterator last,
       const property_list& propList = {})

Create a new allocated 1D buffer initialized from the given elements ranging from first up to one before last. The data is copied to an intermediate memory position by the runtime. Data is not written back to the same iterator set provided. However, if the buffer has a valid non-const iterator specified via the member function set_final_data(), data will be copied back to that iterator. The constructed SYCL buffer will use a default constructed AllocatorT when allocating memory on the host. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

template <typename InputIterator>
buffer(InputIterator first, InputIterator last,
       AllocatorT allocator = {},
       const property_list& propList = {})

Create a new allocated 1D buffer initialized from the given elements ranging from first up to one before last. The data is copied to an intermediate memory position by the runtime. Data is not written back to the same iterator set provided. However, if the buffer has a valid non-const iterator specified via the member function set_final_data(), data will be copied back to that iterator. The constructed SYCL buffer will use the allocator parameter provided when allocating memory on the host. Zero or more properties can be provided to the constructed SYCL buffer via an instance of property_list.

buffer(buffer& b, const id<Dimensions>& baseIndex,
       const range<Dimensions>& subRange)

Create a new sub-buffer without allocation to have separate accessors later. b is the buffer with the real data, which must not be a sub-buffer. baseIndex specifies the origin of the sub-buffer inside the buffer b. subRange specifies the size of the sub-buffer. The sum of baseIndex and subRange in any dimension must not exceed the parent buffer (b) size (bufferRange) in that dimension, and an exception with the errc::invalid error code must be thrown if violated.

The offset and range specified by baseIndex and subRange together must represent a contiguous region of the original SYCL buffer.

If a non-contiguous region of a buffer is requested when constructing a sub-buffer, then an exception with the errc::invalid error code must be thrown.

The origin (based on baseIndex) of the sub-buffer being constructed must be a multiple of the memory base address alignment of each SYCL device which accesses data from the buffer. This value is retrievable via the SYCL device class info query info::device::mem_base_addr_align. Violating this requirement causes the implementation to throw an exception with the errc::invalid error code from the accessor constructor (if the accessor is not a placeholder) or from handler::require() (if the accessor is a placeholder). If the accessor is bound to a command group with a secondary queue, the sub-buffer’s alignment must be compatible with both the primary queue’s device and the secondary queue’s device, otherwise this exception is thrown.

Must throw an exception with the errc::invalid error code if b is a sub-buffer.

Table 40. Member functions for the buffer class
Member function Description
range<Dimensions> get_range() const

Return a range object representing the size of the buffer in terms of number of elements in each dimension as passed to the constructor.

size_t size() const noexcept

Returns the total number of elements in the buffer. Equal to get_range()[0] * ... * get_range()[Dimensions-1].

size_t get_count() const

Returns the same value as size(). Deprecated.

size_t byte_size() const noexcept

Returns the size of the buffer storage in bytes. Equal to size()*sizeof(T).

size_t get_size() const

Returns the same value as byte_size(). Deprecated.

AllocatorT get_allocator() const

Returns the allocator provided to the buffer.

template <access_mode Mode = access_mode::read_write,
          target Targ = target::device>
accessor<T, Dimensions, Mode, Targ> get_access(handler& commandGroupHandler)

Returns a valid accessor to the buffer with the specified access mode and target in the command group buffer. The value of target can be target::device or target::constant_buffer.

template <access_mode Mode>
accessor<T, Dimensions, Mode, target::host_buffer> get_access()

Deprecated in SYCL 2020. Use get_host_access() instead.

Returns a valid host accessor to the buffer with the specified access mode and target.

template <access_mode Mode = access_mode::read_write,
          target Targ = target::device>
accessor<T, Dimensions, Mode, Targ> get_access(handler& commandGroupHandler,
                                               range<Dimensions> accessRange,
                                               id<Dimensions> accessOffset = {})

Returns a valid accessor to the buffer with the specified access mode and target in the command group buffer. The accessor is a ranged accessor, where the range starts at the given offset from the beginning of the buffer. The value of target can be target::device or target::constant_buffer.

Throws an exception with the errc::invalid error code if the sum of accessRange and accessOffset exceeds the range of the buffer in any dimension.

template <access_mode Mode>
accessor<T, Dimensions, Mode, target::host_buffer>
get_access(range<Dimensions> accessRange, id<Dimensions> accessOffset = {})

Deprecated in SYCL 2020. Use get_host_access() instead.

Returns a valid host accessor to the buffer with the specified access mode and target. The accessor is a ranged accessor, where the range starts at the given offset from the beginning of the buffer. The value of target can only be target::host_buffer.

Throws an exception with the errc::invalid error code if the sum of accessRange and accessOffset exceeds the range of the buffer in any dimension.

template <typename... Ts> auto get_access(Ts... args)

Returns a valid accessor as if constructed via passing the buffer and all provided arguments to the accessor constructor.

Possible implementation:

return accessor{*this, args...};

template <typename... Ts> auto get_host_access(Ts... args)

Returns a valid host_accessor as if constructed via passing the buffer and all provided arguments to the host_accessor constructor.

Possible implementation:

return host_accessor{*this, args...};

template <typename Destination = std::nullptr_t>
void set_final_data(Destination finalData = nullptr)

The finalData points to where the outcome of all the buffer processing is going to be copied to at destruction time, if the buffer was involved with a write accessor.

Destination can be either an output iterator or a std::weak_ptr<T>.

Note that a raw pointer is a special case of output iterator and thus defines the host memory to which the result is to be copied.

In the case of a weak pointer, the output is not updated if the weak pointer has expired.

If Destination is std::nullptr_t, then the copy back will not happen.

void set_write_back(bool flag = true)

This member function allows dynamically forcing or canceling the write-back of the data of a buffer on destruction according to the value of flag.

Forcing the write-back is similar to what happens during a normal write-back as described in Section 4.7.2.3 and Section 4.7.4.

If there is nowhere to write-back, using this function does not have any effect.

bool is_sub_buffer() const

Returns true if this SYCL buffer is a sub-buffer, otherwise returns false.

template <typename ReinterpretT, int ReinterpretDim>
buffer<ReinterpretT, ReinterpretDim,
       typename std::allocator_traits<AllocatorT>::template rebind_alloc<
           std::remove_const_t<ReinterpretT>>>
reinterpret(range<ReinterpretDim> reinterpretRange) const

Creates and returns a reinterpreted SYCL buffer with the type specified by ReinterpretT, dimensions specified by ReinterpretDim and range specified by reinterpretRange. The buffer object being reinterpreted can be a SYCL sub-buffer that was created from a SYCL buffer and must throw exception with the errc::invalid error code if the total size in bytes represented by the type and range of the reinterpreted SYCL buffer (or sub-buffer) does not equal the total size in bytes represented by the type and range of this SYCL buffer (or sub-buffer). Reinterpreting a sub-buffer provides a reinterpreted view of the sub-buffer only, and does not change the offset or size of the sub-buffer view (in bytes) relative to the parent buffer.

template <typename ReinterpretT, int ReinterpretDim = Dimensions>
buffer<ReinterpretT, ReinterpretDim,
       typename std::allocator_traits<AllocatorT>::template rebind_alloc<
           std::remove_const_t<ReinterpretT>>>
reinterpret() const

Creates and returns a reinterpreted SYCL buffer with the type specified by ReinterpretT and dimensions specified by ReinterpretDim. Only valid when (ReinterpretDim == 1) or when ((ReinterpretDim == Dimensions) && (sizeof(ReinterpretT) == sizeof(T))). The buffer object being reinterpreted can be a SYCL sub-buffer that was created from a SYCL buffer. The implementation must throw an exception with the errc::invalid error code if the total size in bytes represented by this SYCL buffer (or sub-buffer) is not evenly divisible by sizeof(ReinterpretT). Reinterpreting a sub-buffer provides a reinterpreted view of the sub-buffer only, and does not change the offset or size of the sub-buffer view (in bytes) relative to the parent buffer.

4.7.2.2. Buffer properties

The properties that can be provided when constructing the SYCL buffer class are describe in Table 41.

Table 41. Properties supported by the SYCL buffer class
Property Description
property::buffer::use_host_ptr

The use_host_ptr property adds the requirement that the SYCL runtime must not allocate any memory for the SYCL buffer and instead uses the provided host pointer directly. This prevents the SYCL runtime from allocating additional temporary storage on the host.

This property has a special guarantee for buffers that are constructed from a hostData pointer. If a host_accessor is constructed from such a buffer, then the address of the reference type returned from the accessor’s member functions such as operator[](id<>) will be the same as the corresponding hostData address.

property::buffer::use_mutex

The use_mutex property is valid for the SYCL buffer, unsampled_image and sampled_image classes. The property adds the requirement that the memory which is owned by the SYCL buffer can be shared with the application via a std::mutex provided to the property. The mutex m is locked by the runtime whenever the data is in use and unlocked otherwise. Data is synchronized with hostData, when the mutex is unlocked by the runtime.

property::buffer::context_bound

The context_bound property adds the requirement that the SYCL buffer can only be associated with a single SYCL context that is provided to the property.

The constructors and special member functions of the buffer property classes are listed in Table 42 and Table 43 respectively.

Table 42. Constructors of the buffer property classes
Constructor Description
property::buffer::use_host_ptr::use_host_ptr()

Constructs a SYCL use_host_ptr property instance.

property::buffer::use_mutex::use_mutex(std::mutex& mutexRef)

Constructs a SYCL use_mutex property instance with a reference to mutexRef parameter provided.

property::buffer::context_bound::context_bound(context boundContext)

Constructs a SYCL context_bound property instance with a copy of a SYCL context.

Table 43. Member functions of the buffer property classes
Member function Description
std::mutex* property::buffer::use_mutex::get_mutex_ptr() const

Returns the std::mutex which was specified when constructing this SYCL use_mutex property.

context property::buffer::context_bound::get_context() const

Returns the context which was specified when constructing this SYCL context_bound property.

4.7.2.3. Buffer synchronization rules

Buffers are reference-counted. When a buffer value is constructed from another buffer, the two values reference the same buffer and a reference count is incremented. When a buffer value is destroyed, the reference count is decremented. Only when there are no more buffer values that reference a specific buffer is the actual buffer destroyed and the buffer destruction behavior defined below is followed.

If any error occurs on buffer destruction, it is reported via the associated queue’s asynchronous error handling mechanism.

The basic rule for the blocking behavior of a buffer destructor is that it blocks if there is some data to write back because a write accessor on it has been created, or if the buffer was constructed with attached host memory and is still in use.

More precisely:

  1. A buffer can be constructed from a range (and without a hostData pointer). The memory management for this type of buffer is entirely handled by the SYCL system. The destructor for this type of buffer does not need to block, even if work on the buffer has not completed. Instead, the SYCL system frees any storage required for the buffer asynchronously when it is no longer in use in queues. The initial contents of the buffer are unspecified.

  2. A buffer can be constructed from a hostData pointer. The buffer will use this host memory for its full lifetime, but the contents of this host memory are unspecified for the lifetime of the buffer. If the host memory is modified on the host or if it is used to construct another buffer or image during the lifetime of this buffer, then the results are undefined. The initial contents of the buffer will be the contents of the host memory at the time of construction.

    When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed, then copy the contents of the buffer back to the host memory (if required) and then return.

    1. If the type of the host data is const, then the buffer is read-only; only read accessors are allowed on the buffer and no-copy-back to host memory is performed (although the host memory must still be kept available for use by SYCL). When using the default buffer allocator, the const-ness of the type will be removed in order to allow host allocation of memory, which will allow temporary host copies of the data by the SYCL runtime, for example for speeding up host accesses.

      When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed and then return, as there is no copy of data back to host.

    2. If the type of the host data is not const but the pointer to host data is const, then the read-only restriction applies only on host and not on device accesses.

      When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed.

  3. A buffer can be constructed using a shared_ptr to host data. This pointer is shared between the SYCL application and the runtime. In order to allow synchronization between the application and the runtime a mutex is used which will be locked by the runtime whenever the data is in use, and unlocked when it is no longer needed.

    The shared_ptr reference counting is used in order to prevent destroying the buffer host data prematurely. If the shared_ptr is deleted from the user application before buffer destruction, the buffer can continue securely because the pointer hasn’t been destroyed yet. It will not copy data back to the host before destruction, however, as the application side has already deleted its copy.

    Note that since there is an implicit conversion of a std::unique_ptr to a std::shared_ptr, a std::unique_ptr can also be used to pass the ownership to the SYCL runtime.

  4. A buffer can be constructed from a pair of iterator values. In this case, the buffer construction will copy the data from the data range defined by the iterator pair. The destructor will not copy back any data and does not need to block.

  5. A buffer can be constructed from a container on which std::data(container) and std::size(container) are well-formed. The initial contents of the buffer will be the contents of the container at the time of construction.

    The buffer may use the memory within the container for its full lifetime, and the contents of this memory are unspecified for the lifetime of the buffer. If the container memory is modified by the host during the lifetime of this buffer, then the results are undefined.

    When the buffer is destroyed, the destructor will block until all work in queues on the buffer have completed. If the return type of std::data(container) is not const then the destructor will also copy the contents of the buffer to the container (if required).

If set_final_data() is used to change where to write the data back to, then the destructor of the buffer will block if a write accessor on it has been created.

A sub-buffer object can be created which is a sub-range reference to a base buffer. This sub-buffer can be used to create accessors to the base buffer, which have access to the range specified at time of construction of the sub-buffer. Sub-buffers cannot be created from sub-buffers, but only from a base buffer which is not already a sub-buffer.

Sub-buffers must be constructed from a contiguous region of memory in a buffer. This requirement is potentially non-intuitive when working with buffers that have dimensionality larger than one, but maps to one-dimensional SYCL backend native allocations without performance cost due to index mapping computation. For example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
buffer<int, 2> parent_buffer { range<2> {
    8, 8 } }; // Create 2-d buffer with 8x8 ints

// OK: Contiguous region from middle of buffer
buffer<int, 2> sub_buf1 { parent_buffer, /*offset*/ range<2> { 2, 0 },
                          /*size*/ range<2> { 2, 8 } };

// invalid exception: Non-contiguous regions of 2-d buffer
buffer<int, 2