Name

    NVX_linked_gpu_multicast

Name Strings

    GL_NVX_linked_gpu_multicast

Contact

    Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
    Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

Contributors

    Christoph Kubisch, NVIDIA
    Mark Kilgard, NVIDIA

Status

    Shipping in NVIDIA release 361 drivers.

Version

    Last Modified Date: July 21, 2016
    NVIDIA Revision: 4

Number

    OpenGL Extension #493

Dependencies

    This extension is written against the OpenGL 4.5 specification
    (Compatibility Profile), dated February 2, 2015.

    This extension interacts with ARB_sparse_buffer.

    This extension interacts with ARB_copy_image.

    This extension interacts with EXT_direct_state_access.

    This extension interacts with ARB_shader_viewport_layer_array.

Overview

    This extension enables novel multi-GPU rendering techniques by providing
    application control over a group of linked GPUs with identical hardware
    configuration.

    Multi-GPU rendering techniques fall into two categories: implicit and
    explicit. Existing explicit approaches like WGL_NV_gpu_affinity have two
    main drawbacks: CPU overhead and application complexity. An application
    must manage one context per GPU and multi-pump the API stream.

    Implicit multi-GPU rendering techniques avoid these issues by
    broadcasting rendering from one context to multiple GPUs. Common
    implicit approaches include alternate-frame rendering (AFR), split-frame
    rendering (SFR) and multi-GPU anti-aliasing. Each has drawbacks. AFR
    scales nicely but interacts poorly with inter-frame dependencies. SFR
    can improve latency but has challenges with offscreen rendering and
    scaling of vertex processing. With multi-GPU anti-aliasing, each GPU
    renders the same content with alternate sample positions and the driver
    blends the result to improve quality. This also has issues with
    offscreen rendering and can conflict with other anti-aliasing
    techniques.

    These issues with implicit multi-GPU rendering all have the same root
    cause: the driver lacks adequate knowledge to accelerate every
    application. To resolve this, NVX_linked_gpu_multicast provides
    application control over multiple GPUs with a single context.

    Key points:

    - One context controls multiple GPUs. Every GPU in the linked group can
      access every object.

    - Rendering is broadcast. Each draw is repeated across all GPUs in the
      linked group.

    - Each GPU gets its own instance of all framebuffers and attached
      textures, allowing individualized output for each GPU. Input data can
      be customized for each GPU using buffers created with the storage flag
      LGPU_SEPARATE_STORAGE_BIT_NVX, and a new API, LGPUNamedBufferSubDataNVX.

    - Textures can be transferred from one GPU to another using
      LGPUCopyImageSubDataNVX.
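    For illustration, a minimal sketch of per-GPU buffer customization as
    described in the key points above, assuming a two-GPU linked group; the
    ubo name and the leftData/rightData contents are hypothetical, and a
    complete stereo example appears in the Sample Code section below:

        // Sketch only: assumes two linked GPUs and hypothetical per-eye data.
        GLuint ubo;
        glCreateBuffers(1, &ubo);

        // Request per-GPU storage so each GPU can hold different contents.
        glNamedBufferStorage(ubo, sizeof(leftData), NULL,
            GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);

        // Bitmask 0x1 selects GPU 0, 0x2 selects GPU 1 (mask = 2^index).
        glLGPUNamedBufferSubDataNVX(0x1, ubo, 0, sizeof(leftData),  &leftData);
        glLGPUNamedBufferSubDataNVX(0x2, ubo, 0, sizeof(rightData), &rightData);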
New Procedures and Functions

    void LGPUNamedBufferSubDataNVX(bitfield gpuMask, uint buffer,
                                   intptr offset, sizeiptr size,
                                   const void *data);

    void LGPUCopyImageSubDataNVX(uint sourceGpu, bitfield destinationGpuMask,
                                 uint srcName, enum srcTarget, int srcLevel,
                                 int srcX, int srcY, int srcZ,
                                 uint dstName, enum dstTarget, int dstLevel,
                                 int dstX, int dstY, int dstZ,
                                 sizei width, sizei height, sizei depth);

    void LGPUInterlockNVX(void);

New Tokens

    Accepted in the <flags> parameter of BufferStorage and
    NamedBufferStorageEXT:

        LGPU_SEPARATE_STORAGE_BIT_NVX                  0x0800

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetInteger64v, GetFloatv, and GetDoublev:

        MAX_LGPU_GPUS_NVX                              0x92BA

Additions to the OpenGL 4.5 Specification (Compatibility Profile)

    (Add a new chapter after chapter 19 "Compute Shaders")

    20 Multicast Rendering

    This chapter specifies commands for using multiple GPUs in a linked
    group. Commands are multicast, or repeated across all linked GPUs.
    Objects are shared by all GPUs; however, each GPU has its own instance
    (copy) of many resources, including framebuffers. When each GPU has its
    own instance of a resource, it is considered to have per-GPU storage.
    When all GPUs share a single instance of a resource, this is considered
    GPU-shared storage.

    The mechanism for linking GPUs is implementation specific, as is the
    process-global mechanism for enabling multicast rendering support (if
    necessary). The number of GPUs usable for multicast rendering by a
    context can be queried by calling GetIntegerv with the symbolic constant
    MAX_LGPU_GPUS_NVX. Individual GPUs are identified using zero-based
    indices in the range [0, n-1], where n is the number of multicast GPUs.
    GPUs are also identified by bitmasks of the form 2^i, where i is the GPU
    index. A set of GPUs is specified by the union of masks for each GPU in
    the set.

    20.1 Multi-GPU Buffer Storage

    Like other resources, buffer objects can have two types of storage,
    per-GPU storage or GPU-shared storage. Per-GPU storage can be explicitly
    requested using the LGPU_SEPARATE_STORAGE_BIT_NVX flag with
    BufferStorage/NamedBufferStorageEXT. If this flag is not set, the type
    of storage used is undefined. The implementation may use either type and
    transition between them at any time. Client reads of a buffer with
    per-GPU storage may source from any GPU.

    The following rules apply to buffer objects with per-GPU storage:

    - When mapped with WRITE_ONLY access, writes apply to all GPUs.

    - When bound to UNIFORM_BUFFER, client uniform updates apply to all
      GPUs.

    - When used as the write buffer for CopyBufferSubData or
      CopyNamedBufferSubData, writes apply to all GPUs.

    The following commands affect storage on all GPUs, even if the buffer
    object has per-GPU storage: BufferSubData, NamedBufferSubData,
    ClearBufferSubData, and ClearNamedBufferData.

    An INVALID_VALUE error is generated if BufferStorage/
    NamedBufferStorageEXT is called with LGPU_SEPARATE_STORAGE_BIT_NVX set
    together with MAP_PERSISTENT_BIT or SPARSE_STORAGE_BIT_ARB.

    To modify buffer object data on one or more GPUs, the client may use
    the command

        void LGPUNamedBufferSubDataNVX(bitfield gpuMask, uint buffer,
                                       intptr offset, sizeiptr size,
                                       const void *data);

    This function operates similarly to NamedBufferSubData, except that it
    updates the per-GPU buffer data on the set of GPUs defined by <gpuMask>.

    An INVALID_VALUE error is generated if <gpuMask> is zero.

    An INVALID_OPERATION error is generated if <buffer> is not the name of
    an existing buffer object.
    An INVALID_VALUE error is generated if <offset> or <size> is negative,
    or if <offset> + <size> is greater than the value of BUFFER_SIZE for
    the buffer object.

    An INVALID_OPERATION error is generated if any part of the specified
    buffer range is mapped with MapBufferRange or MapBuffer (see section
    6.3), unless it was mapped with MAP_PERSISTENT_BIT set in the
    MapBufferRange <access> flags.

    An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE
    flag of the buffer object is TRUE and the value of BUFFER_STORAGE_FLAGS
    for the buffer does not have the DYNAMIC_STORAGE_BIT set.

    20.2 Multi-GPU Framebuffers and Textures

    All buffers in the default framebuffer, as well as renderbuffers and
    textures bound to framebuffer objects, receive per-GPU storage. Storage
    for other textures is undefined: it may be per-GPU or GPU-shared and
    can transition between the types at any time.

    To copy texel data between GPUs, the client may use the command

        void LGPUCopyImageSubDataNVX(uint sourceGpu,
                                     bitfield destinationGpuMask,
                                     uint srcName, enum srcTarget,
                                     int srcLevel,
                                     int srcX, int srcY, int srcZ,
                                     uint dstName, enum dstTarget,
                                     int dstLevel,
                                     int dstX, int dstY, int dstZ,
                                     sizei width, sizei height,
                                     sizei depth);

    This function operates similarly to CopyImageSubData, except that it
    takes a source GPU, <sourceGpu>, and a destination GPU set defined by
    <destinationGpuMask>.

    INVALID_ENUM is generated

      * if either <srcTarget> or <dstTarget>

        - is not RENDERBUFFER or a valid non-proxy texture target,
        - is TEXTURE_BUFFER, or
        - is one of the cubemap face selectors described in table 3.17,

      * if the target does not match the type of the object.

    INVALID_OPERATION is generated

      * if either object is a texture and the texture is not complete,

      * if the source and destination formats are not compatible,

      * if the source and destination number of samples do not match,

      * if one image is compressed and the other is uncompressed and the
        block size of the compressed image is not equal to the texel size
        of the uncompressed image.

    INVALID_VALUE is generated

      * if <sourceGpu> is greater than or equal to MAX_LGPU_GPUS_NVX,

      * if <destinationGpuMask> is zero,

      * if either <srcName> or <dstName> does not correspond to a valid
        renderbuffer or texture object according to the corresponding
        target parameter,

      * if the specified level is not a valid level for the image,

      * if the dimensions of either subregion exceed the boundaries of the
        corresponding image object, or

      * if the image format is compressed and the dimensions of the
        subregion fail to meet the alignment constraints of the format.

    20.3 Multi-GPU Synchronization

    LGPUCopyImageSubDataNVX provides implicit synchronization with previous
    rendering to the given texture or renderbuffer on the source GPU.
    Synchronization of the copy with the destination GPU(s) is achieved
    with the interlock function

        void LGPUInterlockNVX(void);

    which is called to synchronize all linked GPUs to the same point in the
    API stream. To guarantee consistency, the interlock command must be
    used as a barrier between any two accesses by multiple GPUs to the same
    memory when at least one of the accesses is a write.

    For consistent copies between GPUs, synchronization is required before
    and after each copy:

    1. Prior to each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX()
       must be called after the most recent read or write of the target
       image by a destination GPU.

    2. After each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must
       be called prior to any future read or write of the target image by a
       destination GPU.

    GPU writes and reads to/from GPU-shared locations require
    synchronization as well.
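    For illustration, a minimal sketch of the copy pattern required by
    rules 1 and 2 above, assuming a copy from GPU 1 to GPU 0; the srcTex,
    dstTex, width and height names are hypothetical:

        // The destination GPU (GPU 0) has issued its last read/write of
        // dstTex before this point.
        glLGPUInterlockNVX();                 // rule 1: interlock before the copy

        // Copy a 2D region of srcTex on GPU 1 into dstTex on GPU 0 (mask 0x1).
        glLGPUCopyImageSubDataNVX(1, 0x1,
                                  srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                  dstTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                  width, height, 1);

        glLGPUInterlockNVX();                 // rule 2: interlock before GPU 0
                                              // reads or writes dstTex again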
    GPU writes such as transform feedback, shader image store, CopyTexImage,
    and CopyBufferSubData are not automatically synchronized with writes by
    other GPUs. Neither are GPU reads such as texture fetches, shader image
    loads, CopyTexImage, etc. synchronized with writes by other GPUs.
    Existing barriers such as TextureBarrier and MemoryBarrier only provide
    consistency guarantees for rendering, writes and reads on a single GPU.

Additions to the AGL/GLX/WGL Specifications

    None

GLX Protocol

    None

Errors

    Relaxation of INVALID_ENUM errors
    ---------------------------------

    GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now
    accept new tokens as described in the "New Tokens" section.

New State

    None

New Implementation Dependent State

    Add to Table 23.82, Implementation-Dependent Values, p. 784

                                            Minimum
    Get Value          Type  Get Command    Value    Description        Sec.  Attribute
    -----------------  ----  -------------  -------  -----------------  ----  ---------
    MAX_LGPU_GPUS_NVX  Z+    GetIntegerv    2        Maximum number of  6.9   -
                                                     usable GPUs

Sample Code

    Binocular stereo rendering example using NVX_linked_gpu_multicast with
    a single-GPU fallback:

        struct ViewData {
            GLint viewport_index;
            GLfloat mvp[16];
            GLfloat modelview[16];
        };

        ViewData leftViewData  = { 0, {...}, {...} };
        ViewData rightViewData = { 1, {...}, {...} };

        GLuint ubo[2];
        glCreateBuffers(2, &ubo[0]);

        if (has_NVX_linked_gpu_multicast) {
            glNamedBufferStorage(ubo[0], size, NULL,
                GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);
            glLGPUNamedBufferSubDataNVX(0x1, ubo[0], 0, size, &leftViewData);
            glLGPUNamedBufferSubDataNVX(0x2, ubo[0], 0, size, &rightViewData);
        } else {
            glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
            glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
        }

        glViewportIndexedf(0,   0, 0, 640, 480);  // left viewport
        glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport

        // Vertex shader sets gl_ViewportIndex according to viewport_index
        // in the UBO

        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        if (has_NVX_linked_gpu_multicast) {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
            drawScene();

            // Make GPU 1 wait for the clear and draw above to complete on
            // GPU 0
            glLGPUInterlockNVX();

            // Copy the right viewport from GPU 1 to GPU 0
            glLGPUCopyImageSubDataNVX(1, 0x1,
                renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                640, 480, 1);

            // Make GPU 0 wait for GPU 1's copy to GPU 0 to complete
            glLGPUInterlockNVX();
        } else {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
            drawScene();
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
            drawScene();
        }

        // Both viewports are now present in GPU 0's renderbuffer

Issues

    (1) Should we provide an explicit inter-GPU synchronization API? Will
        this make the implementation easier or harder for the driver and
        applications?

        RESOLVED. Yes. A naive implementation of implicit synchronization
        would simply interlock the GPUs before and after each copy. Smart
        implicit synchronization would have to track all APIs that can
        modify buffers and textures, creating an excessive burden for
        driver implementation and maintenance. An application can track
        dependencies more easily and can outperform a naive driver
        implementation by using explicit synchronization.

    (2) How does this extension interact with queries (e.g. occlusion
        queries)?

        RESOLVED. Queries are performed separately on each GPU. The
        standard GetQueryObject* APIs return query results for GPU 0 only.
        However, GetQueryBufferObject* can be used to retrieve query
        results for all GPUs through a buffer with separate storage
        (LGPU_SEPARATE_STORAGE_BIT_NVX).
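        For illustration, a minimal sketch of this approach; the query,
        resultBuf and drawScene names are hypothetical, and how the per-GPU
        copies of the result are subsequently consumed is
        application-specific (client reads of a per-GPU buffer may source
        from any GPU):

            // Sketch only: per-GPU occlusion query results via a
            // separate-storage buffer.
            GLuint query, resultBuf;
            glCreateQueries(GL_SAMPLES_PASSED, 1, &query);
            glCreateBuffers(1, &resultBuf);
            glNamedBufferStorage(resultBuf, sizeof(GLuint), NULL,
                GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);

            glBeginQuery(GL_SAMPLES_PASSED, query);
            drawScene();          // broadcast; each GPU counts its own samples
            glEndQuery(GL_SAMPLES_PASSED);

            // The result lands in each GPU's copy of resultBuf; consume the
            // per-GPU values on-GPU (e.g. from a shader) or copy them out in
            // an application-defined way.
            glGetQueryBufferObjectuiv(query, resultBuf, GL_QUERY_RESULT, 0);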
    (3) Which textures and buffers have separate storage for each GPU?

        The default framebuffer and framebuffer texture attachments, as
        well as buffers allocated with LGPU_SEPARATE_STORAGE_BIT_NVX. Other
        buffers and textures may or may not have separate storage.

    (4) Should we provide a mechanism to modify viewports independently for
        each GPU?

        RESOLVED. No. This can be achieved using multicast UBOs and
        ARB_shader_viewport_layer_array.

    (5) Should we expose this extension on single-GPU configurations?

        RESOLVED. No. The extension provides no value unless the value of
        MAX_LGPU_GPUS_NVX is greater than 1. Limiting exposure to these
        configurations guarantees that at least two GPUs will be available
        when the extension is reported.

    (6) Can rendering be enabled/disabled on a specific subset of GPUs?

        This functionality will be added in a future version of this
        extension.

    (7) Should glGet*BufferParameter* return the
        LGPU_SEPARATE_STORAGE_BIT_NVX bit when BUFFER_STORAGE_FLAGS is
        queried?

        RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the <flags>
        parameter input to *BufferStorage, as specified in table 6.3.

Revision History

    Rev.    Date      Author   Changes
    ----    --------  -------  -----------------------------------------
     4      07/21/16  mjk      Register extension