Name

    NVX_linked_gpu_multicast

Name Strings

    GL_NVX_linked_gpu_multicast

Contact

    Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
    Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

Contributors

    Christoph Kubisch, NVIDIA
    Mark Kilgard, NVIDIA

Status

    Shipping in NVIDIA release 361 drivers.

Version

    Last Modified Date: July 21, 2016
    NVIDIA Revision: 4

Number

    OpenGL Extension #493

Dependencies

    This extension is written against the OpenGL 4.5 specification
    (Compatibility Profile), dated February 2, 2015.

    This extension interacts with ARB_sparse_buffer.

    This extension interacts with ARB_copy_image.

    This extension interacts with EXT_direct_state_access.

    This extension interacts with ARB_shader_viewport_layer_array.

Overview

    This extension enables novel multi-GPU rendering techniques by providing
    application control over a group of linked GPUs with identical hardware
    configuration.

    Multi-GPU rendering techniques fall into two categories: implicit and
    explicit. Existing explicit approaches like WGL_NV_gpu_affinity have two
    main drawbacks: CPU overhead and application complexity. An application
    must manage one context per GPU and multi-pump the API stream.

    Implicit multi-GPU rendering techniques avoid these issues by
    broadcasting rendering from one context to multiple GPUs. Common
    implicit approaches include alternate-frame rendering (AFR), split-frame
    rendering (SFR) and multi-GPU anti-aliasing. Each has drawbacks. AFR
    scales nicely but interacts poorly with inter-frame dependencies. SFR
    can improve latency but has challenges with offscreen rendering and
    scaling of vertex processing. With multi-GPU anti-aliasing, each GPU
    renders the same content with alternate sample positions and the driver
    blends the result to improve quality. This also has issues with
    offscreen rendering and can conflict with other anti-aliasing
    techniques.

    These issues with implicit multi-GPU rendering all have the same root
    cause: the driver lacks adequate knowledge to accelerate every
    application. To resolve this, NVX_linked_gpu_multicast provides
    application control over multiple GPUs with a single context.

    Key points:

    - One context controls multiple GPUs. Every GPU in the linked group can
      access every object.

    - Rendering is broadcast. Each draw is repeated across all GPUs in the
      linked group.

    - Each GPU gets its own instance of all framebuffers and attached
      textures, allowing individualized output for each GPU. Input data can
      be customized for each GPU using buffers created with the storage flag
      LGPU_SEPARATE_STORAGE_BIT_NVX, and a new API, LGPUNamedBufferSubDataNVX.

    - Textures can be transferred from one GPU to another using
      LGPUCopyImageSubDataNVX.
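    For illustration, a minimal sketch of per-GPU buffer customization as
    described in the key points above, assuming a two-GPU linked group; the
    ubo name and the leftData/rightData contents are hypothetical, and a
    complete stereo example appears in the Sample Code section below:

        // Sketch only: assumes two linked GPUs and hypothetical per-eye data.
        GLuint ubo;
        glCreateBuffers(1, &ubo);

        // Request per-GPU storage so each GPU can hold different contents.
        glNamedBufferStorage(ubo, sizeof(leftData), NULL,
            GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);

        // Bitmask 0x1 selects GPU 0, 0x2 selects GPU 1 (mask = 2^index).
        glLGPUNamedBufferSubDataNVX(0x1, ubo, 0, sizeof(leftData),  &leftData);
        glLGPUNamedBufferSubDataNVX(0x2, ubo, 0, sizeof(rightData), &rightData);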
New Procedures and Functions

    void LGPUNamedBufferSubDataNVX(bitfield gpuMask, uint buffer,
                                   intptr offset, sizeiptr size,
                                   const void *data);

    void LGPUCopyImageSubDataNVX(uint sourceGpu, bitfield destinationGpuMask,
                                 uint srcName, enum srcTarget, int srcLevel,
                                 int srcX, int srcY, int srcZ,
                                 uint dstName, enum dstTarget, int dstLevel,
                                 int dstX, int dstY, int dstZ,
                                 sizei width, sizei height, sizei depth);

    void LGPUInterlockNVX(void);

New Tokens

    Accepted in the <flags> parameter of BufferStorage and
    NamedBufferStorageEXT:

        LGPU_SEPARATE_STORAGE_BIT_NVX                  0x0800

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetInteger64v, GetFloatv, and GetDoublev:

        MAX_LGPU_GPUS_NVX                              0x92BA

Additions to the OpenGL 4.5 Specification (Compatibility Profile)

    (Add a new chapter after chapter 19 "Compute Shaders")

    20 Multicast Rendering

    This chapter specifies commands for using multiple GPUs in a linked
    group. Commands are multicast, or repeated across all linked GPUs.
    Objects are shared by all GPUs; however, each GPU has its own instance
    (copy) of many resources, including framebuffers. When each GPU has its
    own instance of a resource, it is considered to have per-GPU storage.
    When all GPUs share a single instance of a resource, this is considered
    GPU-shared storage.

    The mechanism for linking GPUs is implementation specific, as is the
    process-global mechanism for enabling multicast rendering support (if
    necessary). The number of GPUs usable for multicast rendering by a
    context can be queried by calling GetIntegerv with the symbolic constant
    MAX_LGPU_GPUS_NVX. Individual GPUs are identified using zero-based
    indices in the range [0, n-1], where n is the number of multicast GPUs.
    GPUs are also identified by bitmasks of the form 2^i, where i is the GPU
    index. A set of GPUs is specified by the union of masks for each GPU in
    the set.

    20.1 Multi-GPU Buffer Storage

    Like other resources, buffer objects can have two types of storage,
    per-GPU storage or GPU-shared storage. Per-GPU storage can be explicitly
    requested using the LGPU_SEPARATE_STORAGE_BIT_NVX flag with
    BufferStorage/NamedBufferStorageEXT. If this flag is not set, the type
    of storage used is undefined. The implementation may use either type and
    transition between them at any time. Client reads of a buffer with
    per-GPU storage may source from any GPU.

    The following rules apply to buffer objects with per-GPU storage:

    - When mapped with WRITE_ONLY access, writes apply to all GPUs.

    - When bound to UNIFORM_BUFFER, client uniform updates apply to all
      GPUs.

    - When used as the write buffer for CopyBufferSubData or
      CopyNamedBufferSubData, writes apply to all GPUs.

    The following commands affect storage on all GPUs, even if the buffer
    object has per-GPU storage: BufferSubData, NamedBufferSubData,
    ClearBufferSubData, and ClearNamedBufferData.

    An INVALID_VALUE error is generated if BufferStorage/
    NamedBufferStorageEXT is called with LGPU_SEPARATE_STORAGE_BIT_NVX set
    together with MAP_PERSISTENT_BIT or SPARSE_STORAGE_BIT_ARB.

    To modify buffer object data on one or more GPUs, the client may use
    the command

        void LGPUNamedBufferSubDataNVX(bitfield gpuMask, uint buffer,
                                       intptr offset, sizeiptr size,
                                       const void *data);

    This function operates similarly to NamedBufferSubData, except that it
    updates the per-GPU buffer data on the set of GPUs defined by <gpuMask>.

    An INVALID_VALUE error is generated if <gpuMask> is zero.

    An INVALID_OPERATION error is generated if <buffer> is not the name of
    an existing buffer object.
    An INVALID_VALUE error is generated if <offset> or <size> is negative,
    or if <offset> + <size> is greater than the value of BUFFER_SIZE for
    the buffer object.

    An INVALID_OPERATION error is generated if any part of the specified
    buffer range is mapped with MapBufferRange or MapBuffer (see section
    6.3), unless it was mapped with MAP_PERSISTENT_BIT set in the
    MapBufferRange <access> flags.

    An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE
    flag of the buffer object is TRUE and the value of BUFFER_STORAGE_FLAGS
    for the buffer does not have the DYNAMIC_STORAGE_BIT set.

    20.2 Multi-GPU Framebuffers and Textures

    All buffers in the default framebuffer, as well as renderbuffers and
    textures bound to framebuffer objects, receive per-GPU storage. Storage
    for other textures is undefined: it may be per-GPU or GPU-shared and
    can transition between the types at any time.

    To copy texel data between GPUs, the client may use the command

        void LGPUCopyImageSubDataNVX(uint sourceGpu,
                                     bitfield destinationGpuMask,
                                     uint srcName, enum srcTarget,
                                     int srcLevel,
                                     int srcX, int srcY, int srcZ,
                                     uint dstName, enum dstTarget,
                                     int dstLevel,
                                     int dstX, int dstY, int dstZ,
                                     sizei width, sizei height,
                                     sizei depth);

    This function operates similarly to CopyImageSubData, except that it
    takes a source GPU, <sourceGpu>, and a destination GPU set defined by
    <destinationGpuMask>.

    INVALID_ENUM is generated

      * if either <srcTarget> or <dstTarget>

        - is not RENDERBUFFER or a valid non-proxy texture target,
        - is TEXTURE_BUFFER, or
        - is one of the cubemap face selectors described in table 3.17,

      * if the target does not match the type of the object.

    INVALID_OPERATION is generated

      * if either object is a texture and the texture is not complete,

      * if the source and destination formats are not compatible,

      * if the source and destination number of samples do not match,

      * if one image is compressed and the other is uncompressed and the
        block size of the compressed image is not equal to the texel size
        of the uncompressed image.

    INVALID_VALUE is generated

      * if <sourceGpu> is greater than or equal to MAX_LGPU_GPUS_NVX,

      * if <destinationGpuMask> is zero,

      * if either <srcName> or <dstName> does not correspond to a valid
        renderbuffer or texture object according to the corresponding
        target parameter,

      * if the specified level is not a valid level for the image,

      * if the dimensions of either subregion exceed the boundaries of the
        corresponding image object, or

      * if the image format is compressed and the dimensions of the
        subregion fail to meet the alignment constraints of the format.

    20.3 Multi-GPU Synchronization

    LGPUCopyImageSubDataNVX provides implicit synchronization with previous
    rendering to the given texture or renderbuffer on the source GPU.
    Synchronization of the copy with the destination GPU(s) is achieved
    with the interlock function

        void LGPUInterlockNVX(void);

    which is called to synchronize all linked GPUs to the same point in the
    API stream. To guarantee consistency, the interlock command must be
    used as a barrier between any two accesses by multiple GPUs to the same
    memory when at least one of the accesses is a write.

    For consistent copies between GPUs, synchronization is required before
    and after each copy:

    1. Prior to each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX()
       must be called after the most recent read or write of the target
       image by a destination GPU.

    2. After each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must
       be called prior to any future read or write of the target image by a
       destination GPU.

    GPU writes and reads to/from GPU-shared locations require
    synchronization as well.
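    For illustration, a minimal sketch of the copy pattern required by
    rules 1 and 2 above, assuming a copy from GPU 1 to GPU 0; the srcTex,
    dstTex, width and height names are hypothetical:

        // The destination GPU (GPU 0) has issued its last read/write of
        // dstTex before this point.
        glLGPUInterlockNVX();                 // rule 1: interlock before the copy

        // Copy a 2D region of srcTex on GPU 1 into dstTex on GPU 0 (mask 0x1).
        glLGPUCopyImageSubDataNVX(1, 0x1,
                                  srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                  dstTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                  width, height, 1);

        glLGPUInterlockNVX();                 // rule 2: interlock before GPU 0
                                              // reads or writes dstTex again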
    GPU writes such as transform feedback, shader image store, CopyTexImage,
    and CopyBufferSubData are not automatically synchronized with writes by
    other GPUs. Neither are GPU reads such as texture fetches, shader image
    loads, CopyTexImage, etc. synchronized with writes by other GPUs.
    Existing barriers such as TextureBarrier and MemoryBarrier only provide
    consistency guarantees for rendering, writes and reads on a single GPU.

Additions to the AGL/GLX/WGL Specifications

    None

GLX Protocol

    None

Errors

    Relaxation of INVALID_ENUM errors
    ---------------------------------

    GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now
    accept new tokens as described in the "New Tokens" section.

New State

    None

New Implementation Dependent State

    Add to Table 23.82, Implementation-Dependent Values, p. 784

                                            Minimum
    Get Value          Type  Get Command    Value    Description        Sec.  Attribute
    -----------------  ----  -------------  -------  -----------------  ----  ---------
    MAX_LGPU_GPUS_NVX  Z+    GetIntegerv    2        Maximum number of  6.9   -
                                                     usable GPUs

Sample Code

    Binocular stereo rendering example using NVX_linked_gpu_multicast with
    a single-GPU fallback:

        struct ViewData {
            GLint viewport_index;
            GLfloat mvp[16];
            GLfloat modelview[16];
        };

        ViewData leftViewData  = { 0, {...}, {...} };
        ViewData rightViewData = { 1, {...}, {...} };

        GLuint ubo[2];
        glCreateBuffers(2, &ubo[0]);

        if (has_NVX_linked_gpu_multicast) {
            glNamedBufferStorage(ubo[0], size, NULL,
                GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);
            glLGPUNamedBufferSubDataNVX(0x1, ubo[0], 0, size, &leftViewData);
            glLGPUNamedBufferSubDataNVX(0x2, ubo[0], 0, size, &rightViewData);
        } else {
            glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
            glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
        }

        glViewportIndexedf(0,   0, 0, 640, 480);  // left viewport
        glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport

        // Vertex shader sets gl_ViewportIndex according to viewport_index
        // in the UBO

        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        if (has_NVX_linked_gpu_multicast) {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
            drawScene();

            // Make GPU 1 wait for the clear and draw above to complete on
            // GPU 0
            glLGPUInterlockNVX();

            // Copy the right viewport from GPU 1 to GPU 0
            glLGPUCopyImageSubDataNVX(1, 0x1,
                renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                640, 480, 1);

            // Make GPU 0 wait for GPU 1's copy to GPU 0 to complete
            glLGPUInterlockNVX();
        } else {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
            drawScene();
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
            drawScene();
        }

        // Both viewports are now present in GPU 0's renderbuffer

Issues

    (1) Should we provide an explicit inter-GPU synchronization API? Will
        this make the implementation easier or harder for the driver and
        applications?

        RESOLVED. Yes. A naive implementation of implicit synchronization
        would simply interlock the GPUs before and after each copy. Smart
        implicit synchronization would have to track all APIs that can
        modify buffers and textures, creating an excessive burden for
        driver implementation and maintenance. An application can track
        dependencies more easily and can outperform a naive driver
        implementation by using explicit synchronization.

    (2) How does this extension interact with queries (e.g. occlusion
        queries)?

        RESOLVED. Queries are performed separately on each GPU. The
        standard GetQueryObject* APIs return query results for GPU 0 only.
        However, GetQueryBufferObject* can be used to retrieve query
        results for all GPUs through a buffer with separate storage
        (LGPU_SEPARATE_STORAGE_BIT_NVX).
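        For illustration, a minimal sketch of this approach; the query,
        resultBuf and drawScene names are hypothetical, and how the per-GPU
        copies of the result are subsequently consumed is
        application-specific (client reads of a per-GPU buffer may source
        from any GPU):

            // Sketch only: per-GPU occlusion query results via a
            // separate-storage buffer.
            GLuint query, resultBuf;
            glCreateQueries(GL_SAMPLES_PASSED, 1, &query);
            glCreateBuffers(1, &resultBuf);
            glNamedBufferStorage(resultBuf, sizeof(GLuint), NULL,
                GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);

            glBeginQuery(GL_SAMPLES_PASSED, query);
            drawScene();          // broadcast; each GPU counts its own samples
            glEndQuery(GL_SAMPLES_PASSED);

            // The result lands in each GPU's copy of resultBuf; consume the
            // per-GPU values on-GPU (e.g. from a shader) or copy them out in
            // an application-defined way.
            glGetQueryBufferObjectuiv(query, resultBuf, GL_QUERY_RESULT, 0);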
    (3) Which textures and buffers have separate storage for each GPU?

        The default framebuffer and framebuffer texture attachments, as
        well as buffers allocated with LGPU_SEPARATE_STORAGE_BIT_NVX. Other
        buffers and textures may or may not have separate storage.

    (4) Should we provide a mechanism to modify viewports independently for
        each GPU?

        RESOLVED. No. This can be achieved using multicast UBOs and
        ARB_shader_viewport_layer_array.

    (5) Should we expose this extension on single-GPU configurations?

        RESOLVED. No. The extension provides no value unless the value of
        MAX_LGPU_GPUS_NVX is greater than 1. Limiting exposure to these
        configurations guarantees that at least two GPUs will be available
        when the extension is reported.

    (6) Can rendering be enabled/disabled on a specific subset of GPUs?

        This functionality will be added in a future version of this
        extension.

    (7) Should glGet*BufferParameter* return the
        LGPU_SEPARATE_STORAGE_BIT_NVX bit when BUFFER_STORAGE_FLAGS is
        queried?

        RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the <flags>
        parameter input to *BufferStorage, as specified in table 6.3.

Revision History

    Rev.    Date      Author   Changes
    ----    --------  -------  -----------------------------------------
     4      07/21/16  mjk      Register extension