Name

    NV_gpu_multicast

Name Strings

    GL_NV_gpu_multicast

Contact

    Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
    Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

Contributors

    Christoph Kubisch, NVIDIA
    Mark Kilgard, NVIDIA
    Robert Menzel, NVIDIA
    Kevin Lefebvre, NVIDIA
    Ralf Biermann, NVIDIA

Status

    Shipping in NVIDIA release 370.XX drivers and up.

Version

    Last Modified Date:      April 2, 2019
    Revision:                7

Number

    OpenGL Extension #494

Dependencies

    This extension is written against the OpenGL 4.5 specification
    (Compatibility Profile), dated February 2, 2015.

    This extension requires ARB_copy_image.

    This extension interacts with ARB_sample_locations.

    This extension interacts with ARB_sparse_buffer.

    This extension requires EXT_direct_state_access.

    This extension interacts with EXT_bindable_uniform.

Overview

    This extension enables novel multi-GPU rendering techniques by
    providing application control over a group of linked GPUs with
    identical hardware configuration.

    Multi-GPU rendering techniques fall into two categories: implicit
    and explicit. Existing explicit approaches like WGL_NV_gpu_affinity
    have two main drawbacks: CPU overhead and application complexity. An
    application must manage one context per GPU and multi-pump the API
    stream. Implicit multi-GPU rendering techniques avoid these issues
    by broadcasting rendering from one context to multiple GPUs. Common
    implicit approaches include alternate-frame rendering (AFR),
    split-frame rendering (SFR) and multi-GPU anti-aliasing. They each
    have drawbacks. AFR scales nicely but interacts poorly with
    inter-frame dependencies. SFR can improve latency but has challenges
    with offscreen rendering and scaling of vertex processing. With
    multi-GPU anti-aliasing, each GPU renders the same content with
    alternate sample positions and the driver blends the result to
    improve quality. This also has issues with offscreen rendering and
    can conflict with other anti-aliasing techniques.

    These issues with implicit multi-GPU rendering all have the same
    root cause: the driver lacks adequate knowledge to accelerate every
    application. To resolve this, NV_gpu_multicast provides
    fine-grained, explicit application control over multiple GPUs with a
    single context.

    Key points:

    - One context controls multiple GPUs. Every GPU in the linked group
      can access every object.

    - Rendering is broadcast. Each draw is repeated across all GPUs in
      the linked group.

    - Each GPU gets its own instance of all framebuffers, allowing
      individualized output for each GPU. Input data can be customized
      for each GPU using buffers created with the storage flag
      PER_GPU_STORAGE_BIT_NV and a new API, MulticastBufferSubDataNV.

    - New interfaces provide mechanisms to transfer textures and buffers
      from one GPU to another.

New Procedures and Functions

    void RenderGpuMaskNV(bitfield mask);

    void MulticastBufferSubDataNV(
        bitfield gpuMask, uint buffer,
        intptr offset, sizeiptr size,
        const void *data);

    void MulticastCopyBufferSubDataNV(
        uint readGpu, bitfield writeGpuMask,
        uint readBuffer, uint writeBuffer,
        intptr readOffset, intptr writeOffset, sizeiptr size);

    void MulticastCopyImageSubDataNV(
        uint srcGpu, bitfield dstGpuMask,
        uint srcName, enum srcTarget,
        int srcLevel,
        int srcX, int srcY, int srcZ,
        uint dstName, enum dstTarget,
        int dstLevel,
        int dstX, int dstY, int dstZ,
        sizei srcWidth, sizei srcHeight, sizei srcDepth);

    void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                    int srcX0, int srcY0,
                                    int srcX1, int srcY1,
                                    int dstX0, int dstY0,
                                    int dstX1, int dstY1,
                                    bitfield mask, enum filter);

    void MulticastFramebufferSampleLocationsfvNV(uint gpu,
                                                 uint framebuffer,
                                                 uint start, sizei count,
                                                 const float *v);

    void MulticastBarrierNV(void);

    void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

    void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname,
                                     int *params);
    void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname,
                                      uint *params);
    void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname,
                                       int64 *params);
    void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname,
                                        uint64 *params);

New Tokens

    Accepted in the <flags> parameter of BufferStorage and
    NamedBufferStorageEXT:

        PER_GPU_STORAGE_BIT_NV                         0x0800

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetInteger64v, GetFloatv, and GetDoublev:

        MULTICAST_GPUS_NV                              0x92BA
        RENDER_GPU_MASK_NV                             0x9558

    Accepted as a value for <pname> for the TexParameter{if},
    TexParameter{if}v, TextureParameter{if}, TextureParameter{if}v,
    MultiTexParameter{if}EXT and MultiTexParameter{if}vEXT commands and
    for the <pname> parameter of GetTexParameter{if}v,
    GetTextureParameter{if}vEXT and GetMultiTexParameter{if}vEXT:

        PER_GPU_STORAGE_NV                             0x9548

    Accepted by the <pname> parameter of GetMultisamplefv:

        MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV      0x9549

Additions to the OpenGL 4.5 Specification (Compatibility Profile)

    (Add a new chapter after chapter 19 "Compute Shaders")

    20 Multicast Rendering

    Some implementations support multiple linked GPUs driven by a single
    context. Often the distribution of work to individual GPUs is
    managed by the GL without client knowledge. This chapter specifies
    commands for explicitly distributing work across GPUs in a linked
    group.

    Rendering can be enabled or disabled for specific GPUs. Draw
    commands are multicast, or repeated across all enabled GPUs. Objects
    are shared by all GPUs; however, each GPU has its own instance
    (copy) of many resources, including framebuffers. When each GPU has
    its own instance of a resource, it is considered to have per-GPU
    storage. When all GPUs share a single instance of a resource, this
    is considered GPU-shared storage.

    The mechanism for linking GPUs is implementation specific, as is the
    mechanism for enabling multicast rendering support (if necessary).
    The number of GPUs usable for multicast rendering by a context can
    be queried by calling GetIntegerv with the symbolic constant
    MULTICAST_GPUS_NV. This number is constant for the lifetime of a
    context.
    Individual GPUs are identified using zero-based indices in the range
    [0, n-1], where n is the number of multicast GPUs. GPUs are also
    identified by bitmasks of the form 2^i, where i is the GPU index. A
    set of GPUs is specified by the union of masks for each GPU in the
    set.

    20.1 Controlling Individual GPUs

    Render commands are restricted to a specific set of GPUs with

      void RenderGpuMaskNV(bitfield mask);

    The following errors apply to RenderGpuMaskNV:

    INVALID_OPERATION is generated
    * if <mask> is zero,
    * if <mask> is not zero and is greater than or equal to 2^n, where n
      is equal to MULTICAST_GPUS_NV,
    * if issued between BeginConditionalRender and the corresponding
      EndConditionalRender.

    If the command does not generate an error, RENDER_GPU_MASK_NV is set
    to <mask>. The default value of RENDER_GPU_MASK_NV is (2^n)-1.

    Render commands are skipped for a GPU that is not present in
    RENDER_GPU_MASK_NV. Examples of render commands are draw calls,
    clears, compute dispatches, and copies or pixel path operations that
    write to a framebuffer (e.g. DrawPixels, BlitFramebuffer). For a
    full list of render commands see section 2.4 (page 26).
    MulticastBlitFramebufferNV is an exception to this policy: while it
    is a rendering command, it has its own source and destination GPU
    parameters.

    Note that buffer and texture updates are not affected by
    RENDER_GPU_MASK_NV.

    20.2 Multi-GPU Buffer Storage

    Like other resources, buffer objects can have two types of storage,
    per-GPU storage or GPU-shared storage. Per-GPU storage can be
    explicitly requested using the PER_GPU_STORAGE_BIT_NV flag with
    BufferStorage/NamedBufferStorageEXT. If this flag is not set, the
    type of storage used is undefined. The implementation may use either
    type and transition between them at any time. Client reads of a
    buffer with per-GPU storage may source from any GPU.

    The following rules apply to buffer objects with per-GPU storage:

      When mapped, updates apply to all GPUs (only WRITE_ONLY access is
      supported).

      When used as the write buffer for CopyBufferSubData or
      CopyNamedBufferSubData, writes apply to all GPUs.

    The following commands affect storage on all GPUs, even if the
    buffer object has per-GPU storage:

      BufferSubData, NamedBufferSubData, ClearBufferSubData, and
      ClearNamedBufferData.

    An INVALID_VALUE error is generated if BufferStorage/
    NamedBufferStorageEXT is called with PER_GPU_STORAGE_BIT_NV set
    together with MAP_READ_BIT or SPARSE_STORAGE_BIT_ARB.

    To modify buffer object data on one or more GPUs, the client may use
    the command

      void MulticastBufferSubDataNV(
          bitfield gpuMask, uint buffer,
          intptr offset, sizeiptr size,
          const void *data);

    This command operates similarly to NamedBufferSubData, except that
    it updates the per-GPU buffer data on the set of GPUs defined by
    <gpuMask>. If <buffer> has GPU-shared storage, <gpuMask> is ignored
    and the shared instance of the buffer is updated.

    An INVALID_VALUE error is generated if <gpuMask> is zero or is
    greater than or equal to 2^n, where n is equal to MULTICAST_GPUS_NV.

    An INVALID_OPERATION error is generated if <buffer> is not the name
    of an existing buffer object.

    An INVALID_VALUE error is generated if <offset> or <size> is
    negative, or if <offset> + <size> is greater than the value of
    BUFFER_SIZE for the buffer object.

    An INVALID_OPERATION error is generated if any part of the specified
    buffer range is mapped with MapBufferRange or MapBuffer (see section
    6.3), unless it was mapped with MAP_PERSISTENT_BIT set in the
    MapBufferRange access flags.

    An INVALID_OPERATION error is generated if the
    BUFFER_IMMUTABLE_STORAGE flag of the buffer object is TRUE and the
    value of BUFFER_STORAGE_FLAGS for the buffer does not have the
    DYNAMIC_STORAGE_BIT set.
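
The GPU mask rules used by RenderGpuMaskNV and MulticastBufferSubDataNV (a mask is a union of 2^i bits, must be non-zero, and must be less than 2^n) can be sketched in plain C. This is only an illustration: the helper names `gpu_bit`, `all_gpus_mask` and `gpu_mask_is_valid` are hypothetical and not part of this extension, and `gpu_count` stands for the value an application would query with MULTICAST_GPUS_NV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bitmask 2^i for the GPU with zero-based index i. */
static uint32_t gpu_bit(unsigned gpu_index)
{
    assert(gpu_index < 32);
    return (uint32_t)1 << gpu_index;
}

/* Hypothetical helper: mask (2^n)-1 with all n multicast GPUs set.
 * This is the documented default value of RENDER_GPU_MASK_NV. */
static uint32_t all_gpus_mask(unsigned gpu_count)
{
    assert(gpu_count >= 1 && gpu_count <= 32);
    return gpu_count == 32 ? UINT32_MAX
                           : (((uint32_t)1 << gpu_count) - 1);
}

/* Hypothetical helper: true if mask is non-zero and less than 2^n,
 * i.e. it would pass the mask checks described for RenderGpuMaskNV
 * and MulticastBufferSubDataNV. */
static bool gpu_mask_is_valid(uint32_t mask, unsigned gpu_count)
{
    return mask != 0 && (mask & ~all_gpus_mask(gpu_count)) == 0;
}
```

An application could run such a check on the CPU before issuing a multicast command, for example to catch a GPU index passed where a mask was expected.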

    To copy between buffers created with PER_GPU_STORAGE_BIT_NV, the
    client may use the command

      void MulticastCopyBufferSubDataNV(
          uint readGpu, bitfield writeGpuMask,
          uint readBuffer, uint writeBuffer,
          intptr readOffset, intptr writeOffset, sizeiptr size);

    This command operates similarly to CopyNamedBufferSubData, while
    adding control over the source and destination GPU(s). The read GPU
    index is specified by <readGpu> and the set of write GPUs is
    specified by the mask in <writeGpuMask>.

    Implementations may also support this command with buffers not
    created with PER_GPU_STORAGE_BIT_NV. This support can be determined
    with one test copy with an error check (see error discussion below).
    Note that a buffer created without PER_GPU_STORAGE_BIT_NV is
    considered to have undefined storage and the behavior of the command
    depends on the storage type (per-GPU or GPU-shared) currently used
    for <writeBuffer>. If <writeBuffer> is using GPU-shared storage, the
    normal error checks apply but the command behaves as if
    <writeGpuMask> includes all GPUs. If <writeBuffer> is using per-GPU
    storage, the command behaves as if PER_GPU_STORAGE_BIT_NV were set;
    however, performance may be reduced.

    The following error may apply to MulticastCopyBufferSubDataNV on
    some implementations and not on others. In earlier revisions of this
    extension the error was required, therefore applications should
    perform a test copy using buffers without PER_GPU_STORAGE_BIT_NV
    before relying on that functionality:

    An INVALID_OPERATION error is generated if the value of
    BUFFER_STORAGE_FLAGS for <readBuffer> or <writeBuffer> does not have
    PER_GPU_STORAGE_BIT_NV set.

    The following errors apply to MulticastCopyBufferSubDataNV:

    An INVALID_OPERATION error is generated if <readBuffer> or
    <writeBuffer> is not the name of an existing buffer object.

    An INVALID_VALUE error is generated if any of <readOffset>,
    <writeOffset>, or <size> are negative, if <readOffset> + <size>
    exceeds the size of the source buffer object, or if <writeOffset> +
    <size> exceeds the size of the destination buffer object.

    An INVALID_OPERATION error is generated if either the source or
    destination buffer object is mapped, unless it was mapped with
    MAP_PERSISTENT_BIT set in the Map*BufferRange access flags.

    An INVALID_VALUE error is generated if <readGpu> is greater than or
    equal to MULTICAST_GPUS_NV.

    An INVALID_OPERATION error is generated if <writeGpuMask> is zero.
    An INVALID_VALUE error is generated if <writeGpuMask> is not zero
    and is greater than or equal to 2^n, where n is equal to
    MULTICAST_GPUS_NV.

    An INVALID_VALUE error is generated if the source and destination
    are the same buffer object, <readGpu> is present in <writeGpuMask>,
    and the ranges [<readOffset>; <readOffset> + <size>) and
    [<writeOffset>; <writeOffset> + <size>) overlap.

    20.3 Multi-GPU Framebuffers and Textures

    All buffers in the default framebuffer as well as renderbuffers
    receive per-GPU storage. By default, storage for textures is
    undefined: it may be per-GPU or GPU-shared and can transition
    between the types at any time. Per-GPU storage can be specified via
    [Multi]Tex[ture]Parameter{if}[v] with PER_GPU_STORAGE_NV for the
    <pname> argument and TRUE for the value. For this storage parameter
    to take effect, it must be specified after the texture object is
    created and before the texture contents are defined by TexImage*,
    TexStorage* or TextureStorage*.

    20.3.1 Copying Image Data Between GPUs

    To copy texel data between GPUs, the client may use the command:

      void MulticastCopyImageSubDataNV(
          uint srcGpu, bitfield dstGpuMask,
          uint srcName, enum srcTarget,
          int srcLevel,
          int srcX, int srcY, int srcZ,
          uint dstName, enum dstTarget,
          int dstLevel,
          int dstX, int dstY, int dstZ,
          sizei srcWidth, sizei srcHeight, sizei srcDepth);

    This command operates equivalently to CopyImageSubData, except that
    it takes a source GPU and a destination GPU set defined by <srcGpu>
    and <dstGpuMask> (respectively). Texel data is copied from the
    source GPU to all destination GPUs.

    The following errors apply to MulticastCopyImageSubDataNV:

    INVALID_ENUM is generated
    * if either <srcTarget> or <dstTarget>
       - is not RENDERBUFFER or a valid non-proxy texture target,
       - is TEXTURE_BUFFER, or
       - is one of the cubemap face selectors described in table 3.17,
    * if the target does not match the type of the object.

    INVALID_OPERATION is generated
    * if either object is a texture and the texture is not complete,
    * if the source and destination formats are not compatible,
    * if the source and destination number of samples do not match,
    * if one image is compressed and the other is uncompressed and the
      block size of the compressed image is not equal to the texel size
      of the uncompressed image.

    INVALID_VALUE is generated
    * if <srcGpu> is greater than or equal to MULTICAST_GPUS_NV,
    * if <dstGpuMask> is zero,
    * if <dstGpuMask> is greater than or equal to 2^n, where n is equal
      to MULTICAST_GPUS_NV,
    * if either <srcName> or <dstName> does not correspond to a valid
      renderbuffer or texture object according to the corresponding
      target parameter, or
    * if the specified level is not a valid level for the image, or
    * if the dimensions of either subregion exceed the boundaries of
      the corresponding image object, or
    * if the image format is compressed and the dimensions of the
      subregion fail to meet the alignment constraints of the format.

    To copy pixel values from one GPU to another use the following
    command:

      void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                      int srcX0, int srcY0,
                                      int srcX1, int srcY1,
                                      int dstX0, int dstY0,
                                      int dstX1, int dstY1,
                                      bitfield mask, enum filter);

    This command operates equivalently to BlitNamedFramebuffer except
    that it takes a source GPU and a destination GPU defined by <srcGpu>
    and <dstGpu> (respectively). Pixel values are copied from the read
    framebuffer on the source GPU to the draw framebuffer on the
    destination GPU.

    In addition to the errors generated by BlitNamedFramebuffer (see
    listing starting on page 634), calling MulticastBlitFramebufferNV
    will generate INVALID_VALUE if <srcGpu> or <dstGpu> is greater than
    or equal to MULTICAST_GPUS_NV.

    20.3.2 Per-GPU Sample Locations

    Programmable sample locations can be customized for each GPU and
    framebuffer using the following command:

      void MulticastFramebufferSampleLocationsfvNV(uint gpu,
                                                   uint framebuffer,
                                                   uint start,
                                                   sizei count,
                                                   const float *v);

    An INVALID_OPERATION error is generated by
    MulticastFramebufferSampleLocationsfvNV if <framebuffer> is not the
    name of an existing framebuffer object.

    INVALID_VALUE is generated if the sum of <start> and <count> is
    greater than PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB.

    An INVALID_VALUE error is generated if <gpu> is greater than or
    equal to MULTICAST_GPUS_NV.

    This is equivalent to FramebufferSampleLocationsfvARB except that it
    sets MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV at the appropriate
    offset for the specified GPU. Just as with
    FramebufferSampleLocationsfvARB,
    FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB must be enabled for
    these sample locations to take effect.

    FramebufferSampleLocationsfvARB and
    NamedFramebufferSampleLocationsfvARB also set
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV but for the specified
    sample across all multicast GPUs. If <gpu> is 0,
    MulticastFramebufferSampleLocationsfvNV updates
    PROGRAMMABLE_SAMPLE_LOCATION_ARB in addition to
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.

    The programmed sample locations can be retrieved using
    GetMultisamplefv with <pname> set to
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and indices calculated as
    follows:

      index_x = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB
                + 2 * sample_i;
      index_y = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB
                + 2 * sample_i + 1;

    20.4 Interactions with Other Copy Functions

    Many existing commands can be used to copy between resources with
    GPU-shared, per-GPU or undefined storage. For example: ReadPixels,
    GetBufferSubData or TexImage2D with a pixel unpack buffer. The
    following table defines how the storage of the resource influences
    the behavior of these copies.

    Table 20.1 Behavior of Copy Commands with Multi-GPU Storage

    Source      Destination  Behavior
    ----------  -----------  -------------------------------------------
    GPU-shared  GPU-shared   There is just one source and one
                             destination. Copy from source to
                             destination.
    GPU-shared  per-GPU      There is a single source. Copy it to the
                             destination on all GPUs.
    GPU-shared  undefined    Either of the above behaviors for a
                             GPU-shared source may apply.
    per-GPU     GPU-shared   Copy from the GPU with the lowest index set
                             in RENDER_GPU_MASK_NV to the shared
                             destination.
    per-GPU     per-GPU      Implementations are encouraged to copy from
                             source to destination separately on each
                             GPU. This is not required. If and when this
                             is not feasible, the copy should source
                             from the GPU with the lowest index set in
                             RENDER_GPU_MASK_NV.
    per-GPU     undefined    Either of the above behaviors for a per-GPU
                             source may apply.
    undefined   GPU-shared   Either of the above behaviors for a
                             GPU-shared destination may apply.
    undefined   per-GPU      Either of the above behaviors for a per-GPU
                             destination may apply.
    undefined   undefined    Any of the above behaviors may apply.

    20.5 Multi-GPU Synchronization

    MulticastCopyImageSubDataNV and MulticastCopyBufferSubDataNV each
    provide implicit synchronization with previous work on the source
    GPU. MulticastBlitFramebufferNV is different, providing implicit
    synchronization with previous work on the destination GPU. In both
    cases, synchronization of the copies can be achieved with calls to
    the barrier command:

      void MulticastBarrierNV(void);

    This is called to block all GPUs until all previous commands have
    been completed by all GPUs, and all writes have landed. To guarantee
    consistency, synchronization must be placed between any two accesses
    by multiple GPUs to the same memory when at least one of the
    accesses is a write. This includes accesses to both the source and
    the destination.
    The safest approach is to call MulticastBarrierNV immediately before
    and after each copy that involves multiple GPUs.

    GPU writes and reads to/from GPU-shared locations require
    synchronization as well. GPU writes such as transform feedback,
    shader image store, CopyTexImage and CopyBufferSubData are not
    automatically synchronized with writes by other GPUs. Neither are
    GPU reads such as texture fetches, shader image loads, CopyTexImage,
    etc. synchronized with writes by other GPUs. Existing barriers such
    as TextureBarrier and MemoryBarrier only provide consistency
    guarantees for rendering, writes and reads on a single GPU.

    In some cases it may be desirable to have one or more GPUs wait for
    an operation to complete on another GPU without synchronizing all
    GPUs with MulticastBarrierNV. This can be performed with the
    following command:

      void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

    INVALID_VALUE is generated
    * if <signalGpu> is greater than or equal to MULTICAST_GPUS_NV,
    * if <waitGpuMask> is zero,
    * if <waitGpuMask> is greater than or equal to 2^n, where n is equal
      to MULTICAST_GPUS_NV, or
    * if <signalGpu> is present in <waitGpuMask>.

    MulticastWaitSyncNV provides the same consistency guarantees as
    MulticastBarrierNV but only between the GPUs specified by
    <signalGpu> and <waitGpuMask> in a single direction. It forces the
    GPUs specified by <waitGpuMask> to wait until the GPU specified by
    <signalGpu> has completed all previous commands and writes
    associated with those commands.

    20.6 Multi-GPU Queries

    Queries are performed across all multicast GPUs. Each query object
    stores independent result values for each GPU.
    The result value for a specific GPU can be queried using one of the
    following commands:

      void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname,
                                       int *params);
      void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname,
                                        uint *params);
      void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname,
                                         int64 *params);
      void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname,
                                          uint64 *params);

    The behavior of these commands matches the GetQueryObject*
    equivalent commands, except they return the result value for the
    specified GPU. A query may be available on one GPU but not on
    another, so it may be necessary to check QUERY_RESULT_AVAILABLE for
    each GPU. GetQueryObject* return query results and availability for
    GPU 0 only.

    In addition to the errors generated by GetQueryObject* (see the
    listing in section 4.2 on page 49), calling MulticastGetQueryObject*
    will generate INVALID_VALUE if <gpu> is greater than or equal to
    MULTICAST_GPUS_NV.

Additions to Chapter 8 of the OpenGL 4.5 (Compatibility Profile)
Specification (Textures and Samplers)

    Modify Section 8.10 (Texture Parameters)

    Insert the following paragraph before Table 8.25 (Texture parameters
    and their values):

    If <pname> is PER_GPU_STORAGE_NV, then the state is stored in the
    texture, but only takes effect the next time storage is allocated
    for a texture using TexImage*, TexStorage* or TextureStorage*. If
    the value of TEXTURE_IMMUTABLE_FORMAT is TRUE, then
    PER_GPU_STORAGE_NV cannot be changed and an error is generated.

    Additions to Table 8.26 Texture parameters and their values

      Name                Type     Legal values
      ------------------  -------  ------------
      PER_GPU_STORAGE_NV  boolean  TRUE, FALSE

Additions to Chapter 10 of the OpenGL 4.5 (Compatibility Profile)
Specification (Vertex Specification and Drawing Commands)

    Modify Section 10.9 (Conditional Rendering)

    Replace the following text:

      If the result (SAMPLES_PASSED) of the query is zero, or if the
      result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is
      FALSE, all rendering commands described in section 2.4 are
      discarded and have no effect when issued between
      BeginConditionalRender and the corresponding EndConditionalRender

    with this text:

      For each active render GPU, if the result (SAMPLES_PASSED) of the
      query on that GPU is zero, or if the result (ANY_SAMPLES_PASSED or
      ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE, all rendering commands
      described in section 2.4 are discarded by this GPU and have no
      effect when issued between BeginConditionalRender and the
      corresponding EndConditionalRender

    Similarly replace the following:

      If the result (SAMPLES_PASSED) of the query is non-zero, or if the
      result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is
      TRUE, such commands are not discarded.

    with this:

      For each active render GPU, if the result (SAMPLES_PASSED) of the
      query on that GPU is non-zero, or if the result
      (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is TRUE,
      such commands are not discarded.

    Finally, replace all instances of "the GL" with "each active render
    GPU".

Additions to Chapter 14 of the OpenGL 4.5 (Compatibility Profile)
Specification (Fixed-Function Primitive Assembly and Rasterization)

    Modify Section 14.3.1 (Multisampling)

    Replace the following text:

      The location for sample <i> is taken from v[2*(i-start)] and
      v[2*(i-start)+1].

    with the following:

      These commands set the sample locations for all multicast GPUs in
      MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV.
      The location for sample <i> on gpu <g> is taken from
      v[g*N+2*(i-start)] and v[g*N+2*(i-start)+1].

    Replace the following error generated by GetMultisamplefv:

      An INVALID_ENUM error is generated if <pname> is not
      SAMPLE_LOCATION_ARB or PROGRAMMABLE_SAMPLE_LOCATION_ARB.

    with the following:

      An INVALID_ENUM error is generated if <pname> is not
      SAMPLE_LOCATION_ARB, PROGRAMMABLE_SAMPLE_LOCATION_ARB or
      MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.

    Add the following to the list of errors generated by
    GetMultisamplefv:

      An INVALID_VALUE error is generated if <pname> is
      MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and <index> is greater
      than or equal to the value of
      PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB multiplied by the
      value of MULTICAST_GPUS_NV.

    Replace the following pseudocode (in both locations):

      float *table = FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB;
      sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);

    with the following:

      float *table =
          MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV;
      table += PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB * gpu;
      sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);

Additions to the WGL/GLX/EGL/AGL Specifications

    None

Dependencies on ARB_sample_locations

    If ARB_sample_locations is not supported, section 20.3.2 and any
    references to MulticastFramebufferSampleLocationsfvNV and
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV should be removed. The
    modifications to Section 14.3.1 (Multisampling) should also be
    removed.

Dependencies on ARB_sparse_buffer

    If ARB_sparse_buffer is not supported, any reference to
    SPARSE_STORAGE_BIT_ARB should be removed.

Interactions with EXT_bindable_uniform

    When using the functionality of EXT_bindable_uniform and a per-GPU
    storage buffer is bound to a bindable location in a program object,
    client uniform updates apply to all GPUs.

    An INVALID_OPERATION error is generated if a buffer with
    PER_GPU_STORAGE_BIT_NV is bound to a program object's bindable
    location and GetUniformfv, GetUniformiv, GetUniformuiv or
    GetUniformdv is called.
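
The index arithmetic given in section 20.3.2 for retrieving per-GPU sample locations with GetMultisamplefv can be sketched as a pair of small C functions. This is a minimal illustration that copies the formulas from section 20.3.2 as written. The function names are hypothetical, and `table_size` is assumed to hold the value the application queried for PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the x coordinate of sample sample_i
 * on GPU gpu, following
 *   index_x = gpu * TABLE_SIZE + 2 * sample_i
 * where table_size stands for the queried value of
 * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB (an assumption here). */
static size_t sample_location_index_x(size_t gpu, size_t table_size,
                                      size_t sample_i)
{
    assert(table_size > 0);
    return gpu * table_size + 2 * sample_i;
}

/* Hypothetical helper: index of the matching y coordinate,
 *   index_y = gpu * TABLE_SIZE + 2 * sample_i + 1 */
static size_t sample_location_index_y(size_t gpu, size_t table_size,
                                      size_t sample_i)
{
    return sample_location_index_x(gpu, table_size, sample_i) + 1;
}
```

With a table size of 16, for example, sample 3 on GPU 1 maps to indices 22 and 23, so each GPU's locations follow the previous GPU's block.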

Errors

    Relaxation of INVALID_ENUM errors
    ---------------------------------
    GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev
    now accept new tokens as described in the "New Tokens" section.

New State

    Additions to Table 23.4 Rasterization

                                     Initial
    Get Value           Type  Get Command  Value  Description              Sec.  Attribute
    ------------------  ----  -----------  -----  -----------------------  ----  ---------
    RENDER_GPU_MASK_NV  Z+    GetIntegerv  *      Mask of GPUs that have   20.1  -
                                                  writes enabled

    * See section 20.1

    Additions to Table 23.19 Textures (state per texture object)

                                               Initial
    Get Value           Type  Get Command      Value    Description                Sec.
    ------------------  ----  ---------------  -------  -------------------------  ----
    PER_GPU_STORAGE_NV  B     GetTexParameter  FALSE    Per-GPU storage requested  20.3

    Additions to Table 23.30 Framebuffer (state per framebuffer object)

    Get Value                                  Get Command       Type  Initial Value  Description          Sec.    Attribute
    -----------------------------------------  ----------------  ----  -------------  -------------------  ------  ---------
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV  GetMultisamplefv  *     (0.5,0.5)      Programmable sample  20.3.2  -

    * The type here is "2* x n x 2 x R[0,1]" which is equivalent to
      PROGRAMMABLE_SAMPLE_LOCATION_ARB but with sample locations for
      all multicast GPUs (one after the other).

New Implementation Dependent State

    Add to Table 23.82, Implementation-Dependent Values, p. 784

                                     Minimum
    Get Value          Type  Get Command  Value    Description            Sec.  Attribute
    -----------------  ----  -----------  -------  ---------------------  ----  ---------
    MULTICAST_GPUS_NV  Z+    GetIntegerv  1        Number of linked GPUs  20.0  -
                                                   usable for multicast

Backwards Compatibility

    This extension replaces NVX_linked_gpu_multicast. The enumerant
    values for MULTICAST_GPUS_NV and PER_GPU_STORAGE_BIT_NV match those
    of MAX_LGPU_GPUS_NVX and LGPU_SEPARATE_STORAGE_BIT_NVX
    (respectively). MulticastBufferSubDataNV,
    MulticastCopyImageSubDataNV and MulticastBarrierNV behave
    analogously to LGPUNamedBufferSubDataNVX, LGPUCopyImageSubDataNVX
    and LGPUInterlockNVX (respectively).

Sample Code

    Binocular stereo rendering example using NV_gpu_multicast with
    single GPU fallback:

    struct ViewData {
        GLint viewport_index;
        GLfloat mvp[16];
        GLfloat modelview[16];
    };
    ViewData leftViewData = { 0, {...}, {...} };
    ViewData rightViewData = { 1, {...}, {...} };

    // Size of the per-view uniform block uploaded below
    const GLsizeiptr size = sizeof(ViewData);

    GLuint ubo[2];
    glCreateBuffers(2, &ubo[0]);

    if (has_NV_gpu_multicast) {
        // One buffer with per-GPU storage: GPU 0 gets the left view,
        // GPU 1 gets the right view
        glNamedBufferStorage(ubo[0], size, NULL,
            GL_PER_GPU_STORAGE_BIT_NV | GL_DYNAMIC_STORAGE_BIT);
        glMulticastBufferSubDataNV(0x1, ubo[0], 0, size, &leftViewData);
        glMulticastBufferSubDataNV(0x2, ubo[0], 0, size, &rightViewData);
    } else {
        glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
        glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
    }

    glViewportIndexedf(0, 0, 0, 640, 480);    // left viewport
    glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport
    // Vertex shader sets gl_ViewportIndex according to viewport_index
    // in UBO

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    if (has_NV_gpu_multicast) {
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
        drawScene();
        // Make GPU 1 wait for glClear above to complete on GPU 0
        glMulticastWaitSyncNV(0, 0x2);
        // Copy right viewport from GPU 1 to GPU 0
        glMulticastCopyImageSubDataNV(1, 0x1,
                                      renderBuffer, GL_RENDERBUFFER,
                                      0, 640, 0, 0,
                                      renderBuffer, GL_RENDERBUFFER,
                                      0, 640, 0, 0,
                                      640, 480, 1);
        // Make GPU 0 wait for GPU 1 copy to GPU 0
        glMulticastWaitSyncNV(1, 0x1);
    } else {
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
        drawScene();
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
        drawScene();
    }
    // Both viewports are now present in GPU 0's renderbuffer

Issues

    (1) Should we provide explicit inter-GPU synchronization API? Will
        this make the implementation easier or harder for the driver
        and applications?

    RESOLVED. Yes. A naive implementation of implicit synchronization
    would simply synchronize the GPUs before and after each copy.
    Smart implicit synchronization would have to track all APIs that can
    modify buffers and textures, creating an excessive burden for driver
    implementation and maintenance. An application can track
    dependencies more easily and outperform a naive driver
    implementation using explicit synchronization.

    (2) How does this extension interact with queries (e.g. occlusion
        queries)?

    RESOLVED. Queries are performed separately on each GPU. The standard
    GetQueryObject* APIs return query results for GPU 0 only. However
    GetQueryBufferObject* can be used to retrieve query results for all
    GPUs through a buffer with separate storage
    (PER_GPU_STORAGE_BIT_NV).

    (3) Are copy operations controlled by the render mask?

    RESOLVED. Copies which write to the framebuffer are considered
    render commands and implicitly controlled by the render mask. Copies
    between textures and buffers are not considered render commands so
    they are not influenced by the mask. If masked copies are desired,
    use MulticastCopyImageSubDataNV, MulticastCopyBufferSubDataNV or
    MulticastBlitFramebufferNV. These commands explicitly specify the
    GPU source and destination and are not influenced by the render
    mask.

    (4) What happens if the MulticastCopyBufferSubDataNV source and
        destination buffer is the same?

    RESOLVED. When the source and destination involve the same GPU,
    MulticastCopyBufferSubDataNV matches the behavior of
    CopyBufferSubData: overlapped copies are not allowed and an
    INVALID_VALUE error results. When the source and destination do not
    involve the same GPU, overlapping copies are allowed and no error is
    generated.

    (5) How does this extension interact with CopyTexImage2D?

    RESOLVED. The behavior depends on the storage type of the target.
    See section 20.4. Since CopyTexImage* sources from the framebuffer,
    the source always has per-GPU storage.

    (6) Should we provide a mechanism to modify viewports independently
        for each GPU?

    RESOLVED. No. This can be achieved using multicast UBOs and
    ARB_shader_viewport_layer_array.
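
The overlap rule resolved in issue (4), which is also listed among the MulticastCopyBufferSubDataNV errors in section 20.2, can be sketched as a CPU-side predicate. This is only an illustration under stated assumptions: the function name and parameter types are hypothetical, offsets and size are assumed to be already validated as non-negative, and buffer identity is passed in as a flag instead of comparing real buffer names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true when a copy would hit the
 * INVALID_VALUE overlap error described in issue (4), i.e. when
 *   - source and destination are the same buffer object,
 *   - the read GPU is also one of the write GPUs, and
 *   - [read_offset, read_offset + size) overlaps
 *     [write_offset, write_offset + size). */
static bool multicast_copy_overlap_error(bool same_buffer,
                                         unsigned read_gpu,
                                         uint32_t write_gpu_mask,
                                         int64_t read_offset,
                                         int64_t write_offset,
                                         int64_t size)
{
    assert(read_gpu < 32);
    assert(read_offset >= 0 && write_offset >= 0 && size >= 0);

    bool read_gpu_is_written = ((write_gpu_mask >> read_gpu) & 1u) != 0;

    /* Half-open ranges overlap when each starts before the other ends. */
    bool ranges_overlap = read_offset < write_offset + size &&
                          write_offset < read_offset + size;

    return same_buffer && read_gpu_is_written && ranges_overlap;
}
```

Copying within one buffer from GPU 0 to GPU 1 (mask 0x2) is therefore allowed even when the byte ranges coincide, while the same copy with mask 0x1 or 0x3 would be rejected.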

    (7) Should we add a present API that automatically displays content
        from a specific GPU? It could abstract the transport mechanism,
        copying when necessary.

    RESOLVED. No. Transfers should be avoided to maximize performance
    and minimize latency. Minimizing transfers requires application
    awareness of display connectivity to assign rendering appropriately.
    Hiding transfers behind an API would also prevent some interesting
    multi-GPU rendering techniques (e.g. checkerboard-style split
    rendering).

    WGL_NV_bridged_display can be used to enable display from multiple
    GPUs without copies.

    (8) Should we expose the extension on single-GPU configurations?

    RESOLVED. Yes, this is recommended. It allows more code sharing
    between multi-GPU and single-GPU code paths. If there is only one
    GPU present, MULTICAST_GPUS_NV will be 1. It may also be 1 if
    explicit GPU control is unavailable (e.g. if the active multi-GPU
    rendering mode prevents it). Note that in revisions 5 and prior of
    this extension the minimum for MULTICAST_GPUS_NV was 2.

    (9) Should glGet*BufferParameter* return the PER_GPU_STORAGE_BIT_NV
        bit when BUFFER_STORAGE_FLAGS is queried?

    RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter
    input to *BufferStorage, as specified in table 6.3.

    (10) Can a query be complete/available on one GPU and not another?

    RESOLVED. Yes. Independent query completion is important for
    conditional rendering. It allows each GPU to begin conditional
    rendering in mode QUERY_WAIT without waiting on other GPUs.

    (11) How can custom texel data be uploaded to each GPU for a given
         texture?

    The easiest way is to create staging textures with the custom texel
    data and then copy it to a texture with per-GPU storage using
    MulticastCopyImageSubDataNV.

    (12) Should we allow the waitGpuMask in MulticastWaitSyncNV to
         include the signal GPU?

    RESOLVED. No. There is no reason for a GPU to wait on itself. This
    is effectively a no-op in the command stream.
    Furthermore it is easy to confuse GPU indices and masks, so it is
    beneficial to explicitly generate an error in this case.

    (13) Will support for NVX_linked_gpu_multicast continue?

    RESOLVED. NVX_linked_gpu_multicast is deprecated and applications
    should switch to NV_gpu_multicast. However, implementations are
    encouraged to continue supporting NVX_linked_gpu_multicast for
    backwards compatibility.

    (14) Does RenderGpuMaskNV work with immediate mode rendering?

    RESOLVED. Yes, the render GPU mask applies to immediate mode
    rendering the same as other rendering. Note that RenderGpuMaskNV is
    not one of the commands allowed between Begin and End (see section
    10.7.5) so the render mask must be set before Begin is called.

Revision History

    Rev.  Date      Author    Changes
    ----  --------  --------  -----------------------------------------
     7    04/02/19  jschnarr  clarify that the interactions with uniform
                              APIs only apply to EXT_bindable_uniform
                              (not ARB_uniform_buffer_object).
                              optionally allow
                              MulticastCopyBufferSubDataNV with buffers
                              lacking per-GPU storage
     6    01/03/19  jschnarr  reduce MULTICAST_GPUS_NV minimum to 1
                              clarify that MULTICAST_GPUS_NV is constant
                              for a context
     5    10/07/16  jschnarr  trivial typo fix
     4    07/21/16  mjk       registered
     3    06/15/16  jschnarr  R370 release