State information in a processor is managed using a lookup table that has multiple memory circuits, each with multiple entries. Items of state information belonging to a current state version are stored in a first group of entries in the memory circuits. To create an updated state version, a virtual copy of each of the items of state information is created in a second group of entries in the memory circuits, and the virtual copy of the item being updated is replaced with a real copy of the item from the first group of entries. The item in the first group of entries is then updated.
Parallel Copying Scheme For Creating Multiple Versions Of State Information
State information in a processor is managed using a lookup table that has multiple memory circuits, each with multiple entries. Items of state information belonging to a first state version are stored in a first group of the entries, with each entry in the first group being in a different one of the memory circuits. To create an updated state version, the items of state information are copied in parallel from the first group of entries to a second group of entries, with each entry in the second group is in a different one of the memory circuits. The copy in the second group of the item being updated is then replaced with the updated value.
Multithreaded Simd Parallel Processor With Loading Of Groups Of Threads
In a multithreaded processing core, groups of threads are executed using single instruction, multiple data (SIMD) parallelism by a set of parallel processing engines. Input data defining objects to be processed received as a stream of input data blocks, and the input data blocks are loaded into a local register file in the core such that all of the data for one of the input objects is accessible to one of the processing engines. The input data can be loaded directly into the local register file, or the data can be accumulated in a buffer and loaded after accumulation, for instance during a launch operation for a SIMD group. Shared input data can also be loaded into a shared memory in the processing core.
On-The-Fly Reordering Of Multi-Cycle Data Transfers
A system of processing data in a graphics processing unit having a core configured to process data in hexadecimal form and other graphics modules configured to process data in quads includes a transpose buffer with a crossbar to reorganize incoming data, several memory banks to store the reorganized data over a period of several clock cycles, and a second crossbar for reorganizing the stored data after it is read from the bank of memories in one clock cycle. The method for converting between data in hexadecimal form and data in quads includes providing data in hexadecimal form, reorganizing the data provided in hexadecimal form, storing the reorganized data in several memories, and reading several of the memory locations, which contain all of the elements of the quad, in one clock cycle.
Systems and methods for converting graphics data represented in a hexadecimal form into a quad form may be used to reorganize the graphics data for performing raster operations. Prior to performing raster operations the graphics data received for each component is assembled to interleave the components for each pixel as needed to perform the raster operations. The assembly process varies depending on the number of bits per component, the number of components to be processed, and the memory format of the render target used to store the processed graphics data.
Apparatus, System, And Method For Coalescing Parallel Memory Requests
Bryon S. Nordquist - Santa Clara CA, US Stephen D. Lew - Sunnyvale CA, US
Assignee:
Nvidia Corporation - Santa Clara CA
International Classification:
G06F 15/16 G06F 15/80 G06F 13/00
US Classification:
345502, 345531
Abstract:
A multiprocessor system executes parallel threads. A controller receives memory requests from the parallel threads and coalesces the memory requests to improve memory transfer efficiency.
On-The-Fly Reordering Of 32-Bit Per Component Texture Images In A Multi-Cycle Data Transfer
A system of processing data in a graphics processing unit having a core configured to process data in hexadecimal form and other graphics modules configured to process data in quads includes a transpose buffer with a crossbar to reorganize incoming data, several memory banks to store the reorganized data over a period of several clock cycles, and a second crossbar for reorganizing the stored data after it is read from the bank of memories in one clock cycle. The method for converting between data in hexadecimal form and data in quads includes providing data in hexadecimal form, reorganizing the data provided in hexadecimal form, storing the reorganized data in several memories, and reading several of the memory locations, which contain all of the elements of the quad, in one clock cycle.
Off-Chip Out Of Order Memory Allocation For A Unified Shader
Systems and methods for dynamically allocating memory for thread processing may reduce memory requirements while maintaining thread processing parallelism. A memory pool is allocated to store data for processing multiple threads that does not need to be large enough to dedicate a fixed size portion of the memory pool to each thread that may be processed in parallel. Fixed size portions of the memory pool are dynamically allocated and deallocated to each processing thread. Different fixed size portions may be used for different types of threads to allow greater thread parallelism compared with a system that requires allocating a single fixed portion of the memory pool to each thread. The memory pool may be shared between all of the thread types or divided to provide separate memory pools dedicated to each particular thread type.