The GTX 970’s Memory Explained & Tested
Date: January 27, 2015
On forums near and far, users have reported memory allocation issues on NVIDIA’s GTX 970. Much of this centered on the fact that certain applications showed the GTX 970 utilizing just 3.5GB of its supposed 4GB of memory, even though the GTX 980 and other cards showed their full memory layout as being accessible. There were further reports that once the 3.5GB threshold was surpassed, the GTX 970 suddenly exhibited a drastic loss of performance. It looked suspiciously like NVIDIA’s price / performance darling wasn’t able to physically communicate with its advertised memory allotment and, if communication was taking place, that bandwidth was somehow truncated.
Naturally, this sparked a large number of theories regarding the Maxwell architecture, its abilities and how NVIDIA has allocated resources on their $350 graphics card. NVIDIA themselves have now stepped in, trying to set the record straight. What follows is a simplified version of our technical briefing with them alongside some basic benchmarks.
Let’s start with the thousand pound gorilla in the room: the GM204 core as it’s utilized in the GTX 970. In the first image above you will see the basic core layout as NVIDIA originally described it in their documentation and during their briefings to reviewers. There is a trio of SMMs disabled (these can be located within any one of the GPCs) which effectively reduces the number of CUDA cores, texture units, L1 cache and a number of other on-chip resources. However, at the time, the back-end resources seemed to have remained in place with four 64-bit memory controllers, 64 ROPs and 2048KB of L2 cache. We now know this wasn’t the case.
In what was deemed an error in their documentation due to a miscommunication between the engineering and technical PR teams, some fairly significant information was left on the cutting room floor. Instead of utilizing a full 64 ROPs as was originally believed, the GTX 970 only has 56 enabled while the L2 Cache has seen a 256KB cut to 1792KB. Additionally, the cut-down GM204 core handles its memory in a somewhat unique fashion which could explain why so many users are seeing utilization below the 4GB mark. So what happened here? This is where the story, as it was told to us, gets interesting.
NVIDIA’s previous Kepler architecture scaled downwards in a very linear fashion, which meant that when one portion of a chip was disabled, associated functions had to be disabled as well in an effort to retain a balanced design. For example, if a single 64-bit memory controller was kicked to the curb when creating a new part, so too were its associated ROP partition and dedicated L2 cache. This scaling is done in all modern GPU architectures in order to optimize yields, with the cores that cannot operate at full capacity being rolled into lower-end SKUs through judicious modifications to their available resources.
Due to the Maxwell architecture’s unique layout, going down this same route would have resulted in a significant performance delta between the GTX 970 and GTX 980 since massive parts of the former’s core would have been disabled en masse. With four memory controllers and just a quartet of ROP groupings, there really wasn’t much leeway for disabling elements before the GTX 970 simply became uncompetitive in its segment.
What we didn’t know until our briefing was exactly how NVIDIA created the GTX 970’s core. In order to maximize performance potential, their engineers gave GM204 more scaling granularity, which allows for the partial disabling of certain interfaces without affecting the chip’s communication hierarchy. As a result, it could retain a 256-bit memory interface alongside 4GB of GDDR5 memory, but it features several changes in how those elements handle communication over its pool of shared resources, or Crossbar.
To provide sufficient bandwidth throughout the chip, the GM204’s 16 Streaming Multiprocessors each have a dedicated pipeline to the Crossbar. The information is then passed towards the secondary processing stages via eight high-bandwidth ports, each of which has access to eight associated ROPs, 256KB of L2 cache and a single 32-bit memory controller (two of which make up each of the four 64-bit controllers mentioned earlier). In the GTX 970, one of these ports has been disabled along with its ROPs and cache while the memory controller and its companion 512MB of DRAM remain as a separate entity. This results in a dual-partition memory structure consisting of a primary 3.5GB segment and a companion 512MB segment.
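The per-port figures above can be checked against the corrected ROP and L2 totals with a quick bit of arithmetic. The snippet below is only an illustrative sketch: the constant and function names are our own, and the per-port values are the ones quoted in this article.

```python
# Per-port resources on a fully enabled GM204, as quoted in the article.
PORTS = 8
ROPS_PER_PORT = 8
L2_KB_PER_PORT = 256


def back_end(enabled_ports):
    """Return (ROP count, L2 cache in KB) for a given number of enabled ports."""
    return enabled_ports * ROPS_PER_PORT, enabled_ports * L2_KB_PER_PORT


gtx_980 = back_end(PORTS)      # all eight ports enabled
gtx_970 = back_end(PORTS - 1)  # one port's ROPs and L2 disabled

print("GTX 980 (ROPs, L2 KB):", gtx_980)
print("GTX 970 (ROPs, L2 KB):", gtx_970)
```

Disabling a single port accounts for both corrections at once: eight fewer ROPs and 256KB less L2 cache.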
In a typical scenario this lonesome memory controller / DRAM duo would have to be disabled as well since without an L2 cache partition it would have no way to communicate with the Crossbar. Instead, NVIDIA applied a so-called “buddy interface” which effectively reroutes the extra DRAM’s communications so an existing cache module can take over and recognize the full 4GB DRAM allocation and 256-bit memory interface.
For lack of a better explanation, on a fully enabled GM204 core the memory ports are accessed in a sequential order in a 1KB stride after which the process repeats itself in a relatively straightforward manner. As the resources associated with the first port are left to finish their task, the workloads move on to the subsequent port, use that and then continue on following a cyclical port-forward 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3… routine. In an optimal situation, once that first port is called upon again, it is free and ready to process new information.
On the GTX 970, part of the resources (the ROPs and L2 cache) normally associated with the eighth port simply aren’t there. So what would have happened if NVIDIA had gone with a full 4GB partition? With calls being rerouted through the buddy interface to an already-used L2 section, there was a very real possibility that a single 256KB cache partition (which in this case is handling the information for two memory controllers instead of one) would create a bottleneck and slow down the whole memory interface.
Think of this situation in terms of the numbers we’ve written above. In an optimal scenario the order would be 1 through 8, after which it repeats through all eight ports and addresses all 4GB of memory. With a typical striding process, the GTX 970 would have hit a situation where the cycle is broken and looks like 1, 2, 3, 4, 5, 6, 7, 7, 1, 2, 3 and so on.
With that seventh port handling double the requests via the buddy interface and being hit twice as often, the remainder of the strides would have to wait around for it to complete scheduled tasks. In effect, the entire affair could theoretically operate at half speed, dragging down the memory processing pipeline in the process.
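To make the striding pattern concrete, here is a minimal sketch of how a byte address could map onto a port with a 1KB stride. The function names are hypothetical, and the "naive" GTX 970 mapping is the single-partition design NVIDIA avoided, not how the shipping card behaves.

```python
from collections import Counter

STRIDE_BYTES = 1024  # 1KB stride, as described in the article


def port_full(address, ports=8):
    """Round-robin port (1-based) for a byte address on a fully enabled GM204."""
    return (address // STRIDE_BYTES) % ports + 1


def port_naive_970(address):
    """Hypothetical single-partition GTX 970: port 8's traffic lands on port 7."""
    port = port_full(address)
    return 7 if port == 8 else port


addresses = [i * STRIDE_BYTES for i in range(16)]  # two full passes over the ports
full_order = [port_full(a) for a in addresses]
naive_order = [port_naive_970(a) for a in addresses]
naive_load = Counter(naive_order)  # how many strides each port has to service

print(full_order)
print(naive_order)
print(naive_load)
```

In the naive mapping the seventh port services twice as many strides as any other port, which is the imbalance that would force the rest of the cycle to wait.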
To avoid completely bogging down the GTX 970’s bandwidth potential on that seventh and eighth stride, NVIDIA split the memory into two different partitions: the lower segment houses seven 512MB DRAM modules for a total of 3584MB while the higher one has a single associative 512MB IC which is good for 28GB/s of peak bandwidth. For those keeping track at home, a full 7/8ths of the theoretical memory throughput is accessible in that primary partition.
Within the 3.5GB section (red above), the strides proceed sequentially while leaving the final 500MB of memory out of the equation. Thus, a game that calls for less than 3.5GB of memory will neatly avoid the buddy interface and final 500MB while still having access to 196GB/s of peak bandwidth. Meanwhile, anything that requires more than 4GB of memory will cause draw calls to run through the system’s slow 19GB/s PCIe interface, completely hobbling performance due to the added latency. Calling for help from the PCIe bus for access to system memory occurs with every architecture when onboard memory reaches the point of saturation.
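The capacity and bandwidth figures for the two partitions follow from splitting the card's totals eight ways. The sketch below assumes the GTX 970's stated 224GB/s peak bandwidth is divided evenly across the eight memory controllers; the variable names are ours.

```python
TOTAL_BANDWIDTH_GBPS = 224  # stated peak bandwidth of the full 256-bit interface
CONTROLLERS = 8
MODULE_MB = 512             # capacity of each DRAM module

per_controller = TOTAL_BANDWIDTH_GBPS / CONTROLLERS  # bandwidth of one controller

fast_bw = per_controller * 7  # primary partition: seven controllers
slow_bw = per_controller * 1  # secondary partition: one controller
fast_mb = 7 * MODULE_MB
slow_mb = 1 * MODULE_MB

print(f"primary:   {fast_mb}MB at {fast_bw:.0f}GB/s")
print(f"secondary: {slow_mb}MB at {slow_bw:.0f}GB/s")
```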
The real kicker is what happens in that grey area when Windows calls for a memory allotment of between 3584MB and 4096MB. In those situations NVIDIA’s drivers are supposed to enable the eighth memory controller, final 512MB of DRAM and the buddy interface (the green “8th stride” above), opening up the full 4GB of on-card memory. However, that final 500MB remains set apart from the larger 3.5GB partition since both cannot be read at the same time despite the thin thread of communication provided by the so-called buddy interface. Since that 500MB partition has a very limited bandwidth of just 28GB/s, if the architecture spends too much time reading from it, the overall effective throughput of the larger 3.5GB segment would be negatively affected as well.
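A rough way to picture that penalty is a simple time-weighted model. This is our own simplification, not something NVIDIA provided: it assumes the two segments are strictly serviced one at a time at their peak rates, and it ignores caching and the read/write interleaving discussed further on.

```python
FAST_GBPS = 196.0  # 3.5GB segment, per the article
SLOW_GBPS = 28.0   # 512MB segment, per the article


def effective_bandwidth(slow_fraction):
    """Time-weighted bandwidth when the two segments are serviced one at a time.

    slow_fraction is the share of transferred bytes that live in the slow segment.
    """
    if not 0.0 <= slow_fraction <= 1.0:
        raise ValueError("slow_fraction must be between 0 and 1")
    # Seconds needed to move 1GB, split between the two segments.
    time_per_gb = (1.0 - slow_fraction) / FAST_GBPS + slow_fraction / SLOW_GBPS
    return 1.0 / time_per_gb


for share in (0.0, 0.01, 0.05, 0.125):
    print(f"{share:.3f} of traffic in slow segment -> {effective_bandwidth(share):.1f} GB/s")
```

Under these assumptions, spreading traffic evenly across all eight controllers (12.5% in the slow segment) lands at roughly 112GB/s, about half the card's rated figure, which lines up with the half-speed concern described earlier. Keeping the slow segment's share of traffic small is what limits the damage.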
According to NVIDIA, there are checks and balances in place to ensure the GPU core never gets hung up waiting for on-die memory resources to complete their scheduled tasks. One of the first lines of defense is a driver algorithm that is supposed to allocate resources effectively and balance loads so draw calls follow the most efficient path and do not prematurely saturate an already-utilized Crossbar port. This means in situations where between 3.5GB and 4GB of memory is required, data that isn’t used as often is directed towards the slower 500MB partition while the faster 3.5GB section can continue along processing quick-access reads and writes.
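NVIDIA did not detail the heuristics involved, so the following is purely a hypothetical sketch of the idea: place the most frequently accessed data in the fast segment first, let colder data fall into the slow segment, and spill to system memory only when both are full. The resource names, sizes and access counts are invented for illustration and do not reflect the actual driver.

```python
FAST_MB, SLOW_MB = 3584, 512  # partition sizes from the article


def place(resources):
    """Greedy placement. resources is a list of (name, size_mb, accesses_per_frame).

    Hypothetical heuristic: hottest data is placed first and prefers the fast segment.
    """
    free = {"fast": FAST_MB, "slow": SLOW_MB}
    placement = {}
    for name, size_mb, _hits in sorted(resources, key=lambda r: r[2], reverse=True):
        for segment in ("fast", "slow"):
            if size_mb <= free[segment]:
                free[segment] -= size_mb
                placement[name] = segment
                break
        else:
            placement[name] = "system"  # neither segment has room: spill over PCIe
    return placement


# Invented workload, for illustration only.
workload = [
    ("render_targets", 1024, 900),
    ("textures", 2400, 400),
    ("shadow_maps", 160, 300),
    ("streaming_pool", 400, 20),
    ("overflow", 256, 5),
]
result = place(workload)
print(result)
```

In this made-up workload the three hottest resources fill the 3.5GB segment exactly, the rarely touched streaming pool ends up in the slow segment, and the last allocation no longer fits on the card at all.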
From an architectural perspective, there’s also extra read and write request bandwidth between the memory controllers and the L2 caching hierarchy.
Another part of this delicate dance includes the interleaving of reads and writes so when one section of memory is processing reads, the other is free to process writes. When combined with the elements we’ve already discussed, the interleaving allows the GTX 970 to deliver its stated 224GB/s of peak bandwidth provided the software layer works as it’s supposed to.
While this unique layout gave NVIDIA the ability to load up the GTX 970 with a full 4GB of memory, the technology in play here certainly isn’t infallible. If the drivers are working the way they’re supposed to, there should only be a few percentage points of difference (5% or less) between a card like the GTX 970 with two memory partitions and one with a single 4GB allocation when both are being used in scenarios which require higher bandwidth. However, much like in other scenarios where software and compatibility play a role in overall performance, we may see results varying from one application to the next. In the worst-case scenarios, only the 3.5GB partition may be recognized or the load balancing algorithm could effectively direct data to the wrong resources.
We’ll get into a few benchmarks and further explain some of the odd behavior users have been experiencing on the next page.
- Some Pertinent Benchmarks & Closing Thoughts