The NVIDIA GeForce GTX 1080 Review

Editor-in-Chief

Author: SKYMTL
Date: May 16, 2016
Product Name: GTX 1080 Founders Edition
Warranty: 3 Years

A week ago we covered the announcement of NVIDIA's GeForce GTX 1080, a card which uses the new Pascal microarchitecture alongside a long list of other performance-enhancing features. What NVIDIA showed was nothing short of a quantum leap forward when compared against other GPU launches from the last seven years. Indeed, the last time anyone saw this kind of generational shift was when Fermi was initially launched to replace the respected 200 series. For an architecture that is supposed to supersede the wildly successful Maxwell, Pascal has a lot to live up to.

While Pascal and its broad range of capabilities have been discussed at length throughout the last year or so, the GTX 1080 and its 7.2 billion transistor GP104 core represent the first time the architecture has been rolled into a gaming-focused product. That's a particularly important distinction since Pascal's initial reveal consisted of the massive compute powerhouse that is the 15 billion transistor P100 core. In the move towards a consumer product there have been some fundamental changes, which we'll go into a bit later, but suffice it to say the GTX 1080 represents a leaner, more efficient and substantially easier-to-produce iteration of Pascal.

Before I get too far into this review, there's a need to discuss how the GTX 1080 breaks from a typical NVIDIA product unveiling. Ever since the GTX 480 there has been a slow but steady improvement of 25% to 35% when shifting from one graphics card generation to another. It happened between the GTX 480 and GTX 580, again between the GTX 580 and GTX 680 and so on right up until now. Even NVIDIA's own documentation tended to compare new architectures to products from two years prior since that tended to be the natural progression for gamers' system updates.

This time the Maxwell to Pascal upgrade path will net users some pretty substantial benefits (a claimed 60% in typical gaming scenarios) but the Kepler to Pascal move promises to be truly incredible with a mind-bending potential 110% performance boost if you’re looking to change out that GTX 780. Essentially, with the GTX 1080 NVIDIA has moved the logical performance yardsticks forward two whole years by acting as a direct competitor to the GTX 980 Ti rather than the GTX 980.

Much of this potential has been realized through the use of TSMC's 16nm FinFET process. While the CPU segment has seen the gradual refinement of Intel's tick / tock cadence of a node shrink followed by a new architecture, the GPU market hasn't seen a new manufacturing process since 28nm first arrived back in late 2011. The benefits of 16nm are pretty straightforward: it allows for more transistors to be packed into a condensed space while also offering notable steps forward in terms of efficiency and clock speeds.

Even though I've discussed the GTX 1080's specifications in our initial announcement coverage, some information was still under NDA until today. Starting at the top, this card has 2560 CUDA cores and 160 texture units, both of which represent substantial enhancements over the GTX 980. One area where the GP104 core seems oddly lacking is the number of ROPs it has access to; at just 64, that's no more than the GTX 980 and a big step back from what was available on the GM200.

Some may wonder how these relatively pedestrian specs lead to a card that is supposed to offer the groundbreaking performance I discussed above. Simply put, there are quite a few architectural enhancements going on behind the scenes (more about those on the next page!) and NVIDIA has been able to hit processing stage frequencies that are just this side of insane. Higher core speeds mean information gets moved through the core at a faster pace, boosting overall performance metrics.

This is where that 16nm process enters into the equation as well. Due to the inherent efficiency it ushers in for NVIDIA’s already-frugal Pascal architecture, there was a massive amount of frequency overhead to play with before the GP104 core hit its TDP target. NVIDIA is also claiming that most, if not all, of the GTX 1080 cards will be able to overclock past the 2GHz mark which points towards the baseline specifications being quite conservative as well.

Likely the largest departure away from what many were expecting is the type of memory NVIDIA decided to use for the GTX 1080. Rather than the HBM modules AMD has saddled their Fiji cards with or the HBM2 used on P100, we’re seeing the first implementation of Micron’s GDDR5X.

Despite all of the attention being given to next generation standards, GDDR5X acts as an excellent bridge solution between GDDR5 and HBM. It offers substantially better power consumption numbers and bandwidth than GDDR5 yet it doesn't come saddled with the costs of HBM. Instead of being tied to first generation HBM's low yields, middling capacity and complicated interposer-based design, NVIDIA looked elsewhere despite the technology's potential upside of massive bandwidth. Through the use of effective color compression algorithms and GDDR5X's extremely fast 10Gbps transfer rate, there's supposedly no worry of bottlenecks from the memory subsystem.

This has all been accomplished while the GTX 1080 retains a TDP of just 180W. Think on this number for a moment; in one swift move, NVIDIA has basically doubled performance while increasing power consumption by about 10%. 180W makes me wonder if we'll see a slightly more efficient version of this card in the notebook space sooner than most expect. Remember, the desktop GTX 980 has been available to system integrators' gaming notebooks for months now and the GTX 1080's metrics are quite similar.

Among all of this excitement NVIDIA also announced the GTX 1070, though they didn't provide much in the way of information about it. Based on pricing alone, we'll assume it utilizes the same GP104 core in a cut-down form, but more details will be available closer to its launch date in June.

All of these aspects combine to make the GTX 1080 the most powerful single GPU graphics processor ever created. It is meant to overpower even the TITAN X while also outpacing pretty much everything AMD currently has. The GTX 1070 won’t be all that far behind since it is expected to trade blows with the TITAN X.

While the performance delta between something like the $999 TITAN X and this new card isn't massive (about 10% to 15%, or more in DX12 scenarios), the ramifications for NVIDIA's entire GPU lineup are far-reaching. We expect the GTX 980 Ti and TITAN X to move quickly towards end-of-life status, though without any substantial price cuts since their relative inventory levels are presently quite low. Meanwhile, depending upon how quickly other Pascal-based cards are released, we may see the GTX 980 and GTX 970 stick around for a little while longer, though at reduced price points.

Understanding the Founders Edition

Among all of the information NVIDIA has given out thus far, there was a bit of confusion among the press and our readers after the so-called Founders Edition was announced. At first we believed it to be a GTX 1080 with additional goodies and perhaps a binned core, or at least something to warrant its $699 price tag. Well, it turns out that the Founders Edition is in fact the artist formerly known as the Reference Card and yes, NVIDIA is treating it as a premium product this time around.

While it may be considered a reference card by another name, there is more to the Founders Edition than the standard plastic heatsink shroud, bargain-basement components and loud fan that used to grace such things. NVIDIA's machined aluminum shroud, side-mounted LED, vapor chamber-based heatsink design and a full backplate are hallmarks of this card. Then again, they've been included in various forms on the GTX 980 Ti, GTX 780 Ti and TITAN cards as well so the inclusions aren't exactly unique either.

Other than those additions, the Founders Edition will also be equipped with an exclusive, streamlined yet highly advanced power distribution system. Unlike previous designs, it is specifically engineered to enhance loadline ripple suppression, thus lowering the amount of power lost between the PSU and GPU core while also enhancing overall efficiency.

But will this be enough to sway buyers into paying a premium? That could prove to be a huge challenge indeed within a market which has been spoon-fed for years with marketing points stating blower-style coolers are inferior to their downdraft cousins.

With the Founders Edition NVIDIA is setting a precedent whereby a standard-clocked GPU with a blower-style cooler commands a premium. However, board partners are expected to sell their own designs alongside NVIDIA's, yet at lower price points. This situation could very well lead to higher performing pre-overclocked products commanding stratospheric prices or buyers looking at anything priced below the Founders Edition as an inferior product. Conversely, if a board partner decides to undercut NVIDIA's $699 price yet offer enhanced components and higher clock speeds, the Founders Edition could be in a spot of trouble.

As such, more than a few questions remain. Do its features make the GTX 1080 a $699 graphics card instead of one costing $599? How many SKUs will actually hit that magical $599 price? With board partners selling what amounts to a "premium" NVIDIA-designed product, how will they upsell their own wares? I personally think NVIDIA is overreaching on this one, but I also have to wonder what this situation will do to the GTX 1080's future pricing structure. We could very well see a few $599 GTX 1080 SKUs launch and then slowly fade into obscurity.

Availability… With a Possible Twist

First and foremost, while today marks the official NDA lift for the GTX 1080, you won't be able to buy one, so we can effectively state that this is NVIDIA's first paper launch in quite a while. Considering the competition is still months away from announcing details of their upcoming architecture, NVIDIA can afford the wait; availability will only come around on May 27th. Unfortunately, there's more to it than what first meets the eye.

I asked above whether or not the Founders Edition's features will convince buyers to shell out $100 more and the answer may be a bitter pill to swallow: come launch day on the 27th, you may not have a choice. Supposedly the Founders Edition may be the only version initially shipping to retailers and while NVIDIA assured us they are "actively working with board partners to insure the $599 price point is met", we will see how quickly that actually happens. Much will be riding on this point alone.

A Closer Look at the GTX 1080 Founders Edition

Regardless of what you think about the intent behind NVIDIA's Founders Edition, there's no denying it is one great looking card. With a machined aluminum heatsink shroud, the GeForce lineup's typical acrylic window and plenty of angles, we can almost understand why it's selling for a premium. However, when placed next to previous cards like the GTX 980 Ti, GTX 780 Ti and various TITAN models, you can see where the inspiration came from. At around 10.75" long it matches other reference models, while the illuminated GeForce logo is identical to its predecessors as well.

Around back, the GTX 1080 has a full-coverage backplate which further helps dissipate heat since there are actually supposed to be thermal pads located on its underside. Here the design takes a pretty interesting departure with linear striations etched into the backplate, giving it a minor texture.

The GTX 1080's rear area between the stiffening plate and its heatsink shroud is left open for additional airflow towards the fan assembly. It's also interesting to see how from some angles this graphics card looks like a stealth fighter.

Perhaps one of the most-talked about features during NVIDIA’s announcement event was the GTX 1080’s miserly power consumption figures. At just 180W it requires just a single 8-pin connector, though custom board partner designs will likely expand upon this for both practical and marketing purposes.

Even though the SLI interface has been upgraded within the architecture, its physical layout hasn’t been changed one iota. Thus, it is backwards compatible with older bridges while also supporting the new rigid HB-SLI variety. We’ll talk about this more in a later section.

The rear I/O area may look identical to previous cards but it has a boatload of new functionality. There are three DisplayPort 1.4-ready outputs and single connectors for HDMI 2.0b and dual link DVI. That means a maximum resolution of 7680×4320 at 60Hz, and 4K HDR is fully supported.

Moving under the heatsink, we see a slightly revised vapor chamber-based design that has been further optimized to improve internal airflow and promote a cooler running core. Even though the actual TDP of this card is quite low, we can't forget that its thermals are packed into a much smaller, denser area. That means its heat is projected in a more directed pattern, increasing the demands upon the heatsink's dissipation properties.

NVIDIA claims the components on this card are specifically designed for optimum efficiency but we can’t speak to the veracity of that claim. However, it looks like there’s a standard digital 5+1 PWM design with isolated chokes to cut down on whine. Meanwhile the GDDR5X is set up in such a way that each module has direct access to a memory controller for optimum efficiency.

SLI Revisions

SLI has been synonymous with NVIDIA for more than a decade now, but things are changing and that's a good thing since the standard has been a bit stagnant for quite some time. However, let's start this section off by preemptively stating that NVIDIA isn't killing triple and quad SLI support. Rather, they are being somewhat forced into these changes by prevailing market conditions and the future of graphics APIs.

In order to see where NVIDIA is going with everything I'm going to describe below, it's important to understand how the data paths work when more than two cards are involved. In a three or four card setup, the information from the lower GPUs had to be passed through the "higher" ones in a kind of Rube Goldberg-style arrangement since the topmost "master" GPU was the one outputting the rendered frames towards the display.

Since Pascal is capable of processing such a high amount of information and with the advent of UHD HDR, 5K displays and high resolution multi monitor setups, its entire SLI interface needed an overhaul. The PCI-E interface couldn’t be used since even in its 3.0 iteration there was a very real chance of saturating it with both CPU calls and inter-GPU communication. Meanwhile, the existing interconnect on Maxwell and previous generations just couldn’t keep up with the requirements.

Naturally NVIDIA needed a solution so they simply upgraded their existing setup to the point where it is now capable of doubling the legacy connection’s bandwidth. Thus the so-called High Bandwidth SLI (or SLI HB) bridge was created. It is capable of interfacing with two GPUs across a dual link parallelized interconnect running at 650MHz compared to the single channel 400MHz of flexible SLI bridges. On the positive side of things, backwards compatibility of a sort was retained (more on this below) with existing bridges but that doesn’t mean there wasn’t any collateral damage.

With the advent of UHD HDR, 5K displays and high resolution multi monitor setups, that high bandwidth connection will be necessary to deliver an optimal gaming experience on the GTX 1080. The older SLI bridges are still fully compatible but they may encounter understandable bottlenecks when trying to operate in scenarios where massive amounts of information are passed back and forth. Meanwhile, the so-called present generation rigid LED bridges are recommended (again, not required) for anything up to 4K resolution gaming since, when used with a Pascal card, their effective clock speed will be increased to 650MHz, though they will still operate in single channel mode.

This brings us to the HB SLI Bridge which is built for next generation display standards or higher resolution surround setups. The graph above shows one usage scenario with a pair of cards running the new and old SLI bridges while gaming on an 11520×2160 (triple 4K) surround setup. While this situation is truly niche and I doubt even a pair of GTX 1080 cards would achieve acceptable framerates without a significant downgrade in image quality, there is a difference between the two results. The older SLI bridge exhibits quite a few frametime spikes while the newer HB SLI connection avoids most of that drama. Honestly though, whether you will notice the difference between 5ms spikes and ones which extend to about 10ms is debatable, though it is certainly chartable as evidenced by NVIDIA's results above.

The Enthusiast Key – A 3/4-Way SLI Ecosystem Locked Away

Now we come to the crux of the matter: with a rigid 2-way connector, what the hell is NVIDIA doing with triple and quad card SLI? Like the interface changes above this will likely require a relatively long-winded explanation.

Now for the longer explanation, and for that I'm going to have to get into the current state of multi-GPU support, of which there are three modes present in the changes Microsoft has made to DX12: MDA, LDA Implicit and LDA Explicit. MDA, or Multi Display Adapter, was implemented to give developers more control over how and where GPU resources are made available, though each GPU's memory pool remains independent. It also allows for different GPUs to be utilized with one another since they aren't directly linked together with a bridge. Rather, they communicate with one another through a secondary interface like the PCIe bus. NVIDIA is fully supporting this standard going forward. Since MDA's functions and its interaction between the application and GPU are fully controlled by the developer, it could represent a glimmer of hope for anyone who hopes to use more than two Pascal GPUs.

LDA Implicit is where the current iteration of SLI resides. Within its walled garden, the GPU memory pools can interact with one another so they appear to be one large pool of memory to the OS and associated applications. Unlike MDA, Linked Display Adapter Implicit gives the display adapter’s driver stack much of the control over GPU functions and this is where NVIDIA believes they can best affect the experiences of their users. Meanwhile, they don’t support Explicit LDA since like MDA it grants the developer more control over the GPUs’ functions but requires there be a direct link in place.

While NVIDIA would love to sell you more than two $700 GPUs, they’ve found current processors and game technology just can’t keep up with what Pascal has to offer. As a result, positive performance scaling improvements past two cards will be extremely rare and will certainly not happen as consistently as users would expect.

With all of this taken into account, NVIDIA still allows for three or four cards to be used in SLI, but they will no longer be recommending gamers use more than two cards, nor will they be supporting larger setups on Pascal cards. Legacy products in the Maxwell and Kepler families will still be supported though.

If you do want to run more than two Pascal-based cards over SLI (as stated above, MDA is another matter), you’ll need to request something called an Enthusiast Key from NVIDIA. In order to do so, there will be a few steps. No, really.

The first step will be to run an app locally to generate a “signature” for your GPU and then use that to request an Enthusiast Key from the NVIDIA Enthusiast Key website. That key can then be used to unlock 3 and 4-way SLI functionality for Pascal cards (again, older products won’t need to go through this process). Trust me, you aren’t the only one throwing your hands up in the air right now since I also think this is patently ridiculous.

So is NVIDIA blocking or stopping 3+ card SLI? Yes and no. On one hand they won't be actively developing driver patches in a fruitless effort to optimize triple and quad card performance on upcoming games that are actively working to dissuade such setups. However, NVIDIA taking a hands-off approach shouldn't be taken as a bout of ambivalence either. There may be good reason for this new stance: with DX12's Explicit Multiadapter feature allowing for more multi-GPU flexibility, developers now have the ability to build additional functionality for the likes of triple-SLI directly into their applications. It has me wondering whether maybe, just maybe, NVIDIA is doing the right thing here by focusing resources on their core ecosystem while putting the onus on developers to properly implement support for higher levels of multi-card scaling. Time will tell, I guess.

NVIDIA’s GP104 Architecture; A Deep Dive

When designing Pascal, NVIDIA kept an eye on the future while ensuring their engineers kept a firm foothold in the past. So like a new bride, the GP104 core has something old, something new and something borrowed. Naturally, calling this particular version of Pascal an all-new architecture would be a misnomer since there are plenty of elements carried over from Maxwell. The end result is a GP104 core that exceeded initial performance targets and maintains very good yields despite a new manufacturing process. In many ways this exemplifies NVIDIA's approach to gaming-oriented GPU design: use high-margin Tesla products as a proving ground for the future GeForce evolutionary path.

Like Maxwell, the GP104 is broken down into individual Streaming Multiprocessors (or SMs), each containing 128 CUDA cores, eight texture units (arranged in two blocks of four), 256 KB of register file capacity, a 96 KB shared memory unit and 48 KB of total L1 cache storage. Taken at face value, nothing of note has really changed here, but there have been a ton of optimizations built into these units to enhance parallel workload routing, lower latency and boost processing efficiency.

Another interesting change is that the typical PolyMorph Engine (which contains various geometry stages like the Tessellation Unit and Vertex Fetch) has been enhanced with the inclusion of an adaptable Simultaneous Multi Projection unit. This is responsible for generating multiple projections of a single geometry stream as it enters the SMP engine from upstream shader stages. We'll get more into this a bit later in the review but for now let's just say this addition will carry Pascal into next generation display standards and contributes in a big way to NVIDIA's performance claims over Maxwell.

Each of those Streaming Multiprocessors is then combined with an aforementioned PolyMorph Engine to create a Texture Processing Cluster. Within the architecture each of these TPCs, with its 128 cores and eight texture units, is considered an independent processing unit and can be called upon to process workloads ranging from compute to graphics. Since there are 20 of these SMs within the GP104 (versus the 16 within GM204), the Pascal architecture has an incredible amount of workload routing granularity, so one section of the core can handle different tasks without overly impacting game performance. This is particularly important for items such as asynchronous compute, physics processing and other natively parallel workloads.
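For anyone keeping score, those per-unit figures line up with the headline specifications quoted earlier. Here is a quick back-of-the-envelope check in Python; the 4 GPC × 5 TPC layout is an assumption about how the 20 SMs are arranged rather than something spelled out above:

```python
# Sanity check of the GP104 figures quoted in this review.
# Assumed layout: 4 GPCs x 5 TPCs, with one 128-core SM per TPC.
GPCS, TPCS_PER_GPC = 4, 5
CORES_PER_SM, TMUS_PER_SM = 128, 8

sms = GPCS * TPCS_PER_GPC
print(sms)                   # 20 SMs
print(sms * CORES_PER_SM)    # 2560 CUDA cores
print(sms * TMUS_PER_SM)     # 160 texture units
```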

Even from a 35,000 foot view down towards the core design, there haven't been any fundamental changes to the functional blocks when comparing this particular Pascal iteration with Maxwell. Whereas P100 essentially split the Maxwell SM hierarchy in an effort to include more compute-facing logic stages like shared memory, additional load/store units, HBM2, NVLink and expanded register file sizes, GP104 is essentially a GM200 core shrunken down on the 16nm FinFET process node. That isn't to say that NVIDIA simply took Maxwell, boosted clocks and called it a day. Quite the contrary, since there are more than a few improvements baked into the Pascal architecture which will have a substantial impact upon everything from DX12 to VR environments.

Despite all the similarities, there are some minor departures from the GM204. Instead of four streaming multiprocessors per graphics processing cluster, there are now five, which bumps up the CUDA core count by 512 and the TMU count by 32. In addition, since all of Pascal's processing stages run at a higher frequency than Maxwell's, they are also able to process more information.

This brings us to the secondary processing stages consisting of the L2 cache and ROP structure. Despite boosting the number of units which feed into this section by over 20%, neither has received a parallel increase in capacity. There are supposedly some minor optimizations which help remove unwanted caching or ROP-associated bottlenecks, but we have to wonder whether those bottlenecks have been eliminated altogether.

Even though NVIDIA has effectively added quite a few elements to the GP104 core, the 16nm FinFET manufacturing process has allowed them to optimize die size while also keeping TDP relatively constant. Whereas the previous GM204 boasted 5.2 billion transistors spread over a 398mm² die, all of the GP104's advances have been built into a core which has a total of 7.2 billion transistors yet crams them into an area measuring just 314mm².
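To put those die figures into perspective, here is a rough density calculation based solely on the numbers quoted above; treat it as napkin math rather than anything official:

```python
# Rough transistor-density comparison using the die sizes and transistor
# counts quoted above for GM204 and GP104.
gm204_density = 5.2e9 / 398          # transistors per mm^2
gp104_density = 7.2e9 / 314

print(f"GM204: {gm204_density / 1e6:.1f} million transistors per mm^2")  # ~13.1
print(f"GP104: {gp104_density / 1e6:.1f} million transistors per mm^2")  # ~22.9
print(f"Density increase: {gp104_density / gm204_density:.2f}x")         # ~1.75x
```

In other words, 16nm FinFET packs roughly 75% more transistors into every square millimeter, which goes a long way towards explaining how NVIDIA fit more hardware into a noticeably smaller die.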

Naturally, with so much talk about High Bandwidth Memory, many expected NVIDIA to utilize the standard for their upcoming Pascal parts. The problem with that approach is that HBM1 is still in limited supply, the modules are complicated to implement, capacity is limited and its advantage versus competing standards is nebulous at best. HBM2, meanwhile, won't be widely available for another six months at the earliest. With GDDR5 already hitting peak bandwidth numbers, NVIDIA went down the path less travelled and chose Micron's new GDDR5X for the GTX 1080.

With its ability to hit 10Gbps (and higher) speeds without consuming any more power than 7Gbps GDDR5 modules, GDDR5X looks like a perfect bridge solution between traditional GDDR5 and true next generation standards. Plus, since it is loosely based on existing technology, Micron has been able to rapidly ramp up inventory in preparation for the GTX 1080's anticipated popularity.

There has also been a change in Pascal's underlying memory controller layout: instead of a simple 4×64-bit partitioning, the GP104 utilizes a unique 8×32-bit design. This should allow for better load balancing algorithms while also ensuring additional scaling granularity as NVIDIA modifies this core to fit other price points.
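Putting that controller layout and the 10Gbps data rate together gives the card's peak memory bandwidth. A minimal sketch of the arithmetic, using only figures already quoted in this review:

```python
# Peak memory bandwidth implied by the figures above: eight 32-bit
# controllers (256-bit aggregate bus) running GDDR5X at 10 Gbps per pin.
controllers, width_per_controller = 8, 32     # bits
data_rate = 10e9                              # bits per second per pin

bus_width = controllers * width_per_controller   # 256 bits
bandwidth_gbs = bus_width * data_rate / 8 / 1e9  # bits/s -> GB/s
print(f"{bandwidth_gbs:.0f} GB/s")               # 320 GB/s
```

That 320 GB/s figure is the raw number before the color compression discussed below is factored in.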

Similarities between GDDR5 and GDDR5X abound, but actually getting the core and memory to play ball together wasn't a cakewalk. The high speeds delivered by the modules necessitated a new circuit architecture between the modules and the GPU core utilizing extremely fine-grained fabrication methods. These advances should also provide benefits for standard GDDR5 modules, which is good news considering they will be utilized in other Pascal-based products.

Memory Compression Goes Extreme

Every one of NVIDIA's last few graphics architectures, starting with Fermi, has featured some level of lossless delta color compression for optimizing memory bandwidth. The benefits of compression are multi-faceted: it reduces the amount of data transferred between clients like the TMUs and the frame buffer, streamlines L2 cache utilization and minimizes how much data gets written to the memory itself.

While Maxwell improved upon its predecessor with its 2:1 compression ratio, Pascal goes even further than that. NVIDIA thought it was important to not only increase memory bandwidth with GDDR5X but also enhance optimizations built into their architecture with an eye towards boosting efficiency even more.

To accomplish this, Pascal once again has a 2:1 algorithm built in (meaning an optimized packet of information is half the original's size) but it has been enhanced to be effective across a wider swath of usage cases. There are also finer-grain 4:1 and 8:1 compression ratios built into the architecture for further lossless optimization, provided the frame data allows for it.
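To illustrate why a "delta" scheme pays off, here is a deliberately simplified, purely conceptual sketch; the tile size, thresholds and ratio labels are assumptions chosen for illustration and bear no relation to how NVIDIA's hardware actually classifies blocks:

```python
# Purely conceptual: encode a tile of 32-bit pixel values as one base color
# plus per-pixel deltas. Smooth regions need far fewer bits for the deltas,
# which is where 2:1 / 4:1 / 8:1 style savings come from.
def compress_tile(pixels):
    base = min(pixels)
    deltas = [p - base for p in pixels]
    largest = max(deltas)
    if largest == 0:
        return ("~8:1", base)                  # whole tile is a single color
    if largest < 2 ** 8:
        return ("~4:1", base, deltas)          # deltas fit in 8 bits vs 32
    if largest < 2 ** 16:
        return ("~2:1", base, deltas)          # deltas fit in 16 bits vs 32
    return ("uncompressed", pixels)

sky_tile = [0x87CEEB00 + i for i in range(16)]  # sixteen nearly identical colors
print(compress_tile(sky_tile)[0])               # "~4:1"
```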

In NVIDIA's Project CARS example, while Maxwell was effective at compressing most of the scene (the areas in purple), Pascal's new algorithms can take things to the next level without any negative impact on image quality.

When combined with the additional bandwidth granted by the GDDR5X memory, the GTX 1080's effective bandwidth ends up substantially higher than its raw specifications suggest. More importantly, the color compression algorithms are able to reduce the net amount of bandwidth required by games, leaving more overhead for outlier situations and even compute tasks.

Taking Asynchronous Compute to the Next Level

With DX12 on the horizon one of the primary criticisms leveled against Maxwell was its relatively poor performance in situations which demanded the core process multiple independent –or asynchronous- graphics and compute workloads. Things like physics, audio, AI, VR and post-processing can all fall under this umbrella.

For optimal asynchronous throughput, the GPU must not only be able to switch between workloads quickly but it also needs the ability to fully utilize its resources without any of them sitting idle. This was particularly challenging for Maxwell. Although it had hardware-level schedulers (NVIDIA wasn’t using a predominantly software-based solution), they weren’t efficient in some key scenarios, as highlighted by the dual compute and graphics workload in Ashes of the Singularity. It won’t be just this game either since DX12 pushes developers to engage the GPU’s highly parallel nature, allowing them to attain significant speedups when asynchronous threads are properly implemented.

Pascal changes things in a number of ways. First and foremost the hardware schedulers have been updated so they are able to reroute requests at a much quicker pace. They also have a certain amount of forward branch prediction built into their framework so once on-die resources are available, there’s data already coming down the pipeline.

Another feature that has been added is Dynamic Load Balancing. With Maxwell's static partitioning, graphics and compute workloads were run concurrently on separate dedicated partitions. However, both workloads needed to complete at the same time for this method to be efficient; if one finished before the other, a portion of the GPU would sit idle. These idle cycles can eventually build up and negate any performance benefit derived from running parallel operations. Pascal's hardware-based DLB further engages those schedulers so they can offer a hybridized approach to load balancing. As such, the partitions can be dynamically utilized for either compute or graphics workloads, eliminating idle time and boosting performance.
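A toy model helps show why static partitioning leaves performance on the table. This is an assumption-laden sketch of the scheduling problem, not a model of NVIDIA's hardware; the unit counts and workload sizes are made up:

```python
# Toy model: a GPU with 100 execution units split between a graphics
# partition and a compute partition. With a static split, the partition
# that finishes first sits idle; with dynamic load balancing its units
# are reassigned to whatever work remains.
def static_time(gfx_work, cmp_work, gfx_units, cmp_units):
    # Each partition only ever works on its own queue.
    return max(gfx_work / gfx_units, cmp_work / cmp_units)

def dynamic_time(gfx_work, cmp_work, gfx_units, cmp_units):
    # Run until the shorter job finishes, then all units gang up on the rest.
    total_units = gfx_units + cmp_units
    t_first = min(gfx_work / gfx_units, cmp_work / cmp_units)
    remaining = (gfx_work - t_first * gfx_units) + (cmp_work - t_first * cmp_units)
    return t_first + remaining / total_units

gfx_work, cmp_work = 800.0, 200.0      # arbitrary units of work
gfx_units, cmp_units = 70, 30          # a fixed 70/30 split

print(static_time(gfx_work, cmp_work, gfx_units, cmp_units))   # ~11.4
print(dynamic_time(gfx_work, cmp_work, gfx_units, cmp_units))  # 10.0
```

In this contrived case the compute partition finishes early and its 30 units idle while graphics work remains; redistributing them trims total completion time, which is exactly the idle-cycle problem the paragraph above describes.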

Running asynchronous operations goes far beyond partitioning as well, since within each of those situations described above there are thousands, even millions of operations and workloads being run, some of which are more mission critical than others. Actually determining which ones should be prioritized and pushed to the front of the queue is exceedingly difficult. NVIDIA gave a good example of this during briefings, stating there could be an asynchronous timewarp operation which needs to complete before scanout starts or a frame will be dropped.

Traditionally, preemption was performed at a very high level within the graphics pipeline, which could lead to a logjam of data at lower levels. In order to recognize, rationalize and prioritize packets of data or operations, NVIDIA has implemented different levels of preemption. Pixel Level Preemption is a first for modern GPUs and allows each GPC to log its progress in a given task so that if a preempt is requested, it can save context information, process the critical workload, and then pick up where the original task was preempted. This can all happen in just 100 microseconds.

Compute Preemption is being added as well and it acts very much like Pixel Level Preemption, though instead of acting at the pixel level it can preempt workloads at the thread level. When combined with PLP, this feature can allow Pascal to switch workloads at a ridiculously fast pace and enhance performance in situations where parallel asynchronous adaptability is needed.

In addition to standard Compute Preemption done at the thread level, Pascal adds another wrinkle into the fabric by being the first architecture to implement Instruction Level Preemption into the Compute Preemption stack. While this applies solely to CUDA-based tasks like PhysX, VR and other GameWorks enhancements, its benefits are far-reaching. As NVIDIA puts it: "In this mode of operation, when a preemption request is received, all thread processing stops at the current instruction and the state is switched out immediately. This mode of operation involves substantially more state information, because all the registers of every running thread must be saved, but this is the most robust approach for general GPU compute workloads that may have substantial per-thread runtimes."

Simultaneous Multi-Projection & HDR – A Display Revolution?

In today’s world of evolving display standards and different ways of showing viewers brand new worlds, traditional flat panel displays are being supplemented by newer technologies that necessitate some updates to the way GPUs handle their workloads.

While many modern GPU architectures like Maxwell have support for things like VR, AR, surround, curved displays and other scenarios, the way their rendering algorithms are structured causes quite a few inefficiencies. This adds to frame times and could deliver a sub-optimal experience for users. In some cases this semi-compatibility could either require multiple rendering passes, or rendering with overdraw and then warping the image to match the display, or both.

Pascal changes this equation in a big way by the addition of a Simultaneous Multi-Projection Engine functional block into its PolyMorph Engine’s structure. Whereas Maxwell had a strictly limited multi-resolution capability which could either flip a projection or take a single projection direction and proportionally scale the resolution in subregions of the screen, the SMP engine in Pascal can single-handedly process the geometry required for up to 32 concurrent projections. All of this is done with little to no application overhead.

One of the key design considerations for the SMP Engine was its placement within Pascal's typical workflow. Since its functionality sits after the geometry pipeline, the application saves all the work that would otherwise need to be performed in upstream shader stages. As you can imagine, with this being a completely hardware-accelerated engine, the data stream never leaves the chip itself, so there are massive efficiencies realized for high level geometry processing (like tessellation) within a projection environment.

From a purely practical standpoint, the Simultaneous Multi-Projection Engine can net some massive improvements for VR environments. In these scenarios, the GPU is forced to process two low latency concurrent projections towards the eyes. This in effect doubles the amount of processing power which is traditionally needed but Pascal’s new functional block can process the original geometry workloads at double the speed and it will also have a net positive effect upon pixel rendering.

One example application of SMP is optimal support for surround displays. Traditionally the left and right panels are tilted slightly, allowing for a completely enveloping environment but in all cases this causes visible image warping. The correct way to render to a surround display is with a different projection for every one of the three displays, matching the display angle and eliminating that fish-eye effect we’ve come to associate surround with.

Pascal’s SMP engine can work towards eliminating this problem by specifying three concurrent yet separate projections, each corresponding to the appropriately tilted monitor. Now, you’ll be able to completely modify the angle at which the left and right monitors display their image and you will see the graphics rendered with geometrically correct perspectives, at a much wider field of view. With that being said, an application using SMP to generate surround display images must support wide FOV settings, and also use SMP API calls to enable the wider FOV. In plain English this means developers will need to build this functionality into their games even though it is natively supported at the driver level.
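Conceptually, what SMP automates is generating one correctly angled projection per panel. The sketch below is just that concept expressed in code (the tilt angle and field-of-view values are arbitrary); it is not an example of the actual SMP API calls developers would use:

```python
# Conceptual only: one projection definition per surround panel, each with
# a camera yaw matching the physical monitor tilt, instead of a single
# ultra-wide projection that warps the side panels.
import math

def surround_projections(panel_fov_deg=60.0, tilt_deg=60.0):
    views = []
    for i, name in enumerate(("left", "center", "right")):
        yaw = (i - 1) * tilt_deg            # -tilt, 0, +tilt degrees
        views.append({
            "panel": name,
            "yaw_degrees": yaw,
            "fov_radians": math.radians(panel_fov_deg),
        })
    return views

for view in surround_projections():
    print(view)
```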

Introducing Fast Sync

Over the last few years we have been treated to something of a revolution in the way flat panels cater to gamers. Essentially both display manufacturers and GPU industry luminaries like NVIDIA and AMD saw that one of the last great hurdles to overcome was how well onscreen movement was replicated.

Even though high refresh rate displays were great for a gamer's response times, the way they actually presented moving images was filled with problems. Screen tearing and jitter abound when a graphics card throws out frames faster than a monitor can keep up with, and that can be just as distracting as a low framerate. Meanwhile, V-Sync can effectively eliminate screen tearing but it does so by adding significant judder and mouse lag into the equation. Additional post-process buffering algorithms were also implemented but they had their own set of problems.

Figuring it was time to offer a solution which took care of screen tearing and judder, both NVIDIA and AMD implemented G-SYNC and FreeSync respectively. I’d recommend you read both of those linked articles since they cover the basics quite well and will act as a good primer for this article.

As a gamer I consider FreeSync and G-SYNC life changing, but their intrinsic benefits operate within a preset zone, the top of which is defined by a monitor's maximum refresh rate. In plain English this means that much like V-Sync, both solutions will have their framerate capped at your panel's refresh rate and still feature some mouse latency tradeoffs. In some in-game situations this is perfectly fine, but what about less demanding games that can display potentially hundreds of frames per second? Many competitive titles like CS:GO, Overwatch, DOTA 2 and League of Legends are currently best played with every panel synchronization technology turned off since players feel the ultra low latency afforded by uncapped framerates gives them a competitive edge. Unfortunately, the aforementioned onscreen distractions were just something they needed to get used to. Until now, that is.

Enter Fast Sync into the equation and things change quite significantly. In a nutshell Fast Sync is a technology being pioneered by NVIDIA which complements G-SYNC by offering visual quality improvements in ultra high framerate situations where input latency and fast response times are paramount. Since these operative cases typically lie well above G-SYNC’s effective range, NVIDIA is confident both initiatives can live within the same ecosystem.

To understand what Fast Sync does, let's take a look at a very simplified version of the rendering pipeline and extrapolate outwards. In a typical scenario, information and draw calls get passed from the game engine onto the driver stack through the DirectX API and then handed off to the GPU. The GPU renders out frames to the frame buffer, whereupon they are scanned out towards the display.

V-Sync steps boldly into this equation by ordering the game to slow down frame delivery to match the panel’s maximum refresh rate (be it 60Hz, 100Hz, etc.) so only one frame can be generated for every display refresh. As you can imagine, this causes an immense amount of backpressure upon the entire rendering pipeline and introduces a high amount of input latency.

Turn off V-Sync and the pipeline pretty much ignores the monitor’s refresh rate, delivers frames as they are complete and generally runs around causing havoc by spitting out more information than the monitor can hope to display. Think of this like trying to take a drink from a fire hose; some of the water might get into your mouth but the rest will spill all over the place.

Fast Sync as it is being implemented on Pascal moves beyond that typical pipeline process and decouples the rendering and display functions. This allows the rendering stage to continually generate new frames from data sent by the game engine and driver at full speed, and those frames can be temporarily stored in the GPU frame buffer. If needed, they are used straight away or they can be discarded if a newer one is coming down the pipeline.

From what I understand, a decoupled backend could also lead to other technological breakthroughs in the way frames are handled within the rendering pipeline. As such, Fast Sync could represent the tip of a very large iceberg and it may just be a matter of time before additional features based off of this are announced.

Let's bring things back to the here and now though. With Fast Sync, there is no flow control as there would be with V-Sync on. The game engine works as if V-Sync is off, so various bits of data are handed off to the GPU as quickly as the architecture demands.

More importantly, since there is no backpressure, input latency is almost as low as within a V-Sync off environment but there’s a distinct lack of tearing since the Fast Sync algorithm chooses which of the rendered frames to scan to the display. As such the pipeline’s primary stages run as fast as they can while Fast Sync determines which frames to scan out to the display, while simultaneously preserving entire frames so they are displayed without tearing. It’s a brilliant process in theory but there are a ton of moving pieces, all of which have to work together in seamless harmony to meet the demands of serious gamers.
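The logic is easier to picture as a tiny software analogue. The sketch below is my own approximation of the decoupled behaviour described above, not NVIDIA's implementation: rendering runs flat out into a small pool of completed frames, and scanout grabs the newest whole frame at each refresh.

```python
# Simplified analogue of a decoupled render/scanout path (assumed behaviour).
from collections import deque

completed = deque(maxlen=2)   # keeps only the most recent completed frames

def render_loop(frame_id):
    # Runs as fast as the GPU allows; no back-pressure from the display,
    # older never-shown frames simply fall out of the pool.
    completed.append(frame_id)

def scanout():
    # Called once per display refresh; always returns a whole frame
    # (so no tearing), and always the newest one available.
    return completed[-1] if completed else None

for frame in range(10):       # GPU produces 10 frames between refreshes...
    render_loop(frame)
print(scanout())              # ...the display only ever shows the newest: 9
```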

The end result of all this wizardry is a gaming experience which is remarkably close to the ultra low input lag which a V-Sync Off setting offers, without some of the associated visual sacrifices. The processing involved with sifting through frame data isn't completely latency-free though and it does impart a small amount of hesitation into the scene. However, the additional 8ms will be all but transparent to the human eye.

One important thing to take note of is that since these functions take place directly in the GPU’s rendering pipeline, Fast Sync is completely monitor agnostic. It will work just as well on your four year old 1080P 60Hz monitor as it will on today’s latest and greatest. Unfortunately it will only be available on Pascal since the new architecture is the only one with that additional back-end buffering stage and extra flip logic for sifting through frame data.

Regardless of its Pascal exclusivity, when combined with tertiary elements like Ultra Low Motion Blur, Fast Sync could be a game changer in both the professional scene and on your own gaming monitor.

NVIDIA’s HDR Initiative

Some will argue, quite successfully, that High Dynamic Range televisions are the way of the future. HDR is supposed to offer up massive color palettes and bring onscreen images that much closer to what we see in everyday life. From contrast ratio to the number of colors that can be displayed to brightness, displays with HDR certification are supposed to be superior in every way.

AMD has already announced that their upcoming Polaris architecture will feature a whole barrel of monkeys' worth of support for this new standard but, other than a few minor talking points about Maxwell's adaptability, NVIDIA has been quite silent. They've obviously been waiting for Pascal to launch their 21 gun salute to all things HDR.

Alongside a second coming of Maxwell's support for 12-bit color depth, BT.2020 wide color gamut, SMPTE 2084 (Perceptual Quantization) and HDMI 2.0b 10/12b for 4K HDR, Pascal's display controller has been further upgraded. It rolls in 4K@60 10/12-bit HEVC decode capabilities, which will eventually become a key component of HDR video streams, as well as 4K@60 10-bit HEVC encode.

Naturally the DisplayPort connectors haven’t been left flapping in the wind and they’ve been upgraded to a 1.4-Ready specification. The operative word here is “ready” since, due to the 1.4 version’s newness (it was just ratified in March) the GTX 1080 hasn’t been officially certified yet.

Perhaps one of the least talked-about aspects of this launch is how the GTX 1080 can positively impact your home gaming environment. When used in conjunction with the SHIELD console and an HDR-supporting game (remember support for HDR has to be baked into a game’s engine) you will be able to use GameStream to play HDR titles on your new TV. This is a great stop-gap gaming method since HDR-supporting PC displays won’t be widely available until sometime in 2017.

For those of you wondering why the SHIELD console (which uses a Maxwell-derivative graphics subsystem in its Tegra X1 SoC) can be utilized for GameStream HDR while 900-series GPUs can't, don't worry, NVIDIA isn't pulling the wool over your eyes. The Pascal card is an essential component in the setup since it incorporates hardware support for the aforementioned 10-bit HEVC standard whereas Maxwell didn't. Interestingly enough, the Tegra X1 also has this updated hardware built into its svelte frame.

Since there are plenty of HDR-facing technologies coming and UHD TVs have already started rolling out their Premium HDR branding, game developers are racing to catch up. Hence, as the year draws to a close, we will likely see a pretty wide selection of supporting games as well. Exciting stuff!

Test System & Setup

Processor: Intel i7 5960X @ 4.3GHz
Memory: G.Skill Trident X 32GB @ 3000MHz 15-16-16-35-1T
Motherboard: ASUS X99 Deluxe
Cooling: NH-U14S
SSD: 2x Kingston HyperX 3K 480GB
Power Supply: Corsair AX1200
Monitor: Dell U2713HM (1440P) / Acer XB280HK (4K)
OS: Windows 10 Pro

Drivers:
AMD Radeon Software 16.5.2
NVIDIA 368.14 WHQL

*Notes:

– All games tested have been patched to their latest version

– The OS has had all the latest hotfixes and updates installed

– All scores you see are the averages after 3 benchmark runs

– All IQ settings were adjusted in-game and all GPU control panels were set to use application settings

The Methodology of Frame Testing, Distilled

How do you benchmark an onscreen experience? That question has plagued graphics card evaluations for years. While framerates give an accurate measurement of raw performance, there's a lot more going on behind the scenes which a basic frames per second measurement by FRAPS or a similar application just can't show. A good example of this is how "stuttering" can occur but may not be picked up by typical min/max/average benchmarking.

Before we go on, a basic explanation of FRAPS’ frames per second benchmarking method is important. FRAPS determines FPS rates by simply logging and averaging out how many frames are rendered within a single second. The average framerate measurement is taken by dividing the total number of rendered frames by the length of the benchmark being run. For example, if a 60 second sequence is used and the GPU renders 4,000 frames over the course of that time, the average result will be 66.67FPS. The minimum and maximum values meanwhile are simply two data points representing single second intervals which took the longest and shortest amount of time to render. Combining these values together gives an accurate, albeit very narrow snapshot of graphics subsystem performance and it isn’t quite representative of what you’ll actually see on the screen.
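That math is simple enough to spell out in a few lines of Python (the per-second counts below are a hypothetical log excerpt, not real benchmark data):

```python
# The FRAPS-style calculation described above.
frames_rendered = 4000
benchmark_seconds = 60
average_fps = frames_rendered / benchmark_seconds
print(f"{average_fps:.2f} FPS")            # 66.67 FPS

# Min/max come from single-second samples, i.e. per-second frame counts.
per_second_counts = [58, 72, 66, 41, 70]   # hypothetical log excerpt
print(min(per_second_counts), max(per_second_counts))
```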

FCAT on the other hand has the capability to log onscreen average framerates for each second of a benchmark sequence, resulting in the "FPS over time" graphs. It does this by simply logging the reported framerate result once per second. However, in real world applications a single second is actually a long period of time, meaning the human eye can pick up on onscreen deviations much quicker than this method can report them. So what actually happens within each second? A whole lot, since each second of gameplay can consist of dozens or even hundreds (if your graphics card is fast enough) of frames. This brings us to frame time testing and where the Frame Time Analysis Tool factors into the equation.

Frame times simply represent the length of time (in milliseconds) it takes the graphics card to render and display each individual frame. Measuring the interval between frames allows for a detailed millisecond by millisecond evaluation of frame times rather than averaging things out over a full second. The larger the amount of time, the longer each frame takes to render. This detailed reporting just isn’t possible with standard benchmark methods.

We are now using FCAT for ALL benchmark results in DX11.

DX12 Benchmarking

For DX12 many of these same metrics can be utilized through a simple program called PresentMon. Not only does this program have the capability to log frame times at various stages throughout the rendering pipeline but it also grants a slightly more detailed look into how certain API and external elements can slow down rendering times.

Since PresentMon throws out massive amounts of frametime data, we have decided to distill the information down into slightly more easy-to-understand graphs. Within them, we have taken several thousand datapoints (in some cases tens of thousands), converted the frametime milliseconds over the course of each benchmark run into frames per second and then graphed the results. This gives us a straightforward framerate over time graph. Meanwhile the typical bar graph averages out every data point as it's presented.
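For the curious, the conversion itself is nothing exotic. Below is a minimal sketch of how a PresentMon frametime log becomes a framerate-over-time series plus the single average that feeds the bar graphs; the file path is obviously a placeholder, and MsBetweenPresents is the frametime column in PresentMon's CSV output.

```python
# Sketch: turn a PresentMon CSV log into instantaneous FPS values plus an
# average for the bar graphs. Path and usage are illustrative only.
import csv

def fps_from_presentmon(path):
    instantaneous = []
    with open(path, newline="") as log:
        for row in csv.DictReader(log):
            frametime_ms = float(row["MsBetweenPresents"])  # per-frame interval
            instantaneous.append(1000.0 / frametime_ms)     # convert to FPS
    average = sum(instantaneous) / len(instantaneous)
    return instantaneous, average

# Example: fps_curve, avg = fps_from_presentmon("ashes_dx12_run1.csv")
```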

One thing to note is that our DX12 PresentMon results cannot and should not be directly compared to the FCAT-based DX11 results. They should be taken as a separate entity and discussed as such.

Ashes of the Singularity

Ashes of the Singularity is a real time strategy game on a grand scale, very much in the vein of Supreme Commander. While this game is most known for its asynchronous workloads through the DX12 API, it also happens to be pretty fun to play. While Ashes has a built-in performance counter alongside its benchmark utility, we found it to be highly unreliable, often posting substantial run-to-run variation. With that in mind we still used the onboard benchmark since it eliminates the randomness that arises when actually playing the game, but utilized the PresentMon utility to log performance.


Fallout 4

The latest iteration of the Fallout franchise is a great looking game with all of its details turned up to their highest levels, but it also requires a huge amount of graphics horsepower to properly run. For this benchmark we complete a run-through from within a town, shoot up a vehicle to test performance when in combat and finally end atop a hill overlooking the town. Note that VSync has been forced off within the game's .ini file.


Far Cry 4

This entry in Ubisoft's Far Cry series picks up where the others left off by boasting some of the most impressive visuals we've seen. In order to emulate typical gameplay we run through the game's main village, head out through an open area and then transition to the lower areas via a zipline.


Grand Theft Auto V

In GTA V we take a simple approach to benchmarking: the in-game benchmark tool is used. However, due to the randomness within the game itself, only the last sequence is actually used since it best represents gameplay mechanics.


Hitman (2016)

The Hitman franchise has been around in one way or another for the better part of a decade and this latest version is arguably the best looking. Adjustable to both DX11 and DX12 APIs, it has a ton of graphics options, some of which are only available under DX12.

For our benchmark we avoid using the in-game benchmark since it doesn’t represent actual in-game situations. Instead the second mission in Paris is used. Here we walk into the mansion, mingle with the crowds and eventually end up within the fashion show area.


Rise of the Tomb Raider

Another year and another Tomb Raider game. This time Lara’s journey continues through various beautifully rendered locales. Like Hitman, Rise of the Tomb Raider has both DX11 and DX12 API paths and incorporates a completely pointless built-in benchmark sequence.

The benchmark run we use is within the Soviet Installation level where we start at about the midpoint, run through a warehouse area with some burning huts and then finish inside a fenced-in area during a snowstorm.


Star Wars Battlefront

Star Wars Battlefront may not be one of the most demanding games on the market but it is quite widely played. It also looks pretty good due to it being based upon DICE's Frostbite engine, and it has been highly optimized.

The benchmark run in this game is pretty straightforward: we use the AT-ST single player level since it has predetermined events and it loads up on many in-game special effects.


The Division

The Division has some of the best visuals of any game available right now even though its graphics were supposedly downgraded right before launch. Unfortunately, actually benchmarking it is a challenge in and of itself. Due to the game’s dynamic day / night and weather cycle it is almost impossible to achieve a repeatable run within the game itself. With that taken into account we decided to use the in-game benchmark tool.


Witcher 3

Other than being one of 2015’s most highly regarded games, The Witcher 3 also happens to be one of the most visually stunning as well. This benchmark sequence has us riding through a town and running through the woods; two elements that will likely take up the vast majority of in-game time.


Ashes of the Singularity

Ashes of the Singularity is a real time strategy game on a grand scale, very much in the vein of Supreme Commander. While this game is most known for is Asynchronous workloads through the DX12 API, it also happens to be pretty fun to play. While Ashes has a built-in performance counter alongside its built-in benchmark utility, we found it to be highly unreliable and often posts a substantial run-to-run variation. With that in mind we still used the onboard benchmark since it eliminates the randomness that arises when actually playing the game but utilized the PresentMon utility to log performance


Fallout 4

The latest iteration of the Fallout franchise is a great looking game with all of its detailed turned to their highest levels but it also requires a huge amount of graphics horsepower to properly run. For this benchmark we complete a run-through from within a town, shoot up a vehicle to test performance when in combat and finally end atop a hill overlooking the town. Note that VSync has been forced off within the game’s .ini file.


Far Cry 4

This game Ubisoft’s Far Cry series takes up where the others left off by boasting some of the most impressive visuals we’ve seen. In order to emulate typical gameplay we run through the game’s main village, head out through an open area and then transition to the lower areas via a zipline.


Grand Theft Auto V

In GTA V we take a simple approach to benchmarking: the in-game benchmark tool is used. However, due to the randomness within the game itself, only the last sequence is actually used since it best represents gameplay mechanics.


Hitman (2016)

The Hitman franchise has been around in one way or another for the better part of a decade and this latest version is arguably the best looking. Adjustable to both DX11 and DX12 APIs, it has a ton of graphics options, some of which are only available under DX12.

For our benchmark we avoid using the in-game benchmark since it doesn’t represent actual in-game situations. Instead the second mission in Paris is used. Here we walk into the mansion, mingle with the crowds and eventually end up within the fashion show area.


Rise of the Tomb Raider

Another year and another Tomb Raider game. This time Lara’s journey continues through various beautifully rendered locales. Like Hitman, Rise of the Tomb Raider has both DX11 and DX12 API paths and incorporates a completely pointless built-in benchmark sequence.

The benchmark run we use is within the Soviet Installation level where we start in at about the midpoint, run through a warehouse with some burning its and then finish inside a fenced-in area during a snowstorm.[/I]


Star Wars Battlefront

Star Wars Battlefront may not be one of the most demanding games on the market but it is quite widely played. It also looks pretty good due to it being based upon Dice’s Frostbite engine and has been highly optimized.

The benchmark run in this game is pretty straightforward: we use the AT-ST single player level since it has predetermined events and it loads up on many in-game special effects.


Quantum Break

Years from now people likely won’t be asking if a GPU can play Crysis; they’ll be asking if it was up to the task of playing Quantum Break with all settings maxed out. This game was launched as a horribly broken mess but it has evolved into an amazing-looking tour de force of graphics fidelity. It also happens to be a performance killer.

Though finding an area within Quantum Break to benchmark is challenging, we finally settled upon the first level where you exit the elevator and find dozens of SWAT team members frozen in time. It combines indoor and outdoor scenery along with some of the best lighting effects we’ve ever seen.


Analyzing Temperatures & Frequencies Over Time

Modern graphics card designs make use of several advanced hardware- and software-facing algorithms in an effort to hit an optimal balance between performance, acoustics, voltage, power and heat output. Traditionally this leads to maximized clock speeds within a given set of parameters. Conversely, if either of those last two metrics (heat and power consumption) becomes a limiting factor, it is quite likely that voltages and the resulting core clocks will be reduced to ensure the GPU remains within design specifications. We’ve seen this happen quite aggressively on some AMD cards while NVIDIA’s reference cards also tend to fluctuate their frequencies. To be clear, this is a feature by design rather than a problem in most situations.

In many cases clock speeds won’t be touched until the card in question reaches a preset temperature, whereupon the software and onboard hardware work in tandem to carefully regulate other areas such as fan speeds and voltages to ensure maximum frequency output without an overly loud fan. Since this algorithm typically doesn’t kick into full force within the first few minutes of gaming, the “true” performance of many graphics cards won’t be revealed by a typical 1-3 minute benchmarking run. This is why we use a 10-minute warm-up period before all of our benchmarks.
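
To give an idea of how this warm-up behaviour can actually be captured, the sketch below logs core temperature, graphics clock, fan speed and power draw once per second via nvidia-smi’s query interface. It is an illustration rather than our internal tooling, and the output file name is just a placeholder.

```python
# Minimal sketch: log GPU temperature, graphics clock, fan speed and power
# draw once per second using nvidia-smi's query interface. Illustrative only.
import csv
import subprocess
import time

QUERY = "timestamp,temperature.gpu,clocks.gr,fan.speed,power.draw"

def log_gpu(duration_s: int = 600, out_path: str = "gpu_warmup_log.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(QUERY.split(","))  # header row
        end = time.time() + duration_s
        while time.time() < end:
            sample = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
            writer.writerow([v.strip() for v in sample.split(",")])
            time.sleep(1)

if __name__ == "__main__":
    log_gpu(duration_s=600)  # ten minutes, matching the warm-up period above
```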

For now, let’s see how these new algorithms are used when the card is running at default speeds.

Despite the fact NVIDIA showed a GTX 1080 running at over 2GHz and under 70°C at their launch event, the reality is quite different. Here we see the core temperature gradually rising to around 83°C before it is eventually tamed by a combination of the heatsink’s fan speed and voltage / clock speed manipulation. While this is far from the GTX 1080’s official throttle temperature of 90°C, the trickle-down effect it has upon clock speeds and performance is quite interesting to see.

While temperatures climb quite steadily, the chart above makes it apparent that NVIDIA’s fan speed profile doesn’t really react until the core reaches approximately 70°C, whereupon it begins a gradual ramp-up before hitting a plateau near 2,000 RPM. That stately and sedate RPM increase is extremely welcome given the noticeable noise caused by the GTX 980 Ti’s rapid ascension to its fan’s rotational plateau, but it isn’t quite as lethargic as the GTX 980’s profile either.

Since GPU Boost ensures there’s a direct correlation between the core’s temperature and its maximum frequency, the GTX 1080’s speed starts fluctuating a bit as it gets hotter. The reduction in frequency is minor in the grand scheme of things (about 150MHz) but we have certainly seen better results from NVIDIA in this metric. To make matters even more interesting, it seems like the GTX 1080’s aggressive power draw limiter is behind this frequency drop-off rather than temperatures.

Naturally, with the frequency reduction there’s a corresponding framerate hit as well. In this case Rise of the Tomb Raider goes from running at 75FPS to hovering around 72FPS, so we’re looking at an approximate 4% reduction in real-world performance. To avoid this you will need to use a custom fan profile (just expect a corresponding noise increase as well) through an application like EVGA’s Precision, or wait for custom cooled GTX 1080 cards from NVIDIA’s board partners. For the record, this isn’t exactly a great result for a card that commands a $100 premium over what will likely be very well behaved alternate solutions from EVGA, ASUS, Gigabyte, MSI and others.

Thermal Imaging

As with all full-coverage coolers, there really isn’t much to see on the thermal imaging shots other than a small hot spot directly below the core on the backplate. Since all of the heat is effectively exhausted out the back, there should be very little worry about temperature problems with motherboard-mounted components.

Acoustical Testing

What you see below are the baseline idle dB(A) results attained for a relatively quiet open-case system (specs are in the Methodology section) sans GPU, along with the results for each individual card in idle and load scenarios. The meter we use has been calibrated and is placed at seated ear-level exactly 12” away from the GPU’s fan. For the load scenarios, Rise of the Tomb Raider is used in order to generate a constant load on the GPU(s) over the course of 15 minutes.

As we’ve already seen, the GTX 1080’s fan curve is lethargic, possibly to the point of not being aggressive enough. With that being said, this is a pretty quiet card which is only beaten out by a few substantially less powerful options. It should be interesting to see what board partners do with this thing since I’m sure they will have cooler designs that deliver lower temperatures and even quieter noise levels.

System Power Consumption

For this test we hooked up our power supply to a UPM power meter that logs the power consumption of the whole system twice every second. In order to stress the GPU as much as possible we used 15 minutes of Rise of the Tomb Raider, while letting the card sit at a stable Windows desktop for 15 minutes to determine peak idle power consumption.
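
As a rough illustration of how a log like that gets reduced to the figures in the chart, the sketch below assumes a simple two-column CSV (timestamp, watts) export; the real meter’s format and the file names will differ, so treat them as placeholders.

```python
# Rough sketch: reduce a power-meter log (sampled twice per second) to the
# peak idle and peak load figures. The two-column CSV layout (timestamp,
# watts) and the file names are placeholders, not the meter's real format.
import csv

def peak_watts(log_path: str) -> float:
    watts = []
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                watts.append(float(row["watts"]))
            except (KeyError, ValueError):
                continue  # skip malformed rows
    return max(watts)

if __name__ == "__main__":
    # Hypothetical file names for the two 15-minute logging runs.
    print(f"Peak idle system draw: {peak_watts('idle_log.csv'):.0f} W")
    print(f"Peak load system draw: {peak_watts('tomb_raider_log.csv'):.0f} W")
```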

With its 16nm manufacturing process, advanced power delivery system and relatively cool temperatures, it should be no surprise that the GTX 1080 is a power miser. When you take these results and combine them with the absolutely spectacular performance numbers, this becomes the absolute best performance per watt GPU available by a country mile. It really is incredible what NVIDIA has accomplished here.

Overclocking Results

Overclocking the GTX 1080 and all upcoming Pascal cards will be a different affair from Maxwell, Kepler and other NVIDIA cards. While the Power, Voltage and Temperature limits are all present and accounted for once again, some additional features have been added in an effort to squeeze every ounce of extra clock speed out of these cores.

NVIDIA’s GTX 1080 takes this clock speed balancing act to the next level with GPU Boost 3.0, which utilizes all of the additional measurement points and event prediction from the 2.0 iteration and builds in even more granularity for overclocking. GPU Boost 2.0 utilized a fixed frequency offset which adjusted clock speeds and voltages based on a single voltage-based linear multiplier. While this did allow for good frequency scaling relative to the attainable maximum, some performance was still left on the cutting room floor.

GPU Boost 3.0 endeavors to rectify this situation by giving overclockers the ability to adjust clock speeds and their associated voltages across several points along the voltage curve, thus (in theory) delivering better performance than would otherwise have been attained. Basically, since Pascal features many different voltage read points, they can be adjusted one at a time to ensure frequencies at each level are maximized.
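
To make the distinction a little more concrete, here is a purely illustrative sketch contrasting a single global offset (the GPU Boost 2.0 approach) with independent per-point offsets along the voltage / frequency curve (the GPU Boost 3.0 approach). The voltages, frequencies and point count are invented for the example and have nothing to do with NVIDIA’s actual tables.

```python
# Purely illustrative: contrast a single global offset (GPU Boost 2.0 style)
# with per-point offsets (GPU Boost 3.0 style) along a voltage/frequency
# curve. The voltages, frequencies and point count below are invented.
base_curve = [  # (voltage in volts, stock frequency in MHz)
    (0.80, 1500), (0.90, 1650), (1.00, 1750), (1.06, 1850),
]

def boost2_style(curve, offset_mhz):
    """One global offset applied to every point (old behaviour)."""
    return [(v, f + offset_mhz) for v, f in curve]

def boost3_style(curve, per_point_offsets_mhz):
    """An independent offset per voltage point (new behaviour)."""
    return [(v, f + off) for (v, f), off in zip(curve, per_point_offsets_mhz)]

# A global +150MHz is capped by the least stable point, while per-point
# offsets let the more stable low-voltage points climb further.
print(boost2_style(base_curve, 150))
print(boost3_style(base_curve, [220, 200, 170, 150]))
```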

EVGA’s Precision tool has been upgraded to take advantage of this new functionality by adding three different voltage scaling options: Basic, Linear and Manual. Within the new dialog box you will find 30 small bars, each of which indicates a separate voltage point which can be increased or decreased based on stability. In addition, each shows the final attainable voltage and Base clock speed if the overclock is tested as stable.

In Basic mode you simply click a point above one of the voltage bars and a green line moves upwards, showing an approximate visual level for each point. This is extremely straightforward but it is really just a quick and easy route to a no-nonsense overclock. In testing I found this to be the least exacting way of squeezing performance out of the GTX 1080.

I wasn’t particularly successful with the Linear option either; even though it offers a bit more flexibility, it still felt limiting because the targets move in a very predetermined pattern.

The fully Manual area is where I achieved the best results since it grants complete control over the entire system and each point’s voltage level. They can each be adjusted in such a way that stability is virtually assured if you spend enough time at it. With these in hand you can (and should!) slowly but surely manipulate them until the optimal combination of voltage increases (the “height” of each bar) and clock speeds is found. Boosting a single bar to its maximum value is a surefire way to fail though, since one voltage point cannot hope to be stable with the weight of an overclock pressing down on it. Granted, this option requires a substantial investment in terms of effort but the payoffs are certainly there, as you will see in the results below.

If you choose to skip using your own judgment in this respect, EVGA has a Scan tool in the Manual section which will slowly adjust each point, run their OC Scanner utility to test for stability and then move on to the next point. While it wasn’t completely working during our testing, it should be ready in time for launch.
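
Conceptually that scan boils down to something like the loop sketched below. This is a simplified illustration of the idea rather than EVGA’s actual code; the stability test, step size and offset ceiling are all stand-ins.

```python
# Simplified illustration of a per-point frequency scan: raise each voltage
# point's offset until a stability test fails, then keep the last good value.
# The stability check, step size and limits are stand-ins, not EVGA's code.
from typing import Callable, List

def scan_points(num_points: int,
                is_stable: Callable[[int, int], bool],
                step_mhz: int = 25,
                max_offset_mhz: int = 300) -> List[int]:
    best_offsets = []
    for point in range(num_points):
        offset = 0
        while (offset + step_mhz <= max_offset_mhz
               and is_stable(point, offset + step_mhz)):
            offset += step_mhz  # keep climbing while the test run passes
        best_offsets.append(offset)  # last offset that tested as stable
    return best_offsets

# Example with a fake stability test: lower voltage points tolerate less offset.
fake_test = lambda point, offset: offset <= 100 + point * 50
print(scan_points(num_points=4, is_stable=fake_test))  # -> [100, 150, 200, 250]
```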

So with all of this being said, what did I end up settling on as an overclock?

Well, that isn’t all that shabby now is it? After more than two hours working in Precision the core was able to hit a constant Boost speed of 2126MHz, representing a pretty shocking 400MHz+ overclock when compared to the standard load results I achieved. The memory got in on the party as well with a decent showing at 5670MHz. This was all accomplished while maintaining a fan speed of 55%, which proved quiet enough not to disturb gaming sessions.

Now actually getting to this point wasn’t easy, and it should be noted that even though NVIDIA’s cooler is quite capable, its default fan profile is absolute garbage if you intend on overclocking. I also smashed head first into a voltage and Power Limit wall, so even if I had been able to achieve higher theoretical frequencies, they would have been dragged back down to earth by NVIDIA’s Boost algorithms.

Performance scaling was extremely decent as well, as evidenced by the in-game results below. OK, now I’m being facetious… the resulting performance from these overclocks is mind-blowing. I want to put this into context for everyone: when overclocked, the GTX 1080 can achieve performance that’s close to TWO GTX 980 Ti’s in SLI.



Conclusion: Mission Accomplished & Then Some!

NVIDIA’s GTX 1080 represents something almost unique in today’s computer component market, a space that has been continually subjected to incremental improvements from one product generation to the next. I can’t remember the last time a product allowed me to write from the heart instead of trying to place some kind of positive spin on the latest yearly stutter that may have brought a bit more performance to the table. Pascal, and by extension the GTX 1080, have changed that in a big way by offering a leap forward in terms of graphical efficiency, overall performance and a top-to-bottom feature set. Not only am I excited about what this kind of launch does to the competitive landscape (they say challenges breed innovation) but I’m also anxious to see what developers will accomplish with this newfound horsepower.

To say the GTX 1080 exceeded expectations would be understating things by an order of magnitude. While NVIDIA did spill some of the beans with their nebulous but nonetheless cheer-inducing launch event performance graphs, the full reality of the situation is still a bit awe-inspiring. What’s been accomplished here is a generational performance shift of a size not seen since Fermi launched and ushered in the DX11 age. And yet for a multitude of reasons Pascal is more impressive than Fermi ever was.

From a raw framerate standpoint it’s impossible not to be impressed with the GTX 1080. Rather than toeing the usual conservative inter-generational line of 15-20% increases that keep buyers of the last architecture’s flagships content, it demolishes preconceptions. Remember, like the GTX 780 before it, the GTX 980 offered performance that was about 30% better than its direct predecessor. With the GTX 1080 we are seeing a roughly 68% improvement over the GTX 980’s framerates at 1440P, and in some memory-limited scenarios the delta between these cards can reach much higher than that. For GTX 680 and GTX 780 users, this thing will act like a massive defibrillator shock for their systems’ flagging performance in today’s latest titles.

Against the GTX 980 Ti, a card that launched for $649 and until just recently was considered an ultra high end option, the GTX 1080 actually looks like a viable upgrade path, particularly when overclocked. That’s something that could never have been said about the GTX 780 Ti to GTX 980 transition. Not only does it offer 35% higher (on average) framerates than NVIDIA’s erstwhile flagship but it does so while consuming less power. While we couldn’t add it into these charts in time, the $999 TITAN X is about 3% faster than the GTX 980 Ti so it would still be beaten like a lazy donkey by NVIDIA’s latest 104-series core. Looking at these results, I can’t help but be anxious for what the GTX 1070 could potentially bring to the table for more budget-conscious gamers.

With all of this being taken into account, the GTX 1080 is able to walk all over the R9 Fury X too, at least in DX11 situations. NVIDIA is obviously marching to the beat of a different drummer but don’t count AMD out of the fight just yet. Looking past the initial numbers versus the GTX 1080, we can see AMD’s driver team has been able to leverage their architecture’s strengths, and the Fury X is now able to step ahead of NVIDIA’s GTX 980 Ti more often than not.

DX12 actually proved to be an interesting counterpoint, but one which was absolutely detrimental to the GTX 980. In many applications its memory bandwidth was simply overwhelmed, leading to the GTX 1080 nearly doubling up on its results. There was some face-saving in 4K DX12, however, since we needed to turn down MSAA in Ashes to achieve somewhat playable framerates on all the cards. This bodes well for Pascal, but I find myself wondering how well GTX 980 users are prepared for gaming’s DX12 future because from these initial tests it seems like dire news indeed. The GTX 980 Ti versus GTX 1080 equation is virtually identical here to what it was in DX11, but the new core does show some flashes of vast superiority when asked to render bandwidth-hogging DX12 workloads.

With the GTX 1080 NVIDIA has thrown up a huge bulwark against whatever AMD has coming down the pipeline, but it is numbers like these which should give Radeon users some real hope. Their DX12 performance in general is still very strong, making it evident the Fiji architecture is extremely forward thinking in the way it handles the new API’s draw calls and asynchronous workloads. Whether or not that continues into the future is anyone’s guess, but at times, in very select benchmarks, the Fury X can really power through these scenarios.

There is one major caveat with these DX12 results. Right now the API is still extremely immature and developers are still coming to grips with its various functionalities, which is why some benchmarks actually saw a performance reduction versus DX11. Both driver support and DX12’s integration into games still have a long way to go, so above all else don’t base your purchasing decision solely upon these early benchmarks, regardless of how much they favor NVIDIA’s new architecture.

So the GTX 1080 is wonderfully fast. Ludicrously fast even. It’ll make recent GTX 980 Ti buyers curse their impatience, cause GTX 980 owners to look into selling off various appendages just to buy one and it should keep AMD awake at night praying that Polaris is up to the task of competing. But regardless of how much I want to scream like a Justin Bieber fangirl at what the Pascal architecture is offering, the cynic in me realizes it isn’t perfect.

Let’s start with the already-infamous Founders Edition and its associated price, two things I haven’t really discussed until this juncture. Some items contained in this launch, like the SLI Enthusiast key, I can overlook as being potentially beneficial over the long term for certain niches, but the FE is a head scratcher.

No matter how much NVIDIA wants to play up its premium design elements and carefully selected components, the $100 additional investment required by the Founders Edition will be extremely hard to justify over whatever $599 alternatives their partners are working on. This just highlights the extreme disconnect that comes along with this whole affair; I’m already comparing the Founders Edition to unreleased, unannounced, hypothetical cards and telling you to wait before jumping onto the GTX 1080 bandwagon. Why? Because they’ll hopefully offer better performance consistency than the “reference” heatsink you pay so dearly for, and at least by that time you’ll know what the competitive landscape looks like. And let’s be introspective for a moment; unlike other Founders / Day One / Backers editions this one doesn’t come with any extras or exclusive goodies. Luckily, that blower setup will be an awesome addition for small form factor systems that live or die by temperature levels inside the chassis.

Speaking of price, for all the GTX 1080’s impressive performance benefits I’m forced to evaluate this thing as a $699 graphics card because until we see otherwise, that’s exactly what it is. The Founders Edition may very well be the only SKU available in sufficient quantities come launch day on May 27th so early adopters will have to happily chow down on that $100 “blower tax” for the chance to own one. NVIDIA knows their customers will do exactly that and they’ve priced the reference card (no, I won’t stop calling it that!) accordingly.

The GTX 1080 is clearly a superior product that completely overturns the graphics card market as we know it. While $699 will be a bitter pill to swallow for some and it may point towards a gradual uptick in the price we all pay for GPUs, there’s no denying that the GTX 1080 Founders Edition still offers phenomenal bang for your buck. Meanwhile, the $599 versions could end up being absolutely spectacular. Regardless of what you think about NVIDIA’s pricing structure you have to appreciate what they’ve accomplished: with one single finely crafted, high performance graphics core they’ve made us all lust after an inanimate but oh-so-sexy object.
