NVIDIA’s GeForce GF100 Under the Microscope

SKYMTL

HardwareCanuck Review Editor
Staff member
Joined
Feb 26, 2007
Messages
12,857
Location
Montreal



Months ago at NVIDIA's GPU Technology Conference, CEO Jen-Hsun Huang announced the upcoming Fermi architecture to the world. Due to the nature of that conference, very few details, if any, were leaked regarding the architecture's performance in the one area where NVIDIA's roots lie: 3D graphics processing. All we knew at that time was that Fermi-based cards weren’t anywhere near ready for mass production and that ATI already had troops on the ground in the DX11 marketplace. Since then, ATI has continued to run away with the DX11 GPU market and things couldn’t have looked worse for Team Green. Designing a whole new architecture from the ground up takes time and NVIDIA was always the first to admit that.

If anything, NVIDIA has been extremely tight-lipped about all things Fermi ever since the GTC but from the information we received from our moles within their Santa Clara offices, things were moving along at a quick pace. New technologies were being developed to showcase the chip’s advanced capabilities, features were being toyed with, drivers were written and TSMC was pistol-whipped into shape after showing disappointing yields on early silicon. As CES rolled around, NVIDIA wanted to put most of their cards on the table and finally disclosed how they have adapted Fermi's architecture for the consumer GPU marketplace. This included selective tech demos and hardware being shown on the show floor as well as all-day “Deep Dive” briefings for select journalists being conducted behind the scenes. We attended one of these briefings and in this article we will finally shed some light on the technologies that will make the Fermi architecture hum along in games and other 3D apps.

While we have been talking about Fermi as the name for the all-encompassing architecture up to this point, it should be pointed out that there will be several subcategories as well. These include the already-announced Tesla C2000-series for HPC markets and an unnamed series of Quadro cards. The market we are most interested in for the purposes of this article is the one which holds actual 3D graphics cards and NVIDIA has finally come forth with a name for us: GF100. However, the “GF100” moniker won’t be used for the final retail cards but will be used to describe a whole range of products based off of the GeForce version of the Fermi architecture. It may not be telling much but at least it gives us an alternative to the overused “Fermi” name.

Before we really get into things, we should warn you that while there was actually a shed-load of information discussed, NVIDIA was still silent when it came to discussing price, final clock speeds, memory sizes, and availability. Nonetheless, we’ll do what we can to make some educated guesses based off of what we know and what was hinted at within our briefing.

 
The Future of Games: Attaining Geometric Realism

Today’s GPUs are interesting beasts that have been tailor-made for their associated APIs, but they are quickly brought to their knees once advanced rendering techniques are added. Basically, DX9 and DX10 were all about pixel shading horsepower over all else and current GPUs excel in this department, but DX11 is another matter altogether. It adds features such as tessellation, which doesn’t necessarily increase the amount of shading horsepower needed but adds new options for geometric realism.

One of the main issues with past generations is that while there was serious headway made when it came to increasing shading power, the same couldn’t be said for geometry performance. NVIDIA gave an interesting example by stating that the GT200 had about 150 times the amount of peak shading performance as their FX5800 but less than three times the amount of geometry performance.

So let’s take a look at what it takes for geometric realism.


If you look at the picture above from Far Cry 2, it’s apparent that the textures and shadows are extremely detailed as is befitting for today’s higher-end DX9 and DX10 architecture. However, upon closer inspection there are several things about the scene that could be done a bit better through increased geometric detail. Surfaces like the gun holster and the woman’s shoulder that should be mapped smoothly look slightly jagged while characters normally have their hair covered...or their hair just looks like a helmet. The problem lies in the fact that a large amount of geometric detail is needed to accurately model these items and current GPUs don’t have the horsepower to do that. As such, developers’ hands are tied even if they wanted to add higher levels of detail.


Non tessellated images on left, DX11 tessellation on right

So what kind of tools do developers have at their fingertips in order to increase the level of geometric detail while not overly impacting performance? To begin with, DX11 offers several paths which can be used for heightened yet efficient rendering of geometric detail while offering several similarities with past APIs to decrease the learning curve.

Notably, tessellation can increase geometric detail as you can see in the images above while adding new three-dimensionality to objects. If you are fortunate enough to have a card capable of using the Unigine program in DX11 mode, you will be able to see how displacement mapping combined with tessellation allows for some structural occlusion and increased detail levels. As such, shadows can become a reflection of the geometry and can move around dynamically as the light source moves.

While software tessellation on the CPU is possible, the information still has to be shipped over the PCI-E bus before making its way to the GPU. This is an inefficient and time-consuming process that will slow a system down to a crawl in fairly short order. DX11 on the other hand offers computational efficiency by moving all of the operations onto the GPU itself.

Another tool in the DX11 developers’ bag is a truly dynamic level of detail (LoD) that increases the detail as you approach a given object. Dynamic LoD saves on resources since the system isn’t forced to render several high-detail objects within a scene at the same time and gives priority to objects closer to the viewer.
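As a rough illustration of distance-based LoD, the sketch below picks a tessellation factor that falls off as an object moves away from the viewer. The function name, the near and far distances and the linear falloff are all illustrative assumptions rather than anything NVIDIA or Microsoft described; the only fixed value is the DX11 maximum tessellation factor of 64.

```python
MAX_TESS_FACTOR = 64.0  # upper limit on tessellation factors in DX11
MIN_TESS_FACTOR = 1.0   # a factor of 1 leaves the patch unsubdivided


def tess_factor(distance, near=5.0, far=100.0):
    """Hypothetical LoD rule: full detail up close, none far away."""
    if distance <= near:
        return MAX_TESS_FACTOR
    if distance >= far:
        return MIN_TESS_FACTOR
    # Linear blend between the two limits (an assumed falloff curve).
    t = (distance - near) / (far - near)
    return MAX_TESS_FACTOR + t * (MIN_TESS_FACTOR - MAX_TESS_FACTOR)
```

A real engine would more likely base the factor on projected screen-space edge length, but the principle of spending triangles only where the viewer can see them is the same.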


Certain compute aspects of DX11 can also be added to tessellated scenes to give a sense of realistic movement without the need for vast CPU resources. This is where DirectCompute comes into the equation since it allows animations (among other things) to be done directly on the GPU in concert with rendering. Both hair and water were extremely hard to render within past APIs but the images you see above were rendered in real-time on GF100 hardware through the use of tessellation, DirectCompute and selective geometry shading.

Even though developers are a huge part of the process, the real trick is to make an architecture that allows all of these operations to be done efficiently and without the resource-hogging overhead of today’s GPUs and APIs. As we get further into the technological aspects of the GF100 architecture, you will begin to see how NVIDIA has made effective use of the tools within DX11 and has built an architecture from the ground up for this API.
 

In-Depth GF100 Architecture Analysis (Core Layout)

The first stop on this whirlwind tour of the GeForce GF100 is an in-depth look at what makes the GPU tick as we peer into the core layout and how NVIDIA has designed this to be the fastest graphics core on the planet.

Many people incorrectly believed that the Fermi architecture was primarily designed for GPU computing applications and very little thought was given to the graphics processing capabilities. This couldn’t be further from the truth since the computing and graphics capabilities were determined in parallel and the result is a brand new architecture tailor made to live in a DX11 environment. Basically, NVIDIA needed to apply what they had learned from past generations (G80 & GT200) to the GF100.


What you are looking at above is the heart and soul of any GF100 card: the core layout. While we will go into each section in a little more detail below, from the overall view we can see that the main functions are broken up into four distinct groups called Graphics Processing Clusters or GPCs which are then broken down again into individual Streaming Multiprocessors (SMs), raster engines and so on. To make matters simple, think of it this way: in its highest-end form, a GF100 will have four GPCs, each of which is equipped with four SMs for a total of 16 SMs broken up into groups of four. Within each of these SMs are 32 CUDA Cores (the shader processors of past generations) for a total of 512 cores.
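The totals quoted above follow directly from the hierarchy. A minimal sketch of the arithmetic, with constant names chosen purely for illustration:

```python
GPCS = 4            # Graphics Processing Clusters on a full GF100
SMS_PER_GPC = 4     # Streaming Multiprocessors in each GPC
CORES_PER_SM = 32   # CUDA cores in each SM

total_sms = GPCS * SMS_PER_GPC          # SMs across the whole chip
total_cores = total_sms * CORES_PER_SM  # CUDA cores across the whole chip

print(total_sms, total_cores)
```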

On the periphery of the die is the GigaThread Engine along with the memory controllers. The GigaThread Engine performs the somewhat thankless duty of reading the CPU’s commands over the host interface and then fetching data from the system’s main memory. The data is then copied over onto the framebuffer of the graphics card itself before being passed along to the designated engine within the core. Meanwhile, the GF100 incorporates a total of six 64-bit GDDR5 memory controllers for a combined 384-bit interface. The massive amount of bandwidth created by a 384-bit GDDR5 memory interface should provide extremely fast access to the card’s onboard memory and ease the bottlenecks seen in past generations.
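Since NVIDIA has not disclosed memory clocks, actual bandwidth cannot be stated yet, but the relationship between bus width and peak bandwidth is simple. The sketch below shows the formula; the 4.0 Gbps per-pin data rate is a placeholder assumption used only to exercise the function, not a GF100 specification.

```python
def peak_bandwidth_gb_s(controllers, bits_per_controller, gbps_per_pin):
    """Peak bandwidth in GB/s: bus width in bytes times per-pin data rate."""
    bus_width_bits = controllers * bits_per_controller
    return bus_width_bits / 8 * gbps_per_pin


# Six 64-bit controllers give the 384-bit bus described above.
ASSUMED_GBPS_PER_PIN = 4.0  # placeholder value, NOT a disclosed GF100 figure
print(peak_bandwidth_gb_s(6, 64, ASSUMED_GBPS_PER_PIN))
```

Whatever the final data rate turns out to be, each 64-bit controller removed from a scaled-down part takes one sixth of the peak bandwidth with it.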


Each Streaming Multiprocessor holds 32 CUDA cores along with 16 load / store units which allow memory operations for 16 threads to be processed per clock. Above these we see the Warp Schedulers along with their associated dispatch units, which issue groups of 32 concurrent threads (called warps) to the cores.

Finally, closer to the bottom of the SM there is the L1 cache, the PolyMorph Engine and the four texture units. In total, the maximum number of texture units in this architecture is 64, which may come as a surprise considering the outgoing GT200 architecture supported up to 80 TMUs. However, NVIDIA has implemented a number of improvements to the way the architecture handles textures which we will go into in a later section. Suffice to say that the texture units are now integrated into the SM, so multiple SMs no longer address a common texture cache.


Independent of the SM structure are six dedicated partitions of eight ROP units each for a total of 48 ROPs, as opposed to the 32 units of the GT200 architecture. Also different from the GT200 layout is that instead of backing directly onto the memory bus, the ROPs interface with the shared L2 cache which provides a quick interface for data storage.
 

Efficiency Through Caching

If there is one thing that was constantly drilled into our heads by NVIDIA during the sessions it was the benefits of having dedicated L1 and L2 caches. There are benefits to this approach not only when it comes to GPGPU computing but also for storing draw calls so they are not passed off to the memory on the graphics card. This is supposed to drastically streamline rendering efficiency, especially in situations with a lot of higher-level geometry.


Above we have an enlarged section of the cache and memory layout within each SM. To put things into perspective, an SM has 64 KB of configurable on-chip memory that can be set up in one of two ways. It can either be laid out as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. However, when used for graphics processing as opposed to GPGPU functions, the SM will make use of the 16 KB L1 cache configuration. This L1 cache is supposed to help with access to the L2 cache as well as streamlining functions like stack operations and global loads / stores.
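The two layouts can be written down as a small lookup table. This is only a sketch of the split described above; the mode names are made up for illustration and are not NVIDIA's terminology.

```python
ON_CHIP_KB = 64  # configurable on-chip memory per SM

# The two splits, as (shared memory KB, L1 cache KB). Mode names are hypothetical.
CONFIGS = {
    "prefer_shared": (48, 16),
    "prefer_l1": (16, 48),
}


def l1_size_kb(mode):
    """Return the L1 cache size for a given split, checking it adds up to 64 KB."""
    shared_kb, l1_kb = CONFIGS[mode]
    assert shared_kb + l1_kb == ON_CHIP_KB
    return l1_kb


GRAPHICS_MODE = "prefer_shared"  # graphics work runs with the 16 KB L1 split
print(l1_size_kb(GRAPHICS_MODE))
```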

In addition, each texture unit now has its own high efficiency cache as well which helps with rendering speed.


Through the L2 cache architecture NVIDIA is able to keep most of the rendering function data like tessellation, shading and rasterizing on-die instead of going to the framebuffer (DRAM) which would slow down the process. Caching amplifies the GPU’s effective bandwidth and alleviates the memory bottlenecks which normally occur when doing multiple reads and writes to the framebuffer. In total, the GF100 has 768KB of L2 cache which is dynamically load balanced for peak efficiency.

It is also possible for the L1 and L2 cache to do loads and stores to memory and pass data from engine to engine so nothing moves off chip. Unfortunately, one of the issues with this approach is that significant die area is taken up by doing geometry processing in a parallel and scalable way while not using DRAM bandwidth.


When compared with the new GeForce GF100, the previous architecture is inferior in every way. The GT200 only used cache for textures and featured a read-only L2 cache structure whereas the new GPU’s L2 is rewritable and caches everything from vertex data to textures to ROP data and nearly everything in between.

By contrast, with their Radeon HD 5000-series, ATI dumps all of the data from the geometry shaders to the memory and then pulls it back into the core for rasterization before output. This causes a drop in efficiency and therefore performance. Meanwhile, as we discussed before, NVIDIA is able to keep all of their functions on-die in the cache without having to introduce memory latency into the equation and hogging bandwidth.

So what does all of this mean for the end-user? Basically, it means vastly improved memory efficiency since less bandwidth is being taken up by unnecessary read and write calls. This can and will benefit the GF100 in high resolution, high IQ situations where lesser graphics cards’ framebuffers can easily become saturated.
 
A Closer Look at the Raster & PolyMorph Engines

In the last few pages you may have noticed mention of the PolyMorph and Raster engines which are used for highly parallel geometry processing operations. What NVIDIA has done is effectively grouped all of the fixed function stages into these two engines, which is one of the main reasons drastically improved geometry rendering is being touted for GF100 cards. In previous generations these functions used to be outside of the core processing stages (SMs) and NVIDIA has now brought them inside the core stages to ensure proper load balancing. This in effect will help immeasurably with tessellated scenes which feature extremely high triangle counts.


Within the PolyMorph engine there are five stages from Vertex Fetch to the Stream Output which each process data from the Streaming Multiprocessor they are associated with. The data then gets output to the Raster Engine. Contrary to past architectures which featured all of these stages in a single pipeline, the GF100 architecture does all of the calculations in a completely parallel fashion. According to our conversations with NVIDIA, their approach vastly improves triangle, tessellation, and Stream Out performance across a wide variety of applications.

In order to further speed up operations, data goes from one of 16 PolyMorph engines to another and uses the on-die cache structure for increased communication speed.


After the PolyMorph engine is done processing data, it is handed off to the Raster Engine’s three pipeline stages that pass off data from one to the next. These Raster Engines are set up to work in a completely parallel fashion across the GPU for quick processing.


Both the PolyMorph and Raster engines are distributed throughout the architecture which increases parallelism but are distributed in a different way from one another. In total, there are 16 PolyMorph engines which are incorporated into each of the SMs throughout the core while the four Raster Engines are placed at a rate of one per GPC. This setup makes for four Graphics Processing Clusters which are basically dedicated, individual GPUs within the core architecture allowing for highly parallel geometry rendering.

Now that we are done with looking at the finer details of this architecture, it’s time to see how that all translates into geometry and texture rendering. In the following pages we take a look at how the new architecture works in order to deliver the optimal performance in a DX11 environment.
 
The GF100’s Modular Architecture Scaling

When the GT200 series was released, there really wasn’t much information presented regarding how the design could be scaled down to create lower-end cards to appeal to a wide variety of price brackets. Indeed, the GT200 proved extremely hard to scale due to its inherent design properties, which is why we saw the G92 series of cards stay around for much longer than was originally planned. NVIDIA was lambasted for their constant product renames but considering the limitations of their architecture at the time, there really wasn’t much they could do.

Lessons were learned the hard way and the GF100 actually features an amazingly modular design which can be scaled down from its high-end 512SP version to a nearly infinite number of smaller, less expensive derivatives. While NVIDIA really didn’t talk much about the design’s scalability during our briefing, dozens of emails were exchanged afterwards and a clearer picture of the GF100’s family is starting to take shape. In this section we take a look at how these lower-end products can be designed and we pop in a few educated guesses here and there for good measure.


The GPCs: Where It All Starts

Before we begin, it is necessary to take a closer look at one of the four GPCs that make up a fully-endowed GF100.


By now you should all remember that the Graphics Processing Cluster is the heart of the GF100. It encompasses a quartet of Streaming Multiprocessors and a dedicated Raster Engine. Each of the SMs consists of 32 CUDA cores, four texture units, dedicated cache and a PolyMorph Engine for fixed function calculations. This means each GPC houses 128 cores and 16 texture units. According to NVIDIA, they have the option to eliminate these GPCs as needed to create other products but they are also able to do additional fine tuning as we outline below.


Within the GPC are four Streaming Multiprocessors and these too can be eliminated one by one to decrease the die size and create products at lower price points. As you eliminate each SM, 32 cores and 4 texture units are removed as well. It is also worth mentioning that due to the load balancing architecture used in the GF100, it’s possible to eliminate multiple SMs from a single GPC without impacting the Raster Engine’s parallel communication with the other engines. So in theory, one GPC can have one to four SMs while all the other GPCs have their full amount without impacting performance one bit.

So what does this mean for actual specifications of GF100 cards aside from the 512 core version? The way we look at this, additional products would theoretically be able to range from 480 core, 60 texture unit high-end cards to 32 core, 4 TMU budget-oriented products. This is assuming NVIDIA sticks to the 32 cores per SM model they currently have.

Since we want to be as realistic as possible here, we expect NVIDIA to keep some separation between their various product ranges and release GF100-based cards with an even number of SMs (two, four and so on) disabled. This could translate into products with 448 (cores) + 56 (texture), 384 + 48, 320 + 40, etc. for a wide range of possible solutions.
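The core and texture unit counts above all come from removing whole SMs. A short sketch of that arithmetic, assuming NVIDIA keeps 32 cores and four texture units per SM (the function name is illustrative):

```python
CORES_PER_SM = 32
TMUS_PER_SM = 4
MAX_SMS = 16


def config(active_sms):
    """Cores and texture units for a part with `active_sms` SMs enabled."""
    if not 1 <= active_sms <= MAX_SMS:
        raise ValueError("a GF100 derivative has between 1 and 16 SMs")
    return active_sms * CORES_PER_SM, active_sms * TMUS_PER_SM


# Disabling SMs two at a time from the full chip, as guessed above.
for disabled in (0, 2, 4, 6):
    cores, tmus = config(MAX_SMS - disabled)
    print(f"{disabled} SMs disabled: {cores} cores, {tmus} texture units")
```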


ROP, Framebuffer and Cache Scaling

You may have noticed that we haven’t discussed ROPs and memory scaling yet and that’s because these items scale independently from the GPCs.


Focusing in on the ROP, Memory and Cache array we can see that while placed relatively far apart on the block diagram, they are closely related and as such they must be scaled together. In its fullest form, the GF100 has 48 ROP units grouped into six groups of eight and each of these groups is served by 128KB of L2 cache for a total of 768KB. In addition, every ROP group has a dedicated 64-bit GDDR5 memory controller. This all translates into a pretty straightforward solution: once you eliminate a ROP group, you also have to eliminate a memory controller and 128KB of L2 cache.


Scaling of these three items happens in a linear fashion as you can see in the chart above since in the GF100 architecture, you can’t have ROPs without an associated amount of L2 cache or memory interface and vice versa. One way or another, the architecture can scale all the way down to 8 ROPs with a 64-bit memory interface.
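Because the three resources are removed together, every possible back-end configuration is determined by a single number: how many ROP partitions remain. A minimal sketch, with a hypothetical function name:

```python
ROPS_PER_PARTITION = 8
L2_KB_PER_PARTITION = 128
BUS_BITS_PER_PARTITION = 64
MAX_PARTITIONS = 6


def back_end(partitions):
    """ROPs, L2 cache and memory bus width for a given number of partitions."""
    if not 1 <= partitions <= MAX_PARTITIONS:
        raise ValueError("GF100 supports one to six ROP partitions")
    return {
        "rops": partitions * ROPS_PER_PARTITION,
        "l2_kb": partitions * L2_KB_PER_PARTITION,
        "bus_bits": partitions * BUS_BITS_PER_PARTITION,
    }


for n in range(MAX_PARTITIONS, 0, -1):
    print(n, back_end(n))
```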


A Note about Memory Sizes

Unfortunately, in our discussions NVIDIA was mum about the amount of memory we are likely to see on their flagship GF100 product so it is a bit hard to assume what the allotment would look like for scaled-down versions. What we can say is that on the demo card we were shown, there seemed to be traces for 12 memory modules around the GPU core so each of the 64-bit controllers has a pair of ICs associated with it. With 128MB modules being the preferred weapon of choice these days that could translate into 1.5GB of GDDR5 for the high-end version.

Meanwhile, the memory on lower-end versions could scale in a linear fashion as well in accordance with the elimination of a 64-bit interface with every group of ROPs that is removed. So, the possibility of a 1.25GB (1280MB), 320-bit card, a 1GB, 256-bit product and so on does exist but may or may not happen.
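These capacities are guesses, but the arithmetic behind them is easy to reproduce. The sketch below assumes two memory chips per 64-bit controller (inferred from the 12 traces on the demo card) and 128MB per chip, neither of which NVIDIA has confirmed.

```python
CHIPS_PER_CONTROLLER = 2  # inferred from 12 traces across six controllers
MB_PER_CHIP = 128         # assumed module density, not confirmed by NVIDIA


def framebuffer_mb(controllers):
    """Total memory in MB for a card with the given number of 64-bit controllers."""
    return controllers * CHIPS_PER_CONTROLLER * MB_PER_CHIP


for controllers in (6, 5, 4):
    print(f"{controllers * 64}-bit: {framebuffer_mb(controllers)}MB")
```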
 
Geometric Realism & Tessellation to the Next Level

In past pages we talked about the architectural advances that have been made in order to facilitate high-end, efficient performance. Now we get to see how all of these new features on the GPU work in concert to increase the overall performance of the GF100 in next generation games that will use tessellation and other DX11 features to attain geometric realism. NVIDIA claims up to an 8x increase in geometry processing horsepower versus their previous hardware generation.


Tessellation Performance


For those of you wondering what the tests above represent, the first three on the left show tessellation performance with an increasingly higher level of geometric complexity. The next two focus on situations that use a combination of geometry processing and DirectCompute operations to render a high-resolution scene and finally the last test is run on a direct draw call from Unigine’s tessellation engine.

We know from experience that while ATI’s 5000 series was the first on the market with DX11 compliance, it has some serious issues rendering scenes with more advanced DX11 features. NVIDIA meanwhile made an investment towards geometry performance which should vastly improve DX11 performance.


Unigine: Heaven Performance

The chart you see below is from the Unigine Heaven benchmark. It covers 60 seconds taken from the walkway section, which is the part of the benchmark with the most tessellation.


NVIDIA states quite emphatically that with their GF100 series of cards, the developer can deploy a lot of geometry (tessellation) into a scene without a huge drop in performance. With the parallel workloads being performed, this new architecture works toward uniform performance instead of having sudden drops as seen with competing solutions. As we already mentioned, the HD 5000 series isn’t tailor-made for a DX11 environment while NVIDIA’s architecture was designed from the ground up to do just that.


Geometry Shader Performance

NVIDIA showed off one more bit of information about DX11 performance which was from a Microsoft DX11 geometry shader toolkit. The results you see below were taken during the rendering of two separate passes which ask the GPU to process up to six cubemap faces in one pass.


Past APIs needed to have polygons processed on the CPU and then shipped over the PCI-E to the GPU which requires a huge amount of computing horsepower and is inefficient. DX11 allows all of the operations for higher-end functions like tessellation and geometry to stay on-chip, but NVIDIA takes things even further by ensuring the GPU core processes everything without even sending it to the GPU’s local memory.

Meanwhile, DX9 didn’t allow a developer to create geometry on the GPU, while DX10 allowed only limited geometry to be created. However, now with DX11, developers can do so even though it takes massive horsepower to process it and compute the instructions. This is where the GF100’s parallel geometry processing architecture comes into play since it allows for efficient caching of data without having it run into a memory bottleneck.
 
Improved ROP & Texture Performance

For the better part of this article we have been talking about what NVIDIA has done to increase performance in DX11 and attain that ever elusive “geometric realism”. Meanwhile, it is important to remember that ROP and Texture Unit performance also play a huge role in past, present and future games. There are several very popular current games such as Crysis and Left 4 Dead that take advantage of highly detailed textures, while ROP performance is critical for performance scaling when anti-aliasing is enabled.


ROP Performance Details

Even though the GF100 architecture does feature significantly more ROPs than the GT200 (48 versus 32), NVIDIA did much more with these units than just add more of them to the die. Each of the six groups of eight ROPs is serviced by a single dedicated 64-bit memory controller for increased efficiency, but unlike other architectures the ROPs don’t have a dedicated cache. Rather, they make use of the shared 768KB of on-die L2 cache and can each output a 32-bit integer pixel per clock, an FP16 pixel over two clocks, or an FP32 pixel over four clocks. In plain English, this means the ROPs are far more flexible than those found on the GT200 architecture.
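To put those per-ROP rates into chip-wide terms, the sketch below converts them into peak pixels per clock for the full 48-ROP configuration. These are theoretical peaks derived only from the figures above; the format labels are shorthand chosen for illustration.

```python
TOTAL_ROPS = 48

# Clocks a single ROP needs to output one pixel of each format.
CLOCKS_PER_PIXEL = {"int32": 1, "fp16": 2, "fp32": 4}


def peak_pixels_per_clock(fmt, rops=TOTAL_ROPS):
    """Theoretical chip-wide pixel output per clock for a given format."""
    return rops / CLOCKS_PER_PIXEL[fmt]


for fmt in CLOCKS_PER_PIXEL:
    print(fmt, peak_pixels_per_clock(fmt))
```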


This new and improved ROP layout and design means a drastic increase in AA performance as you can see in the slide above. Where the GT200 architecture experienced a 60% drop in performance when going from 4x to 8x AA, the GF100 shows a mere 24% falloff. This minimal drop can also be chalked up to improved framebuffer efficiency.

With the GF100, it seems that we can expect to play games with extreme IQ settings enabled without having to worry about framerates tanking.


Texture Unit Performance Details

At the beginning of this article we mentioned that the GF100 architecture actually has fewer texture units than the GT200 (64 versus 80), which when taken at face value does seem concerning, but there’s more to these GF100 texture units than first meets the eye.

First of all, let’s refresh our memory about the GT200 texture unit layout and its specifications. Basically, the older architecture had multiple SMs sharing one texture unit which caused a data bottleneck when more than one made a request at the same time. In addition, the speed of the texture units was directly tied to the core clock. The texture units on the GT200 still performed quite well in spite of these points, but they went about their jobs inefficiently.

With the GF100 architecture on the other hand, each SM has its own texture units so multiple SMs don’t have to compete for the same texture cache. In addition, these new units run asynchronously to the core clock speed and are actually designed to run significantly faster than the core itself. This means a fair amount of scalability within the way the GF100 addresses textures. Additionally, the GF100’s texture units include full support for DirectX 11’s BC6H and BC7 texture compression formats which are supposed to reduce the memory footprint of HDR textures and render targets.


What NVIDIA has set out to accomplish is the use of fewer texture units with increased per-unit performance in high texture situations. As such, even with fewer texture units, the GF100 is able to run circles around the GT200 in terms of high-level texture performance, which translates into a 60% increase in texture-only framerates for certain games.
 

Image Quality Improvements (Jittered Sampling)

Even though additional geometry could end up adding to the overall look and “feel” of a given scene, methods like tessellation and HDR lighting still require accurate filtering and sampling to achieve high rendering fidelity. For that you need custom anti-aliasing (AA) modes as well as vendor-specific anisotropic filtering (AF) sampling and everything in between. As the power of GPUs rapidly outpaces the ability of DX9 and even DX10 games to feed them with information, a new focus has been turned to image quality adjustments. These adjustments do tend to impact upon framerates but with GPUs like the GF100 there is much less of a chance that increasing IQ will result in the game becoming unplayable.


Quicker Jittered Sampling Techniques

Many of you are probably scratching your head and wondering what in the world jittered sampling is. Basically, it is a shadow sampling method that has been around since the DX9 days (maybe even prior to that) which allows for realistic, soft shadows to be mapped by the graphics hardware. Unfortunately, this method is extremely resource hungry so it hasn’t been used very often regardless of how good the shadows it produces may look.


In the picture above you can see what happens with shadows which don’t use this method of mapping. Basically, for a shadow to look good it shouldn’t have a hard, serrated edge.


Soft shadows are the way to go and while past generations of hardware were able to do jittered sampling, they just didn’t have the resources to do it efficiently. Their performance was adequate with one light source in a scene but when asked to produce soft shadows from multiple light sources (in a night scene for example), the framerate would take an unacceptably large hit. With the GF100, NVIDIA had the opportunity to vastly improve shadow rendering and they did just that.


To do quicker, more efficient jittered sampling, NVIDIA worked with Microsoft to implement hardware support for Gather4 in DX11. Instead of issuing four separate texture fetches, the hardware is now able to specify one coordinate with a set of offsets and fetch four texels with a single instruction. This will significantly improve the shadow rendering efficiency of the hardware, which is still able to execute a standard Gather4 instruction if need be.
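The idea can be mimicked on the CPU to show what changes from the shader's point of view. In the sketch below, `fetch` stands in for a single texel read and `gather4` for one instruction that returns four texels at caller-supplied offsets. Both functions, the tiny depth map and the jitter pattern are hypothetical illustrations; this emulation still loops internally, whereas the hardware performs the gather in one operation.

```python
def fetch(tex, x, y):
    """One texel per call, clamped to the texture edges."""
    h, w = len(tex), len(tex[0])
    return tex[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def gather4(tex, x, y, offsets=((0, 0), (1, 0), (0, 1), (1, 1))):
    """Single call returning the four texels at `offsets` from (x, y)."""
    return [fetch(tex, x + dx, y + dy) for dx, dy in offsets]


def shadow_fraction(depth_map, x, y, receiver_depth, offsets):
    """Fraction of the four taps that are closer to the light than the receiver."""
    taps = gather4(depth_map, x, y, offsets)
    return sum(1 for d in taps if d < receiver_depth) / len(taps)


# Hypothetical 4x4 shadow map: small depths are occluders near the light.
depth_map = [
    [0.2, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.9, 0.9],
    [0.2, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9],
]
jitter = ((-1, 0), (1, 0), (0, -1), (0, 1))  # made-up jitter pattern
print(shadow_fraction(depth_map, 1, 1, 0.5, jitter))
```

A fractional result between 0 and 1 is what produces the soft edge: pixels near the shadow boundary see some taps in shadow and some in light.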

With this feature turned on, NVIDIA expects a 200% improvement in shadow rendering performance when compared to the same scene being rendered with their hardware Gather4 turned off.
 
Image Quality Improvements (32x CSAA & TMAA)

32x CSAA Mode for Improved AA

In our opinion, the differences between the AA modes above 8x are minimal at best unless you are rendering thin items such as grass, a chain-link fence or a distant railing. With the efficiency of the DX11 API in addition to increased horsepower from cards like the GF100, it is now possible to use geometry to model vegetation and the like. However, developers will continue using the billboarding and alpha texturing methods from DX9 which allow for dynamic vegetation, but it will continue to look jagged and under-rendered. In such cases, anti-aliasing can be applied but high levels of AA are needed in order to properly render these items. This is why NVIDIA has implemented their new 32x Coverage Sample AA.


In order to accurately apply AA, three things are needed: coverage samples, color samples and levels of transparency. To put this into context, GT200 had 8 color samples and 8 coverage samples which means a total rate of 16 samples on edges. However, this allowed for only 9 levels of transparency. This led to edges which still looked jagged and lacked proper blending, so dithering was implemented to mask the banding.

The GF100 on the other hand features 24 coverage samples and 8 color samples for a total of 32 samples (hence the 32x CSAA moniker). This layout also offers 33 levels of transparency for much smoother blending of the anti-aliased edges into the background and increased performance as well.
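One way to read the 9 and 33 figures is that N samples can represent N + 1 distinct amounts of coverage, from none covered to all covered. The sketch below works under that interpretation, which is our own reading rather than NVIDIA's stated formula, and it assumes the GT200 figure is tied to its 8 color samples. The function names are illustrative.

```python
def transparency_levels(samples):
    """Distinct coverage amounts: 0 through `samples` covered samples."""
    return samples + 1


def covered_samples(alpha, samples):
    """Quantize an alpha value in [0, 1] to a number of covered samples."""
    alpha = min(max(alpha, 0.0), 1.0)
    return round(alpha * samples)


print(transparency_levels(8), transparency_levels(32))
print(covered_samples(0.5, 8), covered_samples(0.5, 32))
```

With more samples available, neighbouring alpha values land on different coverage counts instead of collapsing onto the same one, which is why the blending looks smoother.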


With increased efficiency comes decreased overhead when running complex AA routines and NVIDIA specifically designed the GF100 to cope with high IQ settings. Indeed, on average this new architecture only loses about 7% of its performance when going from 8x AA to 32x CSAA.


TMAA and CSAA: Hand in Hand

No matter how much AA you apply in DX9, there will still invariably be some issues with distant, thin objects that are less than a pixel wide due to the method older APIs use to render these. Transparency Multisample AA (TMAA) allows the DX9 API to convert shader code to effectively use alpha to coverage routines when rendering a scene. This, combined with CSAA, can greatly increase the overall image quality.


It may be hard to see in the image above but without TMAA, the railing in the distance would have its lines shimmer in and out of existence due to the fact that the DX9 API doesn’t have the tools necessary to properly process sub-single pixel items. It may not impact upon gaming but it is noticeable when moving through a level.


Since coverage samples are used as part of GF100’s TMAA evaluation, much smoother gradients are produced. TMAA will help in instances such as this railing and even with the vegetation examples we used in the last section.
 