
GPU Benchmarking Methods Investigated: Fact vs. Fiction


SKYMTL

HardwareCanuck Review Editor
Staff member
Joined
Feb 26, 2007
Messages
13,421
Location
Montreal

Benchmarks. Every website worth their marbles uses them, with varying degrees of accuracy. Meanwhile, every reader wants to recreate them in some way, shape or form in order to do exactly what their favorite publications are doing: evaluate the performance of their hardware choices and quantify their purchase. Benchmarks can also help diagnose a problem, but more often than not websites like Hardware Canucks use these tools to determine how well a given product performs against the competition. As with all things, the number of programs we can obtain results with is nearly infinite, but it is the job of publications to choose the right set of tools which will accurately convey results to the masses. Unfortunately, as we will show you in this article, choosing the right programs and sequences is extremely hard and most of the current methods are inaccurate.

The reason why we have chosen to focus on GPU benchmarking is because this really is the wild-west of the online review industry. A fortune in terms of traffic can be had if GPU reviews are published regularly but with potential traffic increases comes the risk of cutting corners in order to complete the time-consuming benchmarking portion as quickly as possible. Naturally, some time-cutting methods will still produce accurate results while others won’t.

In a general canvassing of over two dozen English-speaking tech websites we found a wide swath of benchmarks being used: from timedemos to stand-alone programs to in-game benchmarks to walkthroughs. What we also saw at times was a general lack of information beyond a game’s title regarding the actual type of benchmark used. For the most part it seemed many websites were using in-game benchmarking tools (mostly “rolling” demos) instead of actual gameplay and coming up with some interesting results. This, along with comments in several forums, got us wondering: is there a “right” way to benchmark a particular game? In addition, do these in-game or stand-alone benchmarking programs (like the recently released AvP DX11 test) represent in-game performance? If not, do they even provide an accurate enough analysis for a writer to formulate a conclusion about a given product? Well, we’re about to find out.

In this article we are going to take nine of the most popular games used by websites for GPU reviews and give you a rundown of their performance in-game and otherwise. In most cases we will be highlighting the usefulness of either stand-alone or in-game benchmarks simply because they are easily accessible to reviewers and the general public alike. There will also be some discussion about how timedemos, sample lengths and patches can affect results.

We will be using a GTX 470 and an HD 5850 for these tests in order to determine if different benchmarking methods will affect the positioning of each product. Every game was played through from start to finish (yes, this article has been a long time in the making) and we have determined a worst-case sequence as well as a more “typical” scene on which we will be basing our real-world numbers. Meanwhile, for comparison purposes we will also be testing the additional benchmarking features each of these games comes with.

Before we go on, it is important to preface this article with one statement: we aren’t looking to point fingers in any way, shape or form. Our aim is to give readers enough information so they can determine which results are accurate and which are not.

Our thanks to Tom's Hardware, AnandTech and PC Games Hardware for helping out with validating the results and methodologies for this article.

 

Testing Methodology & System Setup



Processor: Intel Core i7 920 (ES) @ 4.0GHz (Turbo Mode Enabled)
Memory: Corsair 3x2GB Dominator DDR3 1600MHz
Motherboard: Gigabyte EX58-UD5
Cooling: CoolIT Boreas mTEC + Scythe Fan Controller (Off for Power Consumption tests)
Disk Drive: Pioneer DVD Writer
Hard Drive: Western Digital Caviar Black 640GB
Power Supply: Corsair HX1000W
Monitor: Samsung 305T 30” widescreen LCD
OS: Windows 7 Ultimate N x64 SP1


Graphics Cards:

NVIDIA GTX 470 (Reference)
Sapphire HD 5850 1GB (Stock)


Drivers:

ATI 10.5 WHQL
NVIDIA 257.17 Beta



*Notes:

- All games tested have been patched to their latest version

- The OS has had all the latest hotfixes and updates installed

- All scores you see are the averages after 3 benchmark runs

- All IQ settings were adjusted in-game or by use of a config file

- All games and benchmarks were recorded with FRAPS instead of relying on any other indicators for performance.
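To illustrate the averaging step, here is a minimal sketch (not our actual tooling) of how per-run numbers pulled from a FRAPS-style frametime log might be reduced to the average and minimum framerates reported in our charts. All frametime values below are hypothetical:

```python
# Sketch of averaging FRAPS-style results over three benchmark runs.
# Frametimes are hypothetical values in milliseconds, one list per run.

def fps_stats(frametimes_ms):
    """Return (average, minimum) FPS for one recorded run."""
    total_s = sum(frametimes_ms) / 1000.0
    avg_fps = len(frametimes_ms) / total_s
    min_fps = 1000.0 / max(frametimes_ms)  # slowest single frame
    return avg_fps, min_fps

def averaged_score(runs):
    """Average the per-run statistics over all benchmark runs."""
    stats = [fps_stats(r) for r in runs]
    avg = sum(s[0] for s in stats) / len(stats)
    mn = sum(s[1] for s in stats) / len(stats)
    return avg, mn

runs = [
    [16.0, 17.5, 15.2, 16.8],   # run 1
    [16.4, 16.9, 15.8, 17.1],   # run 2
    [15.9, 17.2, 16.1, 16.6],   # run 3
]
avg, mn = averaged_score(runs)
print(f"average: {avg:.1f} FPS, minimum: {mn:.1f} FPS")
```

The point of running three passes and averaging is simply to smooth out run-to-run variance before any two cards are compared.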
 
In-Game Benchmarks: Batman AA / DIRT 2

Batman: Arkham Asylum


Batman: AA is a game we have been seeing a lot of in benchmarks since its release. This is likely due to its in-game benchmarking tool’s quick sequence that spits out a result in under a minute.

For this benchmark, we are comparing two game sequences to the in game benchmark. These two sequences are as follows:

Typical: Challenge Mode – 1st Challenge; ~3 mins. length
Worst Case: Botanical Gardens; ~3 mins. length



Since it only uses a simple floating camera method, we can see that the in-game benchmarking tool isn’t at all representative of a real-world gameplay situation.

When playing the game there are usually several high-detail character models on-screen at the same time, which typically adds to the stress put on the GPU. Since the benchmarking tool largely lacks those elements, its results are abnormally high and will likely give users the wrong impression regarding which products are able to achieve playable framerates. With that being said, it does use in-game levels so the actual separation percentage between the NVIDIA and ATI cards remains similar to an actual gameplay sequence.


DiRT 2


DiRT 2 is one of the few games in this article which uses a benchmarking sequence depicting actual gameplay. The setting is the demanding, close-quarters London track (Battersea: Air) that features everything from waving crowds to water to sunlight refraction.

For this benchmark, we are comparing two game sequences to the stand-alone and in-game benchmark. These two sequences are as follows:

Typical: Utah Trail Blazer
Worst Case: Battersea: Air – One lap



In our opinion, the in-game benchmark is highly accurate and actually one of the best we have ever seen in terms of consistent results. Granted, if you are a reasonably good player you will be far out in front of the competition, so performance will likely be higher without five or six cars in front of you, but that’s why we consider the benchmark a worst-case yet realistic gameplay scenario.

On the other hand, the stand-alone benchmark is a bit of a disaster since the only way to force DX11 mode is through a customized config file and the performance on certain cards is far from optimal. One of the reasons behind this shortfall is that game updates are not being carried over into the stand-alone benchmark’s engine. This results in performance which is far off base from what you would experience in the game itself.
 
In-Game Benchmarks: Far Cry 2 / HawX

Far Cry 2


Far Cry 2 is one of those games that seem to have stuck around through thick and thin within benchmarking circles simply because it makes excellent use of hardware by displaying stunning environments while packing a robust benchmarking tool. The tool which is built into the game allows for standard fly-throughs to be run as well as the option to upload your own custom demo.

The “worst case” scenario we took is an assault on a village of enemies using our trusty flamethrower and AK-47 to wreak havoc. Meanwhile the “typical” scene we used incorporates exactly what you will be doing for three quarters of this game: combating a few enemies amid rolling fields of grass.



The built-in benchmarking tool is just what the doctor ordered in terms of usability but as you can see, the results from the three standard benchmarking scenes vary wildly from one to the next. In our books, Ranch Long is the most accurate due to its longer sampling time but we will stick to our custom playback. One thing that is good to see is that neither NVIDIA nor ATI seems to have any additional optimizations running for the built-in benchmarks that aren’t available in the game itself.


HawX


This is one game which has always presented us with a bit of an issue when it comes to benchmarking. While HawX does include its own built-in benchmark, there is such a wide variety of levels in the game that replicating overall performance with a single test is next to impossible.

The benchmark that’s included has the player flying over Rio de Janeiro as rebels attack from every possible corner. This is actually taken directly from one of the game’s first missions and along with later levels in Tokyo Bay and Chicago represents one of the most demanding missions possible.

Here is what we used for a comparison against the in-game benchmark:

“Typical” Gameplay: Appalachian Mission; ~3 mins flying + combat
“Worst Case” Gameplay: Rio de Janeiro; ~3 mins flying + combat



As we have already seen, there are good in-game benchmarks out there but there are some pointless ones as well. We have to give the HawX team some serious credit because their benchmark replicates the actual mission it is based on extremely well. The result is an application that both users and technical writers can use with confidence when it comes to determining GPU performance. This goes to show that an in-game benchmark built from actual gameplay can pay off in spades, even if it portrays a scenario with slightly more action than the game itself typically throws at you.
 
In-Game Benchmarks: Just Cause 2 / Resident Evil 5

Just Cause 2


Just Cause 2 may be one of the newer games on this list and it uses the DX10 API to good effect, which of course makes it quite the challenge for today’s graphics cards. It also incorporates three built-in benchmarks that showcase different parts of the game. Unfortunately, none of these benchmarks is based on realistic in-game situations as they use the usual flythrough method without any characters or vehicles on the screen.

With that being said, finding a typical benchmarking area in Just Cause 2 is quite easy because of the game’s sandbox nature resulting in a totally open world for you to explore without needing to proceed in a linear fashion through the missions. Here are the two areas we picked for comparison against the included benchmark scenes.

“Typical” Gameplay: Panau Village (assault on soldiers in town); ~4 mins length
“Worst Case” Gameplay: Casino Assault (getaway scene on car’s roof); ~4 mins length

Naturally, all NVIDIA-specific visual options like CUDA water and the Bokeh Filter were disabled.



While the “Dark Tower” benchmark does come close to the overall results from our in-game testing, it still doesn’t represent real-world performance in our books simply because it uses a flythrough. Nonetheless, it is far from the worst offender. The other in-game benchmarks are simply out to lunch when it comes to portraying gameplay and should pretty much be ignored if you want accurate results.

What frustrates us the most about the Just Cause 2 in-game benchmark situation is the fact that the team at Square Enix could have easily used a gameplay sequence as one of the scenes. Instead, they chose to go with three “eye candy” situations that do nothing other than look pretty.


Resident Evil 5


When NVIDIA released their “Big Bang” drivers last year, Resident Evil 5 was given as one of the shining examples of a new game that benefited from performance increases. To editorialize a bit, I personally think the game itself is horrible, but it still looks great, uses the DX10 API and provides a reasonably good workout for today’s GPUs.

To make matters even better for would-be reviewers, Capcom included two in-game benchmark sequences which are supposed to give users an idea of performance. One of these is what is called a “Fixed” benchmark that uses a single free-camera scene without any gameplay while the other “Variable” benchmark incorporates four scenes of which three are actual gameplay.

For the sake of comparison, we will be using two different gameplay sequences:

“Typical”: Level 5-2; mostly walking through streets with a few enemies
“Worst Case”: Level 2-1; Bridge scene with an out of control truck, explosions and numerous enemies



The Variable benchmark does look like it gives an almost-accurate representation of gameplay performance but the chart above doesn’t tell the whole truth. In the game itself, there is a wide gap between the NVIDIA and ATI cards while the benchmarks result in very similar performance between the two solutions.
 
Stand-Alone Benchmarks: AvP / STALKER CoP

Aliens vs. Predator


A few weeks ago, Rebellion released a stand-alone DX11 benchmarking tool for this game: a short but highly tessellated sequence rendered with the in-game engine. Many have reported shockingly low performance but that’s to be expected considering the scene used.

For this benchmark, we are comparing two game sequences to the stand-alone benchmark. These two are as follows:

“Typical” Gameplay: Predator campaign – Swamp mission. ~4 mins length
“Worst Case” Gameplay: Marine campaign – Refinery mission. ~4 mins length

All of the graphics settings were directly modified in the benchmark’s and game’s config file so they perfectly mirrored one another.



The results here exemplify the reasons why stand-alone apps sometimes can’t tell the whole story. After playing through the whole game with every species, we couldn’t find anything more than a few scattered 3 to 5 second segments that even come close to the pressure this benchmark puts upon your graphics card.

In addition, while the game itself shows a reasonably wide space between the ATI and NVIDIA cards, the benchmark has them running neck and neck. In our opinion this gives a seriously wrong impression about how a given product will perform when actually playing Aliens versus Predator.


STALKER: Call of Pripyat


When it was first released, the latest STALKER game was hailed for its use of DX11, but people quickly realized that, for all its bluster, the graphics were decidedly mediocre at the best of times. Nonetheless, the stand-alone benchmark that was released prior to the game’s North American debut is still used by many people due to its comprehensive interface and seemingly consistent results.

Within the stand-alone benchmark, there are four separate sections which are each supposed to represent a different time of day in the game world. In order to compare in-game performance to these, we used two of the same situations the benchmark uses in a single run: daytime conditions with a few sunshafts thrown in at dawn. Our “Typical” sequence used an indoor / outdoor combination in the wastes while the “Worst Case” scenario was done in the city of Pripyat. Both tests were done with a typical three-minute walkthrough including some combat sequences.



Unfortunately, due to several differences between the benchmark’s options and those in the game itself, we had to replicate the settings as best we could. We are however confident that maxing out the settings in both applications resulted in what should have been aligned performance.

It looks like once again we have a situation where the stand-alone benchmark is not telling the whole truth about our tested cards. Somehow, it shows the HD 5850 clearly beating the GTX 470 in the minimum framerate department in some cases but even after several tries, we couldn’t replicate this result in the game itself. Nearly the same observation can be made about the HD 5850’s average framerates since the benchmark showed a massive gap between it and the GTX 470. In reality and in-game, the difference between the two cards was actually quite small.

In our opinion, the STALKER stand-alone benchmark is a welcome tool but not something that should be used for comparative articles where conclusions are usually based upon how well a card performs WITHIN a game.
 

A Case for Timedemos


The ability to record and play back a timedemo is something many current games lack, and games which still offer this feature are becoming extremely rare. Timedemos offer reviewers a means to use repeatable in-game sequences without having to resort to sometimes-inaccurate run-throughs or (as we saw in the previous pages) play Russian Roulette with a stand-alone benchmark.

There has been some criticism levelled at timedemos for their supposed inability to accurately depict a gameplay situation due to the fact they run as fast as the framerate will allow them to. Does this perceived limitation have any impact upon the results they generate versus a true gameplay runthrough? In order to find out, we took Left 4 Dead 2 and Far Cry 2’s timedemo functions out for a test run.

In L4D2 we used our usual benchmarking sequence in the Atrium level while Far Cry 2 used our Village Assault sequence. In order to determine accuracy, on the first run-through we recorded the timedemo while on the second time around FRAPS was used to record the framerates. FRAPS was also used to determine framerates of the timedemo playback. Remember, these two sequences have been used for over two dozen reviews at multiple resolutions so we are well versed in ensuring one run to the next is virtually identical.
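The comparison itself boils down to simple arithmetic. The sketch below (with made-up framerates, not our measured results) shows how the relative gap between a timedemo playback and a live FRAPS run can be computed:

```python
# Sketch only: percent_diff is generic, and the sample values below are
# hypothetical placeholders, not the article's measured numbers.

def percent_diff(timedemo_fps, gameplay_fps):
    """Relative gap of the timedemo playback versus the live FRAPS run."""
    return 100.0 * (timedemo_fps - gameplay_fps) / gameplay_fps

samples = {
    "Far Cry 2": (62.4, 61.8),       # (timedemo avg FPS, live avg FPS)
    "Left 4 Dead 2": (88.1, 90.3),
}
for game, (demo, live) in samples.items():
    print(f"{game}: {percent_diff(demo, live):+.1f}% vs. live gameplay")
```

A gap of only a few percent either way is what we would consider "accurate playback" for a timedemo.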


The difference between the timedemo and the actual gameplay sequence in Far Cry 2 is marginal at best, but then again, we have already shouted from the rooftops how good its built-in benchmarking tool is. For a bit of a different perspective, let’s move on to Left 4 Dead 2.


We see a bit more variation in this comparison than we did in Far Cry 2 but once again we don’t think a sub-3% difference is any reason to totally discount a benchmarking method. For all intents and purposes it looks like the few games that support timedemos provide accurate playbacks.

One of the main issues we have with timedemos is that so few games actually support them. Not only do they make life a whole lot easier (and more repeatable) from a reviewing perspective but they also allow publications such as Hardware Canucks to share our timedemos with our readers. This sharing could add a whole new dimension of transparency to online reviewing and help consumers make the right decision when it comes to buying products. We just hope more games will support this feature in the near future.
 

Patches: A Gamer’s Best Friend, A Reviewer’s Nightmare


When a developer updates their game, people usually rejoice in the fact that some bugs might actually be fixed. What many may not realize is that some pretty significant performance improvements may also be wrapped into a game update.

In the past, consumers usually needed to search high and low for game updates. Now applications like Valve’s Steam are providing services whereby games can be automatically updated in an easy and usually non-intrusive way. Unfortunately, many a reviewer out there will completely disable any updates (don’t deny it guys, you know you do) in order to avoid rebenching games every time a new update is released. We’ll even admit to doing this up until around 12 months ago when we finally saw the light so to speak.

Below, we have two case studies dealing with performance improvements we have experienced over the last few months. Both games were benchmarked using the same methods used elsewhere in this article and the non-Steam versions were used so we could apply the patches manually.


Aliens versus Predator



AvP has undergone a number of updates throughout the time it has been on the market and every single one of them has delivered some form of performance improvement. The first patch was actually rolled out on the day of the game’s February release and instituted a massive number of performance improvements as you can see above. For the sake of this article, we will assume everyone has this patch installed (hopefully) and use its performance as a baseline.


The second patch for this game was released about a month later in March 2010 and once again incorporated additional performance improvements when tessellation was enabled. Rolled into this update was a welcome addition for NVIDIA DX11 card users since it fixed a bug where the game was excessively dark.


The final patch was rolled out just a few weeks ago and once again makes some small adjustments to DX11 tessellation performance while fixing some rendering issues on NVIDIA’s latest cards. So what is the result of these last two updates?


Even though it isn’t a massive difference, the patches that have been applied to AvP definitely had an impact upon performance. The most interesting aspect was that prior to the two latest patches, both the HD 5850 and GTX 470 had similar minimum framerates, but it looks like the numerous DX11 optimizations made by the developers have benefited NVIDIA’s architecture more than ATI’s (at least in our sequence).


DiRT 2


DiRT 2 isn’t a game that has been issued a large number of patches but the one which was rolled out back in March of this year encompassed some major changes and optimizations.


One of the main concerns when DiRT 2 was first released was the massive performance hit when anything above Low shadow quality was enabled on even mid-range GPUs. Even higher end DX11 capable GPUs like the HD 5800 series had issues with High shadow settings. This was addressed with the first patch which allowed some serious performance increases at higher shadow settings (we use Ultra for all benchmarks) along with better multi core CPU support and a few other touches here and there.


The performance increases when using the newer patch are nothing short of shocking with both cards pulling increases of greater than 10%.


As we saw above, benchmarking with the latest patch is a must especially in a time when it seems game developers are able to improve performance of their own games at a quicker rate than NVIDIA and ATI do so with driver revisions. DiRT 2 in particular provided us with a perfect case study to show how a few simple patches could literally change one’s perception of a given GPU.

In addition, one of the most important things to remember is that if comparison testing is done, ALL graphics cards should be benchmarked using the same version of the game engine. If not, there is a very real risk of having completely skewed results.


 

The Importance of Sampling Length


The reasoning behind sample length is simple: get an accurate performance representation by logging a suitable number of data points. Unfortunately, choosing a sample length is tricky when games can offer dozens of hours’ worth of play time and levels. The trick is to find benchmarkable levels with a good balance between a worst-case scenario and a typical gameplay sequence. There is a problem even with this approach since many have decided to use a mere fifteen to thirty seconds of gameplay or a built-in benchmark for their comparisons. The results are shown below.


Aliens Versus Predator





The first example we are taking is part of our usual gameplay sequence and the results pretty much speak for themselves. Using a sample length of 30 seconds or less results in performance numbers that are relatively close when it comes to the difference between the two cards in average framerates. Unfortunately, this whole “use the first 30 seconds of a level” approach doesn’t tell the whole truth in terms of minimum framerates.

Looking closer at the frametime graph, there are quite obviously many 30 second chunks where performance is completely at odds with the final result.
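To see why short chunks mislead, here is a rough sketch (using synthetic frametimes, not our logged data) that slices a single long frametime capture into 30-second windows and compares each window's average FPS against the full-run figure:

```python
# Sketch: slice one long FRAPS-style frametime log into 30-second chunks
# and compare each chunk's average FPS against the full-run average.

def chunk_avgs(frametimes_ms, window_s=30.0):
    """Average FPS of each complete 30-second window in the log."""
    chunks, cur_frames, cur_ms = [], 0, 0.0
    for ft in frametimes_ms:
        cur_frames += 1
        cur_ms += ft
        if cur_ms >= window_s * 1000.0:
            chunks.append(cur_frames / (cur_ms / 1000.0))
            cur_frames, cur_ms = 0, 0.0
    return chunks

# Synthetic log: ~30 s at 60 FPS, then a heavy ~30 s stretch near 30 FPS.
log = [16.667] * 1800 + [33.4] * 900
full_avg = len(log) / (sum(log) / 1000.0)
print("per-chunk FPS:", [round(c, 1) for c in chunk_avgs(log)])
print("full-run FPS:", round(full_avg, 1))
```

Depending on which 30-second window a reviewer happens to capture, the same run reports roughly 60 FPS or roughly 30 FPS, while the full run averages out in between. That is exactly the trap a sub-30-second sample falls into.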


Just Cause 2





The Dark Tower benchmark gives us an idea of what would happen if someone only used the first few dozen seconds of a built-in benchmark. We see an issue where the two cards may be performing similarly in one (supposedly CPU limited) section of water physics calculations but then later in the test, the more intense graphics processing allows the separation to grow significantly.


Obviously, proper sampling length is a must. All too many times we have seen sub-30 second snippets being used for benchmarking purposes and this is worrying. Which would you trust more: a survey done with a mere 10 people or one done with a 1,000-person cross-section? The simple fact of the matter is that 30 seconds just doesn’t give enough gameplay time to determine performance or even log an accurate minimum framerate.
 

Final Thoughts; Lessons Learned


So here we are, after more pages than this article had any right to be, and believe it or not our conclusion is actually pretty straightforward: there is no “perfect” way to benchmark a game. There are just too many variables that can affect the outcome and it is impossible to take all of them into account. However, we think we’ve proven there are methods and methodologies which should be avoided at all costs in order to ensure accurate results.

Let’s begin with the most popular way of benchmarking games: in-game or stand-alone rolling benchmarks that spit out a result with minimal user involvement. In general, they don’t mean jack all when it comes to determining a GPU’s performance within the game itself. Stand-alone benchmarks are heavy offenders for a number of reasons, including a lack of patches and sequences which aren’t at all representative of in-game conditions. In-game rolling benchmarks receive all of the necessary game engine patches but more often than not still fall short when it comes to displaying actual gameplay. However, there is currently a small number of games like DiRT 2 and HawX which incorporate benchmark sequences that accurately recreate in-game scenarios. In this category we believe stand-alone benchmarks should be avoided altogether while in-game benchmarks should only be used if they represent actual gameplay and don’t include a “flythrough”.

Next up we have timedemos. For the most part we have seen accurate results when timedemos are compared to in-game sequences and it is a shame we are seeing fewer and fewer games with the ability to record and play back these sequences. One of the most important aspects of timedemos is their ability to accurately repeat the exact same sequence over and over again. This is invaluable for benchmarking purposes since even a manual run-through can’t be exactly repeated every time, regardless of what some would have you think.

The most important thing about accurately benchmarking a game comes down to one word: research. It isn’t often that the first level or in-game benchmark will give an accurate representation of GPU performance, and using one of these aforementioned methods could lead a writer who takes the easy way out to the wrong conclusion. Knowing the game one is testing through actual playing time is essential. No one should be basing a conclusion on games they aren’t totally familiar with because, as we have seen, the risk of projecting the wrong information through incorrect benchmarks is very high indeed. Many wrongly think it is fine to load up a few stand-alone benchmarks, hope the results line up with gameplay and be done with it.

This all boils down to one thing: transparency. There are too many times where publications will throw up a GPU review while making no mention of the levels being used or even whether their results are from a built-in or stand alone benchmark. A conclusion should never be based upon results gleaned from stand-alone benchmarks while in-game benchmarks should only be used if their results line up with those from an actual gameplay sequence. This is why we believe that it is imperative publications state exactly what tools they are using for their benchmarks.

To counteract this air of “secrecy” that we ourselves used to exemplify, we have launched our Guide to the Hardware Canucks GPU Benchmarking Process. We suggest you check it out since it takes the lessons learned throughout the course of this article and opens up our benchmarking process to the public.

If anything, we hope this article allows you to look at reviews and benchmarks in general with a more critical eye. With the ability to influence the buying decisions of consumers, publications need to invest the time necessary to ensure their readers are getting the best possible information. Some of these methods may take a lot of time, but in the highly controversial world of graphics card reviewing, things need to be done right and discussed openly.



 