
AMD Kaveri A10-7850K & A8-7600 Review

SKYMTL

HardwareCanuck Review Editor
Staff member
Joined
Feb 26, 2007
Messages
12,857
Location
Montreal
AMD has been talking about Heterogeneous System Architecture, or HSA, for what seems like ages now but, with the launch of their Kaveri APUs, those plans are finally coming to fruition. Kaveri doesn’t represent a dramatic departure from previous generations, though. It is simply another stepping stone, albeit a significant one, towards what AMD hopes will be a user and developer environment which embraces their approach.

In order to understand what makes Kaveri special, learning the basics of HSA is essential. In a nutshell HSA is an effort to leverage the potential CPU and graphics horsepower within AMD’s APUs by properly routing parallel and serial workloads towards the resources best able to process them. You see, the x86 cores excel in serial and task parallel scenarios while data parallel workloads can be handled much more efficiently by the GPU’s multiple compute cores. Since an Accelerated Processing Unit combines both x86 cores and a dedicated graphics subsystem, it’s perfectly suited for both situations. The challenge has always resided in developing a synergy between these two seemingly disparate elements. That’s where Kaveri comes into the equation.


Kaveri is based on GlobalFoundries’ new 28nm Super High Performance (SHP) process node which represents a dramatic shift away from the 32nm APUs of Trinity and Richland. The move to what AMD calls an “APU optimized” 28nm process has allowed them to retain the dual module, quad core x86 computing layout while significantly expanding Kaveri’s feature sets and incorporating a more capable GPU through higher transistor density. All told, one of these new APUs will weigh in at 2.41 billion transistors spread across a die area of 245mm² delivering up to 856 GFLOPS of combined performance.
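As a back-of-envelope check, that 856 GFLOPS figure works out as the sum of peak CPU and GPU throughput. The sketch below assumes a 720MHz GPU clock, two FLOPs per shader per cycle (one FMA) and eight FLOPs per x86 core per cycle at the 3.7GHz base clock; none of those per-cycle figures come from AMD's materials quoted here, so treat them as illustrative.

```python
# Rough reconstruction of the "up to 856 GFLOPS" claim for the A10-7850K.
# Assumed (not from the article): 720 MHz GPU clock, 2 FLOPs per shader
# per cycle, 8 FLOPs per CPU core per cycle at the 3.7 GHz base clock.
CPU_CORES, CPU_GHZ, CPU_FLOPS_PER_CYCLE = 4, 3.7, 8
GPU_SHADERS, GPU_GHZ, GPU_FLOPS_PER_CYCLE = 512, 0.720, 2

cpu_gflops = CPU_CORES * CPU_GHZ * CPU_FLOPS_PER_CYCLE
gpu_gflops = GPU_SHADERS * GPU_GHZ * GPU_FLOPS_PER_CYCLE
print(round(cpu_gflops + gpu_gflops, 1))  # lands within a GFLOP of 856
```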

Unfortunately, 28nm SHP has some shortcomings. It is a relatively immature process node, so leakage is slightly above where AMD wanted it to be. This meant sacrificing clock speeds at higher TDP levels, but its effect on mid-range and entry-level parts should be minimal.

One benefit of AMD’s 28nm approach is that it avoids the stacked transistors of Intel’s 3D lithography technology. While this does tend to increase the die area of these APUs versus their Haswell competitors, heat dispersion will be much easier, which should lead to more efficient cooling with today’s heatsinks.

What you really need to know about Kaveri boils down to two words: Steamroller and GCN. Steamroller represents the latest iteration of AMD’s Bulldozer microarchitecture and includes several optimizations for better concurrent data throughput. Don’t expect titanic improvements over Piledriver but it does bring some much-needed single thread performance boosts to the table.

The addition of GCN, or Graphics Core Next, is a key element here, and not only because the GPU component takes up a whopping 47% of Kaveri’s available die space. It boasts significant benefits over the older VLIW4-based cores within Trinity and Richland and in this iteration even incorporates the “GCN 2.0” features we saw on the new R9 290-series Hawaii GPUs.


In keeping with current market trends, AMD hasn’t designed Kaveri for enthusiasts but rather targets segments that have further reaching impact. This means their primary design goals revolved around three core principles: create a winning notebook solution, deliver optimal performance per watt and implement a solution that can scale as necessary into other segments. As we’ve already mentioned, this simply fits with the APU’s current evolutionary process.

While history has shown that designing a one-size-fits-all solution is challenging, several of the new desktop parts seem to show many of these targets have already been achieved. If anything, Kaveri will have serious implications as its design trickles down into the notebook and ultra mobile spaces where its power efficiency can be used to the fullest effect.


With so much real estate being reserved for the graphics subsystem, it should go without saying that AMD is looking to maximize GPU computing on Kaveri. Not only is this one of the cornerstones of their HSA initiative but it actually meshes quite well with the current and upcoming market realities. It is also why AMD still believes that a quartet of x86 cores remains a “sweet spot” and doesn’t see the need for giving up valuable die space for an additional two-core CPU module.

As content consumption and creation are now given equal weight by many consumers, the GPU’s resources are needed for applications that can benefit from its massively parallel nature. Media playback, multimedia editing and gaming alongside accelerated UI features have all been given priority in Kaveri’s design, but actually getting these to function correctly and efficiently on a balanced architecture is really the final frontier.


Coming back to our points about synergy, AMD has an architecture which is adaptable to both CPU and GPU workloads but they are also providing developers and programmers the software tools they need to leverage this hardware advantage. This all-in-one solution is the only way they’ll be able to achieve broad acceptance for HSA and the potential advantages it brings to the table.

On the hardware side, uniform memory access between components and workload dispatch equality bring a truly heterogeneous environment that much closer to reality as the CPU and GPU can share on-die resources. Meanwhile, the key to actually unlocking the architecture’s potential horsepower lies with software which, in this case, takes the form of AMD’s unified software development kits and their new CodeXL developer suite. We’ll take a look at each of these individually a bit later.

As it currently stands, Kaveri isn’t meant to compete against Intel’s higher-end Haswell models, nor will it be priced over $200. AMD is firmly planted on the value end of the spectrum but has still thrown in some features that will appeal to enthusiasts. As you’ll see on the next page, though, the most interesting facet of this new architecture isn’t necessarily its flagship APUs. Rather, the highly efficient and still powerful mid-range SKUs will likely hold the most exciting elements for anyone reading this article.

 
Meet The A10-7850K, A10-7700K & A8-7600


As with previous generations, Kaveri APUs will hit a number of different price points, though at least initially the lineup will be limited in scope to just three products: the A10-7850K, A10-7700K and A8-7600. This is quite different from the Trinity and Richland launches, which saw a broader spectrum of offerings. Whether or not a more targeted approach will work better than casting a wide net is anyone’s guess right now, but focusing on three primary APUs will surely be less confusing for potential buyers than Intel’s current 35-processor Haswell lineup.

Speaking of Haswell, as we stated on the previous page, Kaveri isn’t meant to trade blows with the likes of Intel’s expensive i7 series in purely CPU-bound tasks. With the flagship A10-7850K starting at just $175, AMD’s more balanced approach may lose them some positions in traditional benchmarks but, from a value perspective and with a ton of GPU resources propping up the architecture, it’s hard to argue against Kaveri.

As you look through the specifications below, one thing will become evident: AMD has taken a slight step backwards in the clock speed department. While this is unfortunate considering all of the IPC increases built into the Steamroller architecture and the improvements brought forward by the GCN graphics design, it’s more than obvious GlobalFoundries’ 28nm SHP process hasn’t yielded the hoped-for results.


In the chart above, you’ll likely notice something a bit different: the addition of so-called “Compute Cores”. In AMD’s new terminology, the total number of Compute Cores is calculated by taking the number of CPU cores (in this case four) and adding the number of graphics Compute Units (between six and eight depending on the APU). It may seem confusing but, given Kaveri’s closer communication between these two primary on-die components, it likely made sense to marketing types.
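The marketing math itself is trivial; a quick sketch of how those Compute Core figures come together:

```python
# AMD's "Compute Cores" metric: x86 cores plus GPU Compute Units.
def compute_cores(cpu_cores, gpu_compute_units):
    return cpu_cores + gpu_compute_units

print(compute_cores(4, 8))  # A10-7850K: marketed as "12 Compute Cores"
print(compute_cores(4, 6))  # A10-7700K / A8-7600: "10 Compute Cores"
```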

Starting at the top of AMD’s Kaveri lineup is the aforementioned A10-7850K. Its Turbo and Base clocks of 4GHz and 3.7GHz respectively are down by a substantial amount in comparison to the A10-6800K, as is its price of $175. It even carries forward the same 4MB of L2 cache (or 2MB per Steamroller module). Where the real differentiation lies is with the graphics processor: it boasts an R7-series classification with a core count of 512, which lies between an R7 250 and an R7 260X. Naturally, its lower clock speeds and access to about 1GB of system DDR3 will ensure raw performance lies closer to the R7 250 but that’s still some incredible potential for an integrated solution. All of this is wrapped into a package rated at a TDP of just 95W. Plus, with the K-series designation and unlocked multiplier, it should appeal to overclockers as well.


The A10-7700K is meant to dominate Intel’s i3-4330 at the $150 to $160 price point but it does run at substantially lower frequencies than the A10-6700, the APU it’s meant to replace. It too is equipped with an R7-series graphics core but in this case it comes equipped with 384 SIMDs. Even though this APU has notably reduced specifications versus the A10-7850K, it still carries a TDP of 95W but as a budget-friendly overclocker for HTPC users, AMD may have a home run.

While the A10-7850K and A10-7700K are the high flyers of AMD’s new APU lineup, the A8-7600 is arguably its bread and butter. Of the three Kaveri SKUs, this is the desktop part most system integrators will likely choose for their upcoming products and also an excellent option for DIYers who want an efficient, low cost option for HTPCs or SFF builds. Its on-paper CPU specs are a hair behind the more expensive A10-7700K but this $120 APU still boasts a 384-core graphics payload despite a lack of unlocked multipliers. Unfortunately, it will only be launching about a month after its bigger brothers.


The A8-7600 has one more trick up its sleeve: a fully configurable TDP. With the flick of a BIOS setting, the built-in power management routines can either be unlocked completely or capped at 65W or 45W. In our testing, the “Disabled” option performed identically to the 65W setting which means some limiters are still in place but the really interesting addition here is the 45W setting.

At 65W, the A8-7600 is already quite efficient and outputs a minimal amount of heat but 45W brings these characteristics to a whole other level. Fans can be turned down to their quietest setting and the whole system will sip a minimal amount of power, at the sacrifice of CPU frequencies of course. GPU clocks, on the other hand, remain the same so gaming, multimedia and compute performance really won’t be touched. In short, this APU can become an HTPC user’s dream come true: performance when it’s needed coupled with absolute silence in every other scenario.

There is one minor caveat to this option: you need full access to the system’s BIOS to modify the Target TDP. At this time, AMD doesn’t offer this option within their OS-based Control Center software. There’s also no indication of how system builders will sell their wares; will there be an indication whether or not a particular setup is set to 45W? AMD couldn’t tell us.
 
Kaveri’s Platform & (Some) Backwards Compatibility


As with many other AMD platforms, there is a lot of commonality between the previous generations and the iterations shipping with Kaveri. However, amongst all of the items being carried forward, one major thing has changed: the socket. Gone is the FM2 socket that’s been around since Trinity and in its place is the so-called FM2+.


FM2+ shares much of its base pin-out with FM2, which means FM2-compatible Trinity and Richland processors are forwards compatible with the new motherboards. Unfortunately, due to a number of design changes, pathway revisions and component requirements, Kaveri processors will NOT be backwards compatible with older motherboards. However, any heatsink or water cooler that fit an FM1 or FM2 socket will be compatible with FM2+.

With that out of the way, let’s get to the real meat of this section. AMD’s new motherboards will come in three different flavors: A88X, A78 and A55. The first two take over the high-end and mid-level segments respectively from the A85 and A75, while the A55 designation is carried over to the FM2+ socket.


Most buyers will likely gravitate towards the already-available A88X platform since it combines an impressive number of features into motherboards that retail for as little as $70. Naturally, there will be quite a few more expensive options that target enthusiasts looking for things like Crossfire compatibility but for the most part, the A88X motherboards will cost significantly less than their Intel counterparts.

There really haven’t been all that many changes rolled into the new Bolton D4 / A88X Fusion Controller Hub since its predecessor, the A85, already featured a host of up-to-date connectivity standards. The A88X acts as an all-in-one I/O workhorse by controlling the system’s storage subsystem, BIOS, USB ports and a portion of the PCI-E devices. The only noteworthy departure is the inclusion of a DotHill RAID solution which complements AMD’s RAID Xpert while also adding proper SSD TRIM pass-through.

The FCH connects to the APU via a 2 GB/s Unified Media Interface, with the Kaveri architecture finally bringing PCI-E 3.0 compatibility to the table. Graphics cards can connect to the motherboard via a single x16 PCI-E 3.0 connection or a pair of x8 links should a dual GPU setup be preferable. At this time, only Crossfire is officially supported.
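For context, the theoretical gap between the UMI link and those PCI-E 3.0 lanes is easy to work out. The sketch below uses the standard 8 GT/s per-lane rate and 128b/130b encoding from the PCI-E 3.0 specification:

```python
# Peak theoretical bandwidth of a PCI-E 3.0 link: 8 GT/s per lane with
# 128b/130b encoding, divided by 8 bits per byte -> GB/s.
def pcie3_gb_per_s(lanes):
    return lanes * 8 * (128 / 130) / 8

print(round(pcie3_gb_per_s(16), 2))  # single x16 slot
print(round(pcie3_gb_per_s(8), 2))   # each x8 link in a dual-GPU setup
# Either figure dwarfs the 2 GB/s Unified Media Interface to the FCH.
```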


In most of these articles we focus on high-end motherboards in this section but this time around things are being done a bit differently. AMD wanted to showcase the adaptability of their Kaveri APUs so they sent along an ASRock FM2A88X-ITX+ board along with an ultra quiet Noctua NH-L9a heatsink.

This mini-ITX board goes for just $99 which, when paired up with an A8-7600 and 16GB of memory, gives builders a great small form factor system for under $250. Combine this with an SSD or hybrid HDD and you’ll have a responsive platform for everything from light gaming to HD movie decoding.
 

Inside Kaveri: Steamroller and GCN


Both Trinity and Richland incorporated a slightly revised Bulldozer architecture codenamed Piledriver, which fixed some of Bulldozer’s issues while incorporating some minor IPC and thread efficiency improvements. Kaveri meanwhile skips ahead to the eagerly anticipated Steamroller architecture.

Like Piledriver before it, Steamroller doesn’t bring about a revolutionary step forward for AMD’s x86 architecture but the changes built into it do represent a logical evolution. AMD has focused on two key areas above and beyond Piledriver’s revisions: IPC boosts and single threaded performance. Indeed, if Steamroller were placed directly against the original Bulldozer in a clock-for-clock comparison, this new revision would be some 30% faster on average. Instructions per clock will be a key metric of differentiation here since Kaveri operates at lower clock speeds than its 32nm predecessor.


At a base architectural level, AMD has retained the same Bulldozer modularized design with a single Steamroller module containing a pair of x86 “cores” and 2MB of shared L2 cache. Two of these have been added to a typical Kaveri APU for a total of four physical cores and 4MB of L2 cache.

As we’ve already mentioned numerous times, AMD’s primary focus with Steamroller was to improve performance per watt, which was partially achieved by the move to a 28nm manufacturing process. At a more fine-grained level, on-chip computational efficiency was improved by addressing caching accuracy, optimizing branch prediction, and redesigning part of the core’s scheduling routines. Within the pipeline stages, there has been a 25% increase in potential dispatches per thread and the L1 caching hierarchy has undergone some major refinements in the way it handles store functions. In plain English, these changes allow the x86 cores to be fed with information faster, which improves IPC by roughly 10% over Piledriver while also rolling in some impressive single thread performance benefits.
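To see why that IPC figure matters, a simple throughput model (performance ≈ IPC × clock) shows a hypothetical 10% IPC gain roughly offsetting the drop from the A10-6800K's 4.4GHz turbo to the A10-7850K's 4.0GHz. The normalized IPC values are illustrative, not measured results.

```python
# Back-of-envelope: effective single-thread throughput ~ IPC x clock.
# IPC values are normalized and hypothetical; clocks are max turbo.
richland_perf = 1.00 * 4.4   # A10-6800K: baseline IPC at 4.4 GHz
kaveri_perf   = 1.10 * 4.0   # A10-7850K: ~10% more IPC at 4.0 GHz

print(round(kaveri_perf / richland_perf, 2))  # ~1.0, i.e. roughly a wash
```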

One of the major shortcomings of previous AMD core designs has been their support (or lack thereof) for full-speed legacy instruction set execution. While Steamroller doesn’t change the game in this regard, some of the more targeted IPC improvements will benefit these situations.


Nearly all of the drastic improvements to Kaveri take place within its graphics subsystem. Gone is the VLIW4 architecture of Trinity / Richland and in its place is a fully equipped R7-series core which boasts AMD’s second generation GCN architecture. In its maximum layout, it has 512 SIMD cores and 32 texture units spread over eight Compute Units alongside support for DX 11.2, Eyefinity, Mantle and TrueAudio.

In many ways this graphics processor uses a design that’s similar to current R7 260X and R7 250 equipped with the Bonaire core. The only real difference from a graphics processing level is the unified shared memory structure present in Kaveri APUs. This is a quantum leap forward since AMD basically skipped over the Southern Islands architecture and went straight towards their latest design.

When it comes to graphics compute, the R7-series core within Kaveri houses features from the higher-end R9 290 parts rather than borrowing from previous generations. Its eight asynchronous compute engines feature independent scheduling and work item dispatch for efficient multi-tasking and the ability to operate in parallel with the graphics command processor. This will drastically affect how well Kaveri handles tasks like OpenCL workloads and cryptocurrency mining.


Drilling down a bit further into those Compute Units, we see each is made up of 64 SIMD cores with local memory and L1 cache in place for optimized processing. There is also a quartet of texture units, though the ROPs reside in a secondary render backend structure.

These Compute Units can be disabled or enabled individually to create new parts. For example, while the A10-7850K features the full allotment of eight CUs, the A10-7700K and A8-7600 have two units disabled, resulting in a graphics processor with 384 cores and 24 texture units.
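Since every CU carries the same complement of resources, the per-SKU shader and texture unit counts fall straight out of the enabled CU count:

```python
# Each GCN Compute Unit contributes 64 SIMD cores and 4 texture units.
SIMDS_PER_CU, TMUS_PER_CU = 64, 4

def gpu_config(enabled_cus):
    return enabled_cus * SIMDS_PER_CU, enabled_cus * TMUS_PER_CU

print(gpu_config(8))  # A10-7850K: (512, 32)
print(gpu_config(6))  # A10-7700K / A8-7600: (384, 24)
```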


While the rendering hierarchy hasn’t changed much from the first generation GCN Bonaire core (Kaveri’s GPU can only handle one primitive per clock), there have been some major improvements over Richland and Trinity. For example, there’s a more robust caching structure, improved geometry shader processing and an updated tessellation engine with off-chip buffering optimizations. To sum this up, Kaveri’s graphics abilities should allow the lower-end 45W A8-7600 to compete on a level footing against a 100W A10-6800K in graphics intensive workloads.


In an effort to continue their class-leading support for multimedia playback formats, AMD has also updated their Universal Video Decoder and Video Codec Engine. Not only has full 4K support been added but the UVD boasts increased error resiliency when playing back H.264 or AVCHD. Meanwhile, the VCE has seen the lion’s share of improvements with wireless display compatibility and additional H.264 support.


AMD’s Dual Graphics also makes a comeback with Kaveri with the R7 260X and R7 250 being compatible in a type of hybrid Crossfire alongside the onboard graphics engine.

Performance improvements are of course dramatic but the appeal of Kaveri lies purely with its efficiency when its integrated GPU is used so we doubt many users will take advantage of this ability.


Last but not least in this section we wanted to mention the performance benefits that can be realized through the use of faster memory. While every integrated graphics processor from Haswell’s HD4000-series to Kaveri will see some impressive uplifts when paired up with higher bandwidth modules, it’s still good to see these new APUs can reward anyone choosing higher end RAM.
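The reason faster RAM pays off is simple: the IGP's only frame buffer is dual-channel DDR3. A quick sketch of the theoretical bandwidth on offer (64-bit channels, transfer rate in MT/s):

```python
# Theoretical dual-channel DDR3 bandwidth: channels x 64-bit bus width
# x transfer rate, converted to GB/s.
def ddr3_gb_per_s(mega_transfers, channels=2):
    return channels * 64 / 8 * mega_transfers / 1000

print(ddr3_gb_per_s(1600))  # common budget kit: 25.6 GB/s
print(ddr3_gb_per_s(2133))  # high-end kit: ~34.1 GB/s, shared by CPU and GPU
```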
 

hUMA & hQ; Memory & Queuing Go Heterogeneous


While Trinity and its predecessor Llano took the first tentative steps towards a heterogeneous architecture approach, they actually featured very few architectural elements that could be considered next generation in nature. However, Kaveri can be considered the first leap towards the realization of AMD’s long term goals. One of its key elements is the new Heterogeneous Unified Memory Architecture, or hUMA, which is designed to eliminate some of HSA’s latent bottlenecks.


While the last 10 years have seen an almost ludicrous theoretical increase in integer and floating point performance on GPUs, these aspects have by and large stagnated on the CPU side. In essence, the GPU can offer up to ten times the throughput of a high end CPU in highly parallelized workloads. That may sound like a significant, nearly unbridgeable chasm between two disparate elements within a system but, when working outside of mere theoretical performance, the GPU has ultimately struggled to reach its full potential. We also can’t forget that serial-based workloads are by and large dominated by the CPU.

Outside of its obvious superiority in processing 3D elements within games and OpenGL software, actually harnessing the GPU’s power has always been an issue for programmers. Hence, there are very few applications which can make use of the GPU’s compute components. hUMA plans to level that playing field by allowing the graphics-oriented elements of this equation to play a greater role in overall system performance.


hUMA may sound like a new concept but its roots are firmly planted in the past. In many ways this is the next evolutionary step for the Unified Memory Architecture AMD was instrumental in pioneering nearly a decade ago. However, this time, its various architectural enhancements are focused on facilitating the communication between the CPU and GPU.

Instead of the GPU being used for some programs and the CPU for others, with hUMA programmers now have the ability to leverage both at the same time, thus optimizing their respective performance thresholds. More importantly, software won’t have to worry about doing the hand-off since it is being accomplished natively within the APU’s architecture.


As one might expect from its name, hUMA accomplishes its tasks by incorporating broad-scale heterogeneous memory integration across the APU’s processing stages. Before hUMA, both the x86 processing cores and graphics subsystem had their own respective memory controllers and addressable memory pools, even within AMD’s Trinity and Llano. In some cases, the amount of memory dedicated to each element could be user-modified but, for the most part, there was nothing dynamic about it and efficiency was lost.

hUMA, on the other hand, allows the GPU to enter the world of unified memory by linking it to the same memory address space as the CPU. This leads to an intelligent computing solution which enables the x86 cores, GPU stages and other sub-processors to work in harmony on a single piece of silicon, within a unified memory space, while dynamically directing processing tasks to the best-suited processor.

This wasn’t an easy accomplishment by any stretch of the imagination. Not only did AMD have to update the R7-series GPU’s instruction set so it could communicate more effectively with the system’s memory, but hUMA also opened up a world of potential issues as programmers came to grips with what could have been a tricky balancing act.

As we’ll talk about a bit later, the programming issue was resolved, resulting in a litany of noteworthy advantages for systems with hUMA. While this approach may not allow a complete unification between a discrete GPU and its associated CPU, it has far-reaching implications for the APU market and its viability against Intel’s Haswell processors.


In systems without hUMA, both processors could be used in parallel but the entire process was inefficient. It involved a game of hot potato where a large amount of data was being copied between two memory address spaces, causing redundancy where AMD felt there shouldn’t be any. In order to facilitate the data handoff, hUMA ensures all of the data is passed in a dynamic form through the uniform memory interface, resulting in quicker information handling. It isn’t completely shutting out the CPU either. Rather, think of this as an on-the-fly load balancing act between two fully integrated system components.
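That "hot potato" can be pictured with a deliberately simplified sketch: the pre-hUMA path stages copies in and out of a separate GPU address space, while the hUMA path lets both processors touch one shared buffer. This is a conceptual illustration only, not AMD's actual API.

```python
# Toy model of the copy overhead hUMA removes. Purely illustrative.
def without_huma(data):
    staging = list(data)               # copy into the GPU's address space
    result = [x * 2 for x in staging]  # stand-in for a GPU kernel
    return list(result)                # copy the result back to CPU space

def with_huma(buffer):
    for i, x in enumerate(buffer):     # GPU works on the shared buffer
        buffer[i] = x * 2              # no staging copies in either direction
    return buffer

print(without_huma([1, 2, 3]))  # two extra copies made along the way
print(with_huma([1, 2, 3]))     # same answer, zero copies
```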

AMD has said this approach should simplify the “artistry of programming”, allowing programmers to use their time to deliver the best possible experience to the end user. With that on the table, it isn’t like these processing stages weren’t communicating before. However, now that relationship will be more like close siblings having a friendly chat rather than a divorced couple in the midst of a tug-of-war.


With all of the technical elements pushed aside, what really matters is how this technology will make it into the hands of you and me. Thus far GPU compute has been largely relegated to the sidelines since programmers need special languages, tools and memory models to unlock and access its performance capabilities.

One of the main goals here is to get the buy-in of developers. Without software that supports hUMA, it’ll quickly become yet another standard which was cast aside before fully realizing its potential. In order to accomplish this sometimes hard to attain stamp of approval from the development community, AMD has ensured programming for hUMA-based systems is as efficient as possible. It is fully compatible with industry-standard programming languages like C++, .NET and Python, ensuring the developer community can use existing methodologies in order to attain optimal results.


When designing hUMA, AMD asked a simple question: how do we leverage the relative strengths of our APUs without reinventing the wheel? By creating a direct link between the CPU and GPU, they could accomplish just that. Instead of a Berlin Wall-like partition between these architectural elements, future APUs will be able to dynamically distribute tasks to the best-suited co-processor in a way that’s completely transparent to the end user.

Naturally, this all hinges on acceptance from the developer community but with their streamlined use of industry-standard programming languages, AMD seems to have that base covered perfectly. With hUMA in place, hopefully Kaveri will be given its chance to shine.


Heterogeneous Queuing


With the convergence of CPU and GPU workloads in Kaveri, a certain amount of resource sharing has to happen behind the scenes. While we have already covered AMD’s Heterogeneous Unified Memory, there’s another technology being developed to balance workloads: it’s called Heterogeneous Queuing.


In its most basic form, Heterogeneous Queuing (hQ) defines how processors interact equally across a general address space while accessing a common resource pool. The last thing AMD needed was for the two primary elements of their new architecture to continually fight for the same on-die capital.

With Heterogeneous Queuing in place, AMD has added system-level atomics for synchronizing workloads across the different cores so the GPU and CPU have equal flexibility to create and dispatch workloads. As with all other mixed functions, this resource sharing only happens in supported accelerated applications which use OpenCL or other compatible programming languages.
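Conceptually, hQ gives each side its own user-level work queue and lets either processor enqueue packets for the other without a kernel-mode round trip. The sketch below illustrates that idea only; it is not AMD's actual queuing API.

```python
# Conceptual sketch of heterogeneous queuing: either processor may
# dispatch work into the other's user-level queue.
from collections import deque

cpu_queue, gpu_queue = deque(), deque()

def dispatch(queue, task):
    queue.append(task)  # user-mode enqueue; no driver/kernel transition

dispatch(gpu_queue, "data-parallel kernel")  # CPU hands work to the GPU
dispatch(cpu_queue, "serial follow-up")      # GPU can enqueue back to the CPU
print(len(cpu_queue), len(gpu_queue))
```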

Unfortunately, neither of these critical features will be enabled at launch. Simply put, the software necessary for support isn't ready on AMD's part and compatible applications are non-existent.
 
Mantle Comes to APUs


In order to understand where Mantle is coming from, we need to go back in time and take the Playstation 3 as an example of how AMD wants to change the way games interact with a PC’s graphics subsystem. While the PS3’s Cell processor and its associated graphics core are extremely hard to program for, games like Uncharted 3 and The Last of Us boast visuals that are equal to if not better than some of today’s newest PC games which run on hardware that was unimaginable when Sony launched their console.

So how was this seemingly impossible feat accomplished? Consoles give developers easier access to the graphics subsystem without messy driver stacks, loads of API overhead, a cluttered OS and other unnecessary eccentricities eating up valuable resources. As a result, console games are able to fully utilize a given resource pool and allow programmers to do more with less. In some cases (the PS3 is another excellent example of this) the flow towards true utilization takes a bit longer as programmers have to literally relearn how to approach their trade but AMD's focus here is to streamline the whole process.


Mantle has been created to reduce the number of obstacles placed before developers when they’re trying to create new PC titles or port games over from consoles. In the past, things like CPU optimizations and efficient inter-component communication have largely been pushed aside as developers struggled to come to grips with the wide range of PC hardware configurations in use. This leads to multi-core CPUs remaining idle, the GPU’s on-die resources being wasted and a real lack of optimal performance conditions on the PC, regardless of its advanced hardware.

There’s also a very heavy software component when programming for the PC environment since developers routinely have to contend with a heavy driver stack and slowly evolving primary-level software. That’s a problem since the software / memory interaction becomes a rather stringent traffic light, bottlenecking the flow of information between the CPU and GPU and limiting throughput.

DirectX 10 and DX11 have gone a long way towards addressing some of these roadblocks but their overall performance is still hindered by their high-level nature. They keep communication between the API, GPU, game and CPU under strict control, something developers don’t want to wade through. When using them, transmitting a large number of draw calls leads to a CPU bottleneck, meaning today’s graphics architectures can never realize their full potential.


This is where Mantle gets factored into the equation; not as a direct replacement for DirectX or OpenGL but rather as a complementary force. It’s an API that focuses on “bare metal”, low level programming with a thin, lightweight driver that effectively manages resource distribution, grants additional control over the graphics memory interface and optimizes those aforementioned draw-calls. Think of Mantle like a low level strafing run that targets key components rather than high level carpet bombing that may or may not achieve a given objective.

With a more direct line of access to the GPU, AMD is hoping that GCN’s performance can increase drastically through rendering efficiencies rather than having to throw raw horsepower at problems. Opening up new rendering techniques which aren’t tied at the hip to today’s primary APIs is also a possibility. Theoretically, this could allow Mantle to process a ninefold increase in draw calls and, more importantly, it should ensure optimizations can be carried over from the console version of a game to the PC and vice versa.


There are some notable speedbumps to this approach as well. While the high-level API (in this case DirectX / Direct3D) will remain the same across multiple hardware and product classes, Mantle is only compatible with GCN. This is great for Kaveri since it houses a GCN-based graphics processor within its confines.

It goes without saying that AMD has won the next generation console race with the Jaguar APUs inside both the Xbox One and PS4, so leveraging those design wins is an integral part of their future strategy. But very little has been said about the high-level and lower-level APIs being used within those products, primarily the Xbox One. Direct3D 11.2 is a given, but the low-level API hasn’t been identified. Microsoft has been forthcoming enough to say that it isn’t Mantle, but the inclusion of native DirectX HLSL compatibility could go a long way towards making AMD’s cross-platform dreams come true.

In many ways, this approach reminds us of 3dfx’s Glide, another low-level application programming interface developed years ago but doomed to failure due to a lack of developer support and its parent company’s eventual demise.


Mantle is particularly important for APUs like Kaveri since it offers a significant performance boost in supporting titles. It can take an A8-7600’s mediocre in-game framerates and bring them up to the mid-level discrete threshold, providing a playable experience at reasonable detail settings. For example, AMD claims they’ve realized a 45% jump in Battlefield 4 performance by switching to Mantle, while other games can benefit even more.
 

TrueAudio & Kaveri; An Audio Masterpiece?


When we think of gaming in relation to graphics cards, the first thing that comes to mind is likely in-game image fidelity and how quickly a given solution can process high graphical detail levels. But realism and player immersion are only partially determined by how “good” a game looks, and there are many other factors that contribute to how engaged a player will be. Unfortunately, in the grand scheme of game design and the push towards higher end graphics, the soundstage is often overlooked despite its ability to define an environment and truly draw a gamer in.

Multi-channel positional audio goes a long way towards player immersion, but the actual quality produced by current solutions isn’t usually up to the standards most expect. We’ve all heard it time and again: a multitude of sounds which get jumbled together, or a simple lack of ambient sound with the sole focus being put on the player’s gunshots or footsteps. Basically, it’s almost impossible to find a game with the high definition, visceral audio tracks found in today’s Hollywood blockbusters despite the fact that developers sink hundreds of millions into their titles.


The lack of developer-generated, high quality audio tracks isn’t for lack of trying. Indeed, the middleware and facilitators are already present in the marketplace, but developers have a finite amount of CPU resources to work with. Typically those CPU cycles have to be shared with primary tasks such as game world building, compute, A.I., physics and simply running the game’s main programming. As you might expect, audio processing sits relatively low in the pecking order and rarely gets the reserved CPU bandwidth many think it deserves. This is where AMD’s TrueAudio gets factored into the equation.

While sound cards and other forms of external audio renderers can take some load off the processor’s shoulders, they don’t actually handle the lion’s share of the processing and sound production. TrueAudio, on the other hand, remains in the background, acting as a facilitator for audio processing and sound creation while remaining easy to use from a development perspective, thus freeing up CPU resources for other tasks.

TrueAudio’s stack provides a highly programmable audio pipeline and allows for decoding, mixing and other features to be done within a versatile environment. This frees programmers from the constraints typically placed upon audio processing during the game creation process.
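As a rough sketch of the kind of work such a pipeline performs, the pure-Python mixer below does the core job of a mixing stage — summing several source tracks with per-track gains and clamping the result — which is exactly the per-sample arithmetic that eats CPU cycles when no dedicated DSP is available. All names here are our own illustration, not part of any AMD API.

```python
import math

def mix_tracks(tracks, gains):
    """Mix equal-length sample lists with per-track gain, hard-clipped to [-1, 1]."""
    out = [0.0] * len(tracks[0])
    for track, gain in zip(tracks, gains):
        for i, sample in enumerate(track):
            out[i] += gain * sample
    return [max(-1.0, min(1.0, s)) for s in out]

# Two sine "tracks" at different frequencies, 100 samples each at 8 kHz.
rate = 8000
t1 = [math.sin(2 * math.pi * 440 * i / rate) for i in range(100)]
t2 = [math.sin(2 * math.pi * 880 * i / rate) for i in range(100)]
mixed = mix_tracks([t1, t2], [0.6, 0.4])
```

A real engine runs dozens of such per-sample loops (plus decoding, resampling and effects) every few milliseconds, which is why offloading them to DSP cores is attractive.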

In order to give TrueAudio some context, let’s compare it to graphics engine development. Audio engineers and programmers usually record real-world sounds and then mix them down or modify layers to create a given effect. Does the player need to hear a gunshot at some point? Record a gunshot and mix accordingly. There is very little ground-up environmental modeling like game designers do with triangles and other graphics tools.

TrueAudio on the other hand allows audio teams to get a head start on the sound development process by creating custom algorithms without having to worry about CPU overhead. As a result, it could allow for more audio detailing without running headfirst into a limited allocation of processor cycles.


According to AMD, one of the best features of TrueAudio is its transparency to developers since it can be accessed through the exact same means as the current audio stack. There aren’t any new languages to learn since it can be utilized through current third party middleware programs, making life for audio programmers easier and allowing for enhanced artistic freedom.

TrueAudio’s position within the audio stack enhances its perception as a facilitator since it runs behind the scenes, rather than attempting to run the show. Supporting game audio tracks are passed to TrueAudio, processed and then sent back to the main Windows Audio stack so it can be output as normal towards the sound card, USB audio driver or via the graphics processor’s HDMI / DisplayPort. It doesn’t take the place of a sound card but rather expands the possibilities for developers and works alongside the standard pipeline to ensure audio fidelity remains high.


TrueAudio is implemented via a set of dedicated Tensilica HiFi EP audio DSP cores housed within the die, both on Kaveri and on supporting Radeon graphics cards (the R7 260X, R9 290 and R9 290X). These cores are dedicated to in-game audio processing and feature floating point as well as fixed point sound processing, which gives game studios significantly more freedom than they currently have. It also allows the audio processing to be offloaded rather than remaining tied at the hip to CPU cycles.

In order to ensure quick, seamless access to routing and bridging, the DSPs have rapid access to local memory via onboard cache and RAM. There’s also shared instruction data for the streaming DMA engine and other secondary audio processing stages. More importantly, the main bus interface plugs directly into the high speed display pipeline and its frame buffer memory for guaranteed memory access at all times.

While TrueAudio ensures that processing can be done on dedicated DSP cores rather than on the main graphics cores, there can still be a CPU component here as well since TrueAudio is simply supplementing what the main processor is already tasked with doing. In some cases, these CPU algorithms can build upon the TrueAudio platform, enhancing audio immersion even more.


One of the primary challenges for audio engineers has always been the creation of a three dimensional audio space through stereo headphones. In a typical setup, the in-game engine does the preliminary processing and then mixes the tracks down to simple stereo sound. Additional secondary DSPs (typically located on a USB headphone amp) then render the track into a virtual surround signal across a pair of channels, adding in the necessary reverberations, separation and other features to effectively “trick” a user into hearing a directionally-enhanced soundstage. The end result is typically less than stellar since the sounds tend to get jumbled up due to a lack of definition.
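A crude flavour of that “trick” can be shown with constant-power panning plus an interaural delay — two of the simplest cues a virtual-surround DSP applies. This is a bare-bones illustration of the principle, not GenAudio’s or AMD’s actual algorithm, and every name in it is our own.

```python
import math

def pan_with_delay(mono, azimuth_deg, sample_rate=48000, head_ms=0.6):
    """Place a mono sample list in the stereo field.

    azimuth_deg runs from -90 (hard left) to +90 (hard right). The function
    applies constant-power gains plus a small delay on the far ear.
    """
    theta = math.radians((azimuth_deg + 90) / 2)     # map azimuth to 0..90 degrees
    left_gain, right_gain = math.cos(theta), math.sin(theta)
    delay = int(sample_rate * head_ms / 1000 * abs(azimuth_deg) / 90)
    pad = [0.0] * delay
    left = [left_gain * s for s in mono]
    right = [right_gain * s for s in mono]
    if azimuth_deg > 0 and delay:    # source on the right: delay the left ear
        left = pad + left[:len(mono) - delay]
    elif azimuth_deg < 0 and delay:  # source on the left: delay the right ear
        right = pad + right[:len(mono) - delay]
    return left, right

mono = [math.sin(2 * math.pi * 440 * i / 48000) for i in range(480)]
left, right = pan_with_delay(mono, 45)   # place the source 45 degrees to the right
```

Real HRTF processing layers frequency-dependent filtering and early reflections on top of these two cues, which is where the DSP horsepower actually goes.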

TrueAudio helps virtual surround sound along by offering a quick pathway for its processing. It uses a high quality DSP which ensures individual channels can be separated and addressed with their own dedicated, primary pipeline. AMD has teamed up with GenAudio to get this figured out and, from the presentations we’ve seen, they’ve made some incredible headway thus far.


While nothing has to be changed from a developer standpoint since all third party applications and runtimes can work with TrueAudio, this new addition can be leveraged for more than just optimizing CPU utilization. Advanced effects, a richer soundstage, clearer voice tracks and more can all be enabled due to its lower overhead and broad-ranging application support. In addition, mastering limiters can allow individual sounds to come through without distortion.

Unlike some applications, TrueAudio isn’t an end-all-be-all solution since it can be used to target select, high bandwidth streams so not all sounds have to be processed through it. AMD isn’t cutting the CPU out of this equation and that’s important as they move towards a heterogeneous computing environment.


As with all new initiatives, the failure or success of TrueAudio will largely depend on the willingness of developers to support it. While it feels like we've been down this road before with HD3D, Bullet Physics and other AMD marketing points from years past that never really got off the ground, we feel like TrueAudio can shine. Developers are already onboard and AMD has gone to great pains to make its development process easy.

Audio is one of the last frontiers of PC gaming that hasn’t already been addressed. Anything that improves the PC audio experience is welcome, but don’t expect TrueAudio to work miracles. It will still only be as good as the endpoint hardware (in this case your headphones and associated sound card), but it should allow better speaker setups to shine, taking immersion to the next level. That’s a big deal for entry-level APUs.
 

Test Setups & Methodology


For this review, we have prepared a number of different test setups, representing many of the popular platforms at the moment. As much as possible, the test setups feature identical components, memory timings, drivers, etc. Aside from manually selecting memory frequencies and timings, every option in the BIOS was at its default setting.


For all of the benchmarks, appropriate lengths are taken to ensure an equal comparison through methodical setup, installation, and testing. The following outlines our testing methodology:

A) Windows is installed using a full format.

B) Chipset drivers and accessory hardware drivers (audio, network, GPU) are installed.

C) To ensure consistent results, a few tweaks are applied to Windows 7 and the NVIDIA control panel:
  • UAC – Disabled
  • Indexing – Disabled
  • Superfetch – Disabled
  • System Protection/Restore – Disabled
  • Problem & Error Reporting – Disabled
  • Remote Desktop/Assistance - Disabled
  • Windows Security Center Alerts – Disabled
  • Windows Defender – Disabled
  • Screensaver – Disabled
  • Power Plan – High Performance
  • V-Sync – Off

D) Windows Update is then run, installing all available updates.

E) All programs are installed and then updated.

F) Benchmarks are each run three to eight times, and unless otherwise stated, the results are then averaged.

G) All processors had their energy saving options / c-states enabled
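Step F above — run each benchmark several times and average the results — can be sketched as a tiny timing harness. The workload and run count below are placeholders for illustration, not the scripts actually used in this review.

```python
import statistics
import time

def benchmark(workload, runs=5):
    """Run a callable several times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

# Placeholder workload: sum the first million integers.
mean_s = benchmark(lambda: sum(range(1_000_000)), runs=3)
print(f"mean: {mean_s * 1000:.2f} ms over 3 runs")
```

Averaging several runs smooths out background-task noise, which matters on a system where C-states and other power management features remain enabled, as in step G.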
 

System Benchmarks: AIDA64 / Cinebench r11.5


In this section, we will be using a combination of synthetic benchmarks which stress the CPU and system in a number of different domains. Most of these tests are easy to acquire or are completely free to use so anyone reading this article can easily repeat our tests on their own systems.

To vary the results as much as possible, we have chosen a selection of benchmarks which focus upon varied instruction sets (SSE, SSE3, 3DNow!, AVX, etc.) and different internal CPU components like the floating point units and general processing stages.



AIDA64 Extreme Edition


AIDA64 uses a suite of benchmarks to determine general performance and has quickly become one of the de facto standards among end users for component comparisons. While it may include a great many tests, we used it for general CPU testing (CPU ZLib / CPU Hash) and floating point benchmarks (FPU VP8 / FPU SinJulia).


CPU ZLib Benchmark

This integer benchmark measures combined CPU and memory subsystem performance through the public ZLib compression library. The CPU ZLib test uses only basic x86 instructions but is nonetheless a good indicator of general system performance.
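The same idea can be reproduced at home with Python’s standard zlib binding — compress a buffer, time it, and report throughput. This is a rough stand-in for AIDA64’s test, not its actual code, and the payload is our own placeholder.

```python
import time
import zlib

def zlib_throughput(data: bytes, level: int = 6) -> float:
    """Compress data once and return throughput in MB/s."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(compressed) == data  # sanity: lossless round trip
    return len(data) / elapsed / 1e6

payload = b"The quick brown fox jumps over the lazy dog. " * 20000
print(f"{zlib_throughput(payload):.1f} MB/s")
```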




CPU Hash Benchmark

This benchmark measures CPU performance using the SHA1 hashing algorithm defined in the Federal Information Processing Standards Publication 180-3. The code behind this benchmark method is written in Assembly. More importantly, it uses MMX, MMX+/SSE, SSE2, SSSE3, AVX instruction sets, allowing for increased performance on supporting processors.
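The SHA1 primitive being timed here is the same one exposed by any standard crypto library. A minimal Python throughput check looks like the sketch below — a stand-in for illustration, not AIDA64’s hand-tuned assembly.

```python
import hashlib
import time

def sha1_throughput(data: bytes) -> float:
    """Hash data once with SHA1 and return throughput in MB/s."""
    start = time.perf_counter()
    hashlib.sha1(data).hexdigest()
    elapsed = time.perf_counter() - start
    return len(data) / elapsed / 1e6

# Known-answer check from the FIPS 180 test vectors:
assert hashlib.sha1(b"abc").hexdigest() == "a9993e364706816aba3e25717850c26c9cd0d89d"
print(f"{sha1_throughput(b'x' * 10_000_000):.1f} MB/s")
```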




FPU VP8 / SinJulia Benchmarks

AIDA’s FPU VP8 benchmark measures video compression performance using the Google VP8 (WebM) video codec Version 0.9.5 and stresses the floating point unit. The test encodes 1280x720 resolution video frames in 1-pass mode at a bitrate of 8192 kbps with best quality settings. The content of the frames is then generated by the FPU Julia fractal module. The code behind this benchmark method utilizes MMX, SSE2 or SSSE3 instruction set extensions.

Meanwhile, SinJulia measures the extended precision (also known as 80-bit) floating-point performance through the computation of a single frame of a modified "Julia" fractal. The code behind this benchmark method is written in Assembly, and utilizes trigonometric and exponential x87 instructions.
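The “Julia” workload both tests share boils down to iterating z → z² + c per pixel until the value escapes — heavy, branchy floating-point math. A minimal Python version (standard double precision, not the 80-bit x87 path SinJulia exercises) shows the core loop:

```python
def julia_escape(zx: float, zy: float, cx: float, cy: float, max_iter: int = 256) -> int:
    """Iterations until z = z^2 + c escapes |z| > 2, starting from (zx, zy)."""
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter

# Render a tiny "frame": escape counts over a 64x64 grid for c = -0.8 + 0.156i.
frame = [[julia_escape(-1.5 + 3.0 * x / 63, -1.5 + 3.0 * y / 63, -0.8, 0.156)
          for x in range(64)] for y in range(64)]
```

Each pixel is independent, which is why this kind of workload scales cleanly across cores and why it makes a good stress test for a CPU’s floating point units.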





CineBench r11.5 64-bit


The latest benchmark from MAXON, Cinebench R11.5 makes use of all your system's processing power to render a photorealistic 3D scene using various different algorithms to stress all available processor cores. The test scene contains approximately 2,000 objects containing more than 300,000 total polygons and uses sharp and blurred reflections, area lights and shadows, procedural shaders, antialiasing, and much more. This particular benchmark can measure systems with up to 64 processor threads. The result is given in points (pts); the higher the number, the faster your processor.



RESULTS: The first round of benchmarks pretty much backs up what AMD was warning about: due to its lower clocks, the A10-7850K either matches or loses to its predecessor, the A10-6800K. The only reason things are even close is the IPC improvements AMD built into the Kaveri architecture.

The A8-7600 on the other hand seems to be an excellent all-round performer considering its low price. Granted, switching it to 45W brings about a large drop in performance, but it still remains quite competitive.
 
System Benchmarks: Civ V / PCMark 7



Civilization V: Gods & Kings Unit Benchmark


Civilization V includes a number of benchmarks which run on the CPU, GPU or a combination thereof. The Unit Benchmark simulates thousands of units and actions being generated at the same time, stressing multi-core CPUs, system memory and the GPU. We give the non-rendered score below as it is more pertinent to overall CPU performance within the application.




PCMark 7


PCMark 7 is the latest iteration of Futuremark’s system benchmark franchise. It generates an overall score based upon system performance with all components being stressed in one way or another. The result is posted as a generalized score. We also give the Computation Suite score as it isolates the CPU and memory within a single test, without the influence of other components.




RESULTS: Once again in these tests we see the A10-7850K being narrowly beaten by the A10-6800K. This is a trend which will likely continue into other benchmarks as well since its IPC improvements can only go so far towards beefing up performance.

The A8 looks excellent, often performing within 10-15% of its more expensive sibling when in 65W mode. Switching to 45W does bring about a massive drop in the TDP-limited Civilization V benchmark, but there are very few instances when a normal user will put this type of load on a lower end part.
 