Kaveri Mobile APUs; AMD's FX Reincarnated

SKYMTL

HardwareCanuck Review Editor
Staff member
Joined
Feb 26, 2007
Messages
12,840
Location
Montreal
With the introduction of Kaveri and its HSA features, AMD’s APU lineup took a large step towards providing a competitive feature set. However, regardless of how well the desktop side has been received, there was still a large hole within AMD’s lineup: until now, the mobile computing market has been serviced by the previous generation Richland architecture. After months of waiting, Kaveri for notebooks has finally arrived in performance and ultra low voltage forms.

The advent of mobile Kaveri chips comes at a critical juncture for AMD since Haswell CPUs have gobbled up the lion’s share of notebook processor sales. Put simply, the 32nm Richland APUs stood very little chance of competing on a level footing against Intel’s latest generation 22nm offerings. From battery life to x86 performance, Haswell was vastly superior so system builders found very few reasons to use Richland despite its superiority from a graphics and GPU compute perspective. Kaveri is supposed to change that equation in a big way.

3.PNG

One of the major differentiators with Kaveri is its implementation of several Heterogeneous System Architecture (HSA) features. Heterogeneous Unified Memory Architecture (hUMA) and Heterogeneous Queuing (hQ) have been carried over from the desktop iteration. AMD hopes these will allow the APU’s x86 processing cores and graphics modules to back each other up in supporting applications, thereby creating a vastly more powerful processing ecosystem. The only concern at this point in time is the lack of software which natively supports hUMA and hQ. HSA and its associated technologies rely on software support to achieve their ultimate goals but without that key component, AMD’s architectures may continue to underachieve.

In an effort to broaden the appeal of APUs despite a slow adoption rate of HSA, AMD has boosted Kaveri’s performance in other areas as well. By utilizing their Steamroller architecture, the IPC of x86 cores has increased and GCN-based GPU cores allow for a noteworthy speedup on the graphics front. Furthermore, the move to a 28nm manufacturing process boosts overall architectural efficiency to the point where AMD has been able to accomplish significantly better performance while maintaining competitive TDP levels.

When looking at the new APU lineups below, you’ll notice that AMD has done a lot of cutting and rationalizing. Each segment only receives a few relatively well targeted SKUs while the A4 and E-series have moved down into the mainstream Puma+ based Beema 6000-series stack. In addition, none of the Kaveri-based mobile processors are meant to compete head to head against Intel’s higher performance Haswell chips.

4.PNG

Looking at AMD’s so-called “Performance” notebook lineup (denoted by a “P” at the end of the product number), you’ll notice that an old friend has returned: the FX-series. Don’t take this to mean that the FX-7600P has an unlocked multiplier or any other overclocker-friendly tools in its box because it doesn’t, nor can it even be overclocked. Rather, AMD figures they finally have an APU which is powerful enough to go up against some of Intel’s best alternatives so this storied brand name is being effectively revived in another role.

Operating with a base clock of 2.7GHz and a Turbo frequency of 3.6GHz, the FX-7600P is the fastest mobile APU AMD has ever launched. Even though its highest speed is only 100MHz faster than the A10-5750M, performance should be something in the neighborhood of 12-15% better due to Steamroller’s IPC improvements. It also receives R7-series graphics with eight SIMD engines totaling 512 Radeon cores operating at a maximum of 686MHz (AMD no longer publishes Base clocks). There’s also support for DDR3-2133 memory and a 4MB L2 cache. Even though performance has received a serious shot of adrenaline, TDP has remained at 35W.
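
As a rough sanity check on that 12-15% figure, here is a back-of-envelope sketch that multiplies the clock speed ratio by the roughly 10% IPC gain over Piledriver quoted in the architecture section of this article. It assumes performance scales as clock times IPC, which ignores memory speed, Turbo residency and thermal limits, so treat it as an illustration rather than a prediction. The function name is our own.

```python
# Back-of-envelope check of the quoted 12-15% uplift.
# Clock speeds come from the article; the IPC gain is the approximate
# Steamroller-over-Piledriver figure, not a measured value.
FX_7600P_TURBO_GHZ = 3.6
A10_5750M_TURBO_GHZ = FX_7600P_TURBO_GHZ - 0.1  # "only 100MHz faster"
IPC_GAIN = 0.10  # roughly 10% over Piledriver


def estimated_speedup(new_clock, old_clock, ipc_gain):
    """Assume performance ~ clock x IPC (a deliberate simplification)."""
    return (new_clock / old_clock) * (1.0 + ipc_gain) - 1.0


uplift = estimated_speedup(FX_7600P_TURBO_GHZ, A10_5750M_TURBO_GHZ, IPC_GAIN)
print(f"Estimated uplift: {uplift:.1%}")  # prints "Estimated uplift: 13.1%"
```

Under those assumptions the estimate lands inside the 12-15% window, with most of the gain coming from IPC rather than frequency.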

The A10-7400P replaces the outgoing A10-5750M with an APU that has very similar CPU performance but significantly better graphics capabilities. Meanwhile, the A8-7200P should prove to be quite popular among system builders due to its cost and reasonably good improvements over the previous generation’s A8-series.

5.PNG

There’s also an interesting story in AMD’s ultra low voltage category since these APUs are finally able to compete against some of the best ULV processors in Intel’s current inventory. Arguably, these are the most compelling products being announced today.

Once again an FX-series part leads the way with the FX-7500, a quad core processor that can run at 3.3GHz and supports DDR3-1600L memory while boasting a TDP of just 19W. That represents a massive improvement over the A10-5745M even without taking Steamroller’s design improvements into account. More importantly, the FX-7500 has the capability to compete against Intel’s 15W i7-4500U which still requires less power but doesn’t boast the 7500’s graphics capabilities.

Moving a bit further down market, there’s the A10-7300 which is AMD’s spiritual replacement for their last ultra low voltage A10. There’s also an A8-7100 with four cores and four SIMD engines within its R5 series GPU. We’re actually expecting these two APUs to take up the lion’s share of sales since they can pack a punch and offer a compelling price / TDP ratio for notebook manufacturers.

One thing that looks to be missing from the Kaveri notebook lineup is the usual lower end A-series parts: the A6 and A4. At least in the mass market, those segments will be addressed with the Beema ultra mobile parts and to a lesser extent Mullins APUs.

6.PNG

Rounding out the new AMD lineup is a trio of APUs which makes up the new business-focused commercial ultra low voltage product stack. So what exactly makes a “commercial” APU? According to AMD, it’s an emphasis on longevity within each SKU and stability of BIOS and drivers.

Longevity has been attained through a longer product refresh cycle which means the APUs you see above will be sticking around for a while. This facilitates the jobs of IT managers who typically struggle with products that reach end of life status within a short amount of time.

Perhaps the most enticing element of the commercial ULV APUs is AMD’s commitment to a BIOS and software stack that’s released with the stability expected by corporations. That means fewer revisions in the near-term future so corporate buyers can have peace of mind that additional expenses won’t be incurred by rolling out bug fixes on a regular basis. We can’t emphasize enough how hard this will be for AMD to accomplish; hitting the ground running has always been a challenge for them but if they can achieve the aforementioned balance, clients like Lenovo, HP and Dell could line up for this platform.

KAVERI-APU-4.png

There are of course a few returning technologies as well. While Kaveri notebook APUs aren’t pin compatible with Richland and Trinity platforms, they still carry over the Dual Graphics features from previous generations (though it’s now compatible with GCN-based discrete add-in cards) and have fully configurable TDPs.

That concludes our overview of the new notebook-focused Kaveri lineup but it’s not the end of this article. What follows on the upcoming pages is very much a rehash of tech-focused sections of our previous Kaveri desktop review but with a mobile slant. We’ve also posted a number of AMD-supplied benchmarks but within them, you won’t notice any mention of the so-called “Performance” lineup.

With all of that being said, if you’re interested in the fine-grain details behind Kaveri, read on. If not, wait for our reviews of supporting notebooks whenever they’re available.
 
Inside Kaveri: Steamroller and GCN


Both Trinity and Richland incorporated a slightly revised Bulldozer architecture code named Piledriver which fixed some of Bulldozer’s issues while incorporating minor IPC and threaded efficiency improvements. Kaveri meanwhile skips ahead to the eagerly anticipated Steamroller architecture.

Like Piledriver before it, Steamroller doesn’t bring about a revolutionary step forward for AMD’s x86 architecture but the changes built into it do represent a logical evolution. AMD has focused on two key areas above and beyond Piledriver’s revisions: IPC boosts and single threaded performance. Indeed, if Steamroller was placed directly against the original Bulldozer in a clock by clock comparison, this new revision would be some 30% faster on average. Instructions per clock will be a key metric of differentiation here since Kaveri is operating at speeds that are similar to its 32nm predecessor.

KAVERI-APU-16.png

At a base architectural level, AMD has retained the same modularized Bulldozer design with a single Steamroller module containing a pair of x86 “cores” and 2MB of shared L2 cache. Two of these have been added to a typical Kaveri notebook APU for a total of four physical cores and 4MB of L2 cache.

As we’ve already mentioned numerous times, AMD’s primary focus with Steamroller was to improve performance per watt which was partially achieved by the move to a 28nm manufacturing process. At a more fine-grain level, on-chip computational efficiency was improved by addressing caching accuracy, optimizing branch prediction, and redesigning part of the core’s scheduling routines.

Within the pipeline stages, there has been a 25% increase in potential dispatches per thread and the L1 caching hierarchy has undergone some major refinements in the way it handles store functions. In plain English, these changes allow the x86 cores to be fed with information faster which improves IPC by roughly 10% over Piledriver while also rolling in some impressive single thread performance benefits.

One of the major shortcomings with previous AMD core designs has been their support for (or lack thereof) full speed legacy instruction set execution. While Steamroller doesn’t change the game in this regard, some of the more targeted IPC improvements will benefit these situations.

KAVERI-APU-17.png

Nearly all of the drastic improvements to Kaveri take place within its graphics subsystem. Gone is the VLIW4 architecture of Trinity / Richland and in its place is a fully equipped R-series core which boasts AMD’s second generation GCN architecture. In its maximum layout, it has 512 SIMD cores and 32 texture units spread over eight Compute Units alongside support for DX 11.2, Eyefinity, Mantle and TrueAudio.

In many ways this graphics processor uses a design that’s similar to current R7 260X and R7 250 equipped with the Bonaire core. The only real difference at a graphics processing level is the unified shared memory structure present in Kaveri APUs. This is a quantum leap forward since AMD basically skipped over the Southern Islands architecture and went straight towards their latest design.

When it comes to graphics compute, the R-series cores within Kaveri house features from the higher-end R9 290 parts rather than borrowing from previous generations. Its 8 asynchronous compute engines feature independent scheduling and work item dispatch for efficient multi-tasking and the ability to operate in parallel with the graphics command processor. This will drastically affect how well Kaveri handles tasks like OpenCL workloads and crypto currency mining.

KAVERI-APU-18.png

Drilling down a bit further into those Compute Units, we see each is made up of 64 SIMD cores with local memory and L1 cache in place for optimized processing. There is also a quartet of texture units, though the ROPs reside in a secondary render backend structure.

These Compute Units can be disabled or enabled individually to create new parts. For example, while the FX-7600P features the full allotment of eight CUs, the A10-7400P and A8-7200P have two and four units disabled respectively, resulting in a graphics processor with 384 or 256 cores.
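
The arithmetic behind those core counts is simple enough to sketch. The snippet below derives shader counts from the number of disabled Compute Units and, for the FX-7600P only, a theoretical peak single precision figure. The peak number rests on two assumptions of ours: each core retires one fused multiply-add (two floating point operations) per clock, and the GPU sustains its 686MHz maximum. Real workloads will fall well short of it.

```python
CORES_PER_CU = 64              # SIMD cores per Compute Unit (from the article)
FULL_CU_COUNT = 8              # fully enabled Kaveri mobile GPU
FLOPS_PER_CORE_PER_CLOCK = 2   # assumption: one fused multiply-add per clock

# Compute Units disabled on each "Performance" part, per the article.
DISABLED_CUS = {"FX-7600P": 0, "A10-7400P": 2, "A8-7200P": 4}


def shader_cores(disabled_cus):
    """Cores left after switching off the given number of Compute Units."""
    return (FULL_CU_COUNT - disabled_cus) * CORES_PER_CU


def peak_gflops(cores, clock_mhz):
    """Theoretical peak only; ignores memory bandwidth and clock throttling."""
    return cores * FLOPS_PER_CORE_PER_CLOCK * clock_mhz / 1000.0


cores = {name: shader_cores(d) for name, d in DISABLED_CUS.items()}
fx_peak = peak_gflops(cores["FX-7600P"], 686)

for name, count in cores.items():
    print(f"{name}: {count} cores")
print(f"FX-7600P theoretical peak: {fx_peak:.0f} GFLOPS")
```

We only compute the peak for the FX-7600P because the 686MHz clock is the one GPU frequency quoted in this article.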

KAVERI-APU-19.png

While the rendering hierarchy hasn’t changed much from the first generation GCN Bonaire core (Kaveri’s GPU can only handle one primitive per clock), there have been some major improvements over Richland and Trinity mobile processors. For example, there’s a more robust caching structure, improved geometry shader processing and an updated tessellation engine with off-chip buffering optimizations.

KAVERI-APU-22.png

In an effort to continue their class-leading support for multimedia playback formats, AMD has also updated their Universal Video Decoder and Video Codec Engine. Not only has full 4K support been added but the UVD boasts increased error resiliency when playing back H.264 or AVCHD. Meanwhile, the VCE has seen the lion’s share of improvements with wireless display compatibility and additional H.264 support.
 

hUMA & hQ; Memory & Queuing Go Heterogeneous


While Trinity and its predecessor Llano took the first tentative steps towards a heterogeneous architecture approach, they actually featured very few architectural elements that could be considered next generation in nature. However, Kaveri can be considered the first leap towards the realization of AMD’s long term goals. One of its key elements is the new Heterogeneous Unified Memory Architecture or hUMA which is designed in an effort to eliminate some of HSA’s latent bottlenecks.

HUMA-1.png

While the last 10 years have seen an almost ludicrous theoretical increase in integer and floating point performance on GPUs, these aspects have been by and large stagnating on the CPU side. In essence, the GPU can offer up to ten times the amount of throughput as a high end CPU in highly parallelized workloads. Now that may sound like a significant, nearly unbridgeable chasm between two disparate elements within a system but when working outside of mere theoretical performance, the GPU has ultimately struggled to reach its full potential. We also can’t forget that serial-based workloads are by and large dominated by the CPU.

Outside of its obvious superiority in processing 3D elements within games and OpenGL software, actually harnessing the GPU’s power has always been an issue for programmers. Hence there are very few applications which can make use of the GPU’s compute components. hUMA plans to level that playing field by allowing the graphics-oriented elements of this equation to play a greater role in overall system performance.

HUMA-2.png

hUMA may sound like a new concept but its roots are firmly planted in the past. In many ways this is the next evolutionary step for the Unified Memory Architecture AMD was instrumental in pioneering nearly a decade ago. However, this time, its various architectural enhancements are focused on facilitating the communication between the CPU and GPU.

Instead of the GPU being used for some programs and the CPU for others, with hUMA programmers now have the ability to leverage both at the same time, thus optimizing their respective performance thresholds. More importantly, software won’t have to worry about doing the hand-off since it is being accomplished natively within the APU’s architecture.

HUMA-3.png

As one might expect from its name, hUMA accomplishes its tasks by incorporating broad scale heterogeneous memory integration across the APU’s processing stages. Before hUMA, both the x86 processing cores and graphics subsystem had their own respective memory controllers and addressable memory pools, even within AMD’s Trinity and Llano. In some cases, the amount of memory dedicated to each element could be user modified but for the most part, there was nothing dynamic about it and efficiency was lost.

hUMA on the other hand allows the GPU to enter into the world of unified memory by linking it to the same memory address space as the CPU. This leads to an intelligent computing solution which enables the x86 cores, GPU stages and other sub-processors to work in harmony on a single piece of silicon, within a unified memory space while dynamically directing processing tasks to the best suited processor.

This wasn’t an easy accomplishment by any stretch of the imagination. Not only did AMD have to update the R7-series GPU’s instruction set so it could communicate more effectively with system memory but hUMA also opened up a world of potential issues as programmers came to grips with what could have been a tricky balancing act.

As we’ll talk about a bit later, the programming issue was resolved, resulting in a litany of noteworthy advantages for systems with hUMA. While this approach may not allow a complete unification between a discrete GPU and its associated CPU, it has far-reaching implications for the APU market and its viability against Intel’s Haswell processors.

HUMA-5.png

In systems without hUMA, both processors could be used in parallel but the entire process was inefficient. It involved a game of hot potato where a large amount of data was being copied between two memory address spaces, causing redundancy where AMD felt there shouldn’t be any. In order to facilitate the data handoff, hUMA ensures all of the data is passed in a dynamic form through the uniform memory interface, resulting in quicker information handling. It isn’t completely shutting out the CPU either. Rather, think of this as an on-the-fly load balancing act between two fully integrated system components.
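
To make the “hot potato” problem concrete, here is a toy model in plain Python. It is an analogy only and does not use any HSA or OpenCL API: a `bytearray` stands in for system memory, a simple loop stands in for a GPU kernel, and all function names are our own. The first path copies data into a separate pool and back, the second hands the “GPU” a view of the same memory.

```python
def gpu_double(buf):
    """Stand-in for a GPU kernel: doubles every byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] * 2) % 256


def run_with_copies(cpu_data):
    """Separate memory pools: data is copied to the GPU pool and back."""
    gpu_pool = bytearray(cpu_data)   # copy 1: CPU pool -> GPU pool
    gpu_double(gpu_pool)
    cpu_data[:] = gpu_pool           # copy 2: GPU pool -> CPU pool
    return 2 * len(cpu_data)         # bytes moved between pools


def run_shared(cpu_data):
    """Shared address space: the 'GPU' works on the same memory, no copy."""
    view = memoryview(cpu_data)      # a view of the buffer, not a duplicate
    gpu_double(view)
    return 0                         # nothing was moved


a = bytearray([1, 2, 3, 4])
b = bytearray([1, 2, 3, 4])
copied = run_with_copies(a)
shared = run_shared(b)
print(list(a), copied)  # same result, 8 bytes shuffled around
print(list(b), shared)  # same result, 0 bytes shuffled around
```

Both paths produce identical results; the only difference is the traffic generated along the way, which is exactly the redundancy hUMA is meant to remove.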

AMD has said this approach should simplify the “artistry of programming”, allowing programmers to use their time to deliver the best possible experience to the end user. With that on the table, it isn’t like these processing stages weren’t communicating before. However, now that relationship will be more like close siblings having a friendly chat rather than a divorced couple in the midst of a tug-of-war.

HUMA-6.png

With all of the technical elements pushed aside, what really matters is how this technology will make it into the hands of you and me. Thus far GPU compute has been largely relegated to the sidelines since programmers need special languages, tools and memory models to unlock and access its performance capabilities.

One of the main goals here is to get the buy-in of developers. Without software that supports hUMA, it’ll quickly become yet another standard which was cast aside before fully realizing its potential. In order to accomplish this sometimes hard to attain stamp of approval from the development community, AMD has ensured programming for hUMA-based systems is as efficient as possible. It is fully compatible with industry-standard programming languages like C++, .NET and Python, ensuring the developer community can use existing methodologies in order to attain optimal results.

KAVERI-APU-21.png

When designing hUMA, AMD asked a simple question: how do we leverage the relative strengths of our APUs without reinventing the wheel? By creating a direct link between the CPU and GPU they may well have accomplished just that. Instead of a Berlin-wall like partition between these architectural elements, future APUs will be able to dynamically distribute tasks to the best suited co-processor in a way that’s completely transparent to the end user.

Naturally, this all hinges on acceptance from the developer community but with their streamlined use of industry-standard programming languages, AMD seems to have that base covered perfectly. With hUMA in place, hopefully Kaveri will be given its chance to shine.


Heterogeneous Queuing


With the convergence of CPU and GPU workloads in Kaveri, a certain amount of resource sharing has to happen behind the scenes. While we have already covered AMD’s Heterogeneous Unified Memory, there’s another technology being developed to balance workloads: it’s called Heterogeneous Queuing.

KAVERI-APU-13.png

In its most basic form, Heterogeneous Queuing (hQ) defines how processors interact equally across a general address space while accessing a common resource pool. The last thing AMD needed was for the two primary elements of their new architecture to continually fight for the same on-die capital.

With Heterogeneous Queuing in place, AMD has added system-level atomics for synchronizing workloads across the different cores so the GPU and CPU have equal flexibility to create and dispatch workloads. As with all other mixed functions, this resource sharing only happens in supported accelerated applications which use OpenCL or other compatible programming languages.
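
The “equal flexibility to create and dispatch workloads” idea can be sketched with a toy model. This is not the HSA queuing interface: the `Agent` class and task names are hypothetical, and the example runs on a single thread so it sidesteps the atomics a real implementation needs. What it does show is either side placing work directly into the other’s queue.

```python
from collections import deque


class Agent:
    """A processor with its own work queue that any other agent may add to."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.log = []

    def dispatch(self, target, task):
        """Enqueue a task directly on the target, with no intermediary."""
        target.queue.append(task)

    def drain(self):
        """Run queued tasks in order until the queue is empty."""
        while self.queue:
            task = self.queue.popleft()
            task(self)


cpu, gpu = Agent("cpu"), Agent("gpu")


def postprocess(agent):
    agent.log.append("postprocess")


def kernel(agent):
    agent.log.append("kernel")
    agent.dispatch(cpu, postprocess)  # the GPU creates work for the CPU itself


cpu.dispatch(gpu, kernel)  # the CPU creates work for the GPU
gpu.drain()
cpu.drain()
print(gpu.log, cpu.log)  # prints "['kernel'] ['postprocess']"
```

In the traditional model only the CPU would be allowed to call `dispatch`; here the GPU-side task queues its own follow-up work.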

Unfortunately, neither of these critical features will be enabled at launch. Simply put, the software necessary for support isn't ready on AMD's part and compatible applications are non-existent.
 

TrueAudio & Kaveri; An Audio Masterpiece?


When we think of gaming in relation to graphics cards, the first thing that likely comes to mind will be in-game image fidelity and how quickly a given solution can process high graphical detail levels. Realism and player immersion are only partially determined by how “good” a game looks and there are many other factors that contribute to how engaged a player will be in a game. Unfortunately, in the grand scheme of game design and a push towards higher end graphics, the soundstage is often overlooked despite its ability to define an environment and truly draw a gamer in.

Multi channel positional audio goes a long way towards player immersion but the actual quality produced by current solutions isn’t usually up to the standards most expect. We’ve all heard it time and again: a multitude of sounds which get jumbled together or a simple lack of ambient sound with the sole focus being put on the player’s gunshots or footsteps. Basically, it’s almost impossible to find a game with the high definition, visceral audio tracks found in today’s Hollywood blockbusters despite the fact that developers sink hundreds of millions into their titles.

R7260X-REVIEW-21.png

The lack of developer generated, high quality audio tracks isn’t for lack of trying. Indeed, the middleware software and facilitators are already present in the marketplace but developers have a finite amount of CPU resources to work with. Typically those CPU cycles have to be shared with primary tasks such as game world building, compute, A.I., physics and simply running the game’s main programming. As you might expect, audio processing is relatively low in the pecking order and rarely gets the reserved CPU bandwidth many think it deserves. This is where AMD’s TrueAudio gets factored into the equation.

While sound cards and other forms of external audio renderers can take some load off the processor’s shoulders, they don’t actually handle the lion’s share of actual processing and sound production. TrueAudio on the other hand remains in the background, acting as a facilitator for audio processing and sound creation and allows for ease-of-use from a development perspective, thus freeing up CPU resources for other tasks.

TrueAudio’s stack provides a highly programmable audio pipeline and allows for decoding, mixing and other features to be done within a versatile environment. This frees programmers from the constraints typically placed upon audio processing during the game creation process.

In order to give TrueAudio some context, let’s compare it to graphics engine development. Audio engineers and programmers usually record real-world sounds and then mix them down or modify layers to create a given effect. Does the player need to hear a gunshot at some point? Record a gunshot and mix accordingly. There is very little ground-up environmental modeling like game designers do with triangles and other graphics tools.

TrueAudio on the other hand allows audio teams to get a head start on the sound development process by creating custom algorithms without having to worry about CPU overhead. As a result, it could allow for more audio detailing without running headfirst into a limited allocation of processor cycles.

R7260X-REVIEW-15.png

According to AMD, one of the best features of TrueAudio is its transparency to developers since it can be accessed through the exact same means as the current audio stack. There aren’t any new languages to learn since it can be utilized through current third party middleware programs, making life for audio programmers easier and allowing for enhanced artistic freedom.

TrueAudio’s position within the audio stack enhances its perception as a facilitator since it runs behind the scenes, rather than attempting to run the show. Supporting game audio tracks are passed to TrueAudio, processed and then sent back to the main Windows Audio stack so it can be output as normal towards the sound card, USB audio driver or via the graphics processor’s HDMI / DisplayPort. It doesn’t take the place of a sound card but rather expands the possibilities for developers and works alongside the standard pipeline to ensure audio fidelity remains high.

R7260X-REVIEW-22.png

TrueAudio is implemented directly within supporting Radeon graphics cards (the R7 260X, R9 290 and R9 290X) and Kaveri APUs via a set of dedicated Tensilica HiFi EP audio DSP cores housed on the die. These cores will be dedicated to in-game audio processing and feature floating point as well as fixed point sound processing which gives game studios significantly more freedom than they currently have. It also allows for offloading the processing part of audio rather than remaining tied at the hip to CPU cycles.

To ensure quick, seamless routing and bridging, the DSPs have rapid access to local-level memory via onboard cache and RAM. There’s also shared instruction data for the streaming DMA engine and other secondary audio processing stages. More importantly, the main bus interface plugs directly into the high speed display pipeline with its frame buffer memory for guaranteed memory access at all times.

While TrueAudio ensures that processing can be done on dedicated DSP cores rather than on the main graphics cores, there can still be a CPU component here as well since TrueAudio is simply supplementing what the main processor is already tasked with doing. In some cases, these CPU algorithms can build upon the TrueAudio platform, enhancing audio immersion even more.

R7260X-REVIEW-23.png

One of the primary challenges for audio engineers has always been the creation of a three dimensional audio space through stereo headphones. In a typical setup, the in-game engine does the preliminary processing and then mixes down the tracks to simple stereo sound. Additional secondary DSPs (typically located on a USB headphone amp) then render the track into a virtual surround signal across a pair of channels, adding in the necessary reverberations, separation and other features to effectively “trick” a user into hearing a directionally-enhanced soundstage. The end result is typically less than stellar since the sounds tend to get jumbled up due to a lack of definition.

TrueAudio helps virtual surround sound along by offering a quick pathway for its processing. It uses a high quality DSP which ensures individual channels can be separated and addressed with their own dedicated, primary pipeline. AMD has teamed up with GenAudio to get this figured out and from presentations we’ve seen, it seems like they’ve made some incredible headway thus far.

R7260X-REVIEW-16.png

While nothing has to be changed from a developer standpoint since all third party applications and runtimes can work with TrueAudio, this new addition can be leveraged for more than just optimizing CPU utilization. Advanced effects, a richer soundstage, clearer voice tracks and more can all be enabled due to its lower overhead and broad-ranging application support. In addition, mastering limiters can allow for individual sounds to come through without distortion.

Unlike some applications, TrueAudio isn’t an end-all-be-all solution since it can be used to target select, high bandwidth streams so not all sounds have to be processed through it. AMD isn’t cutting the CPU out of this equation and that’s important as they move towards a heterogeneous computing environment.

R7260X-REVIEW-14.png

As with all new initiatives, the failure or success of TrueAudio will largely depend on the willingness of developers to support it. While it feels like we've been down this road before with HD3D, Bullet Physics and other AMD marketing points from years past that never really got off the ground, we feel like TrueAudio can shine. Developers are already onboard and AMD has gone to great pains to make its development process easy.

Audio is one of the last frontiers that hasn’t been already addressed. Anything that improves the PC audio experience is welcome but don’t expect TrueAudio to work miracles. It will still only be as good as the end point hardware (in this case your headphones and associated sound card) but it should allow better speaker setups to shine, taking immersion to the next level. That’s a big deal for entry-level notebook APUs.
 

Mantle Comes to Mobile APUs


In order to understand where Mantle is coming from, we need to go back in time and take the Playstation 3 as an example of how AMD wants to change the way games interact with a PC’s graphics subsystem. While the PS3’s Cell processor and its associated graphics core are extremely hard to program for, games like Uncharted 3 and The Last of Us boast visuals that are equal to if not better than some of today’s newest PC games which run on hardware that was unimaginable when Sony launched their console.

So how was this seemingly impossible feat accomplished? Consoles give developers easier access to the graphics subsystem without messy driver stacks, loads of API overhead, a cluttered OS and other unnecessary eccentricities eating up valuable resources. As a result, console games are able to fully utilize a given resource pool and allow programmers to do more with less. In some cases (the PS3 is another excellent example of this) the flow towards true utilization takes a bit longer as programmers have to literally relearn how to approach their trade but AMD's focus here is to streamline the whole process.

MANTLE-4.jpg

Mantle has been created to reduce the number of obstacles placed before developers when they’re trying to create new PC titles or port games over from consoles. In the past, things like CPU optimizations and efficient inter-component communication have largely been pushed aside as developers struggled to come to grips with the wide range of PC hardware configurations being used. This leads to multi core CPUs remaining idle, the GPU’s on-die resources being wasted and a real lack of optimal performance conditions on the PC, regardless of its advanced hardware.

There’s also a very heavy software component when programming for the PC environment since developers routinely have to contend with a predominantly heavy driver stack and slowly evolving primary level software. That’s a problem since it leads to the software / memory interaction becoming a rather stringent traffic light, bottlenecking the flow of information between the CPU and GPU, limiting throughput.

DirectX 10 and DX11 have gone a long way towards addressing some of these roadblocks but their overall performance is still hindered by their high-level nature. They keep communication between the API, GPU, game and CPU under strict control, something developers don’t want to wade through. When using them, transmitting a large number of draw calls leads to a CPU bottleneck, meaning today’s graphics architectures can never realize their full potential.

MANTLE-1.jpg

This is where Mantle gets factored into the equation; not as a direct replacement for DirectX or OpenGL but rather as a complementary force. It’s an API that focuses on “bare metal”, low level programming with a thin, lightweight driver that effectively manages resource distribution, grants additional control over the graphics memory interface and optimizes those aforementioned draw-calls. Think of Mantle like a low level strafing run that targets key components rather than high level carpet bombing that may or may not achieve a given objective.

With a more direct line of access to the GPU, AMD is hoping that GCN’s performance could drastically increase through rendering efficiencies rather than having to throw raw horsepower at problems. Opening up new rendering techniques which aren’t tied at the hip to today’s primary APIs is also a possibility. Theoretically, this could allow Mantle to process a ninefold increase in draw-calls and more importantly, it will ensure optimizations can be carried over from the console version of a game to the PC and vice versa.

MANTLE-2.jpg

There are some notable speedbumps to this approach as well. While the high-level API (in this case DirectX / Direct3D) will remain the same across multiple hardware and product classes, Mantle is only compatible with GCN. This is great for Kaveri since it houses a GCN-based graphics processor within its confines.

It goes without saying that AMD has won the next generation console race with Jaguar-based APUs in both the Xbox One and PS4, so leveraging those design wins is an integral part of their future strategy. But very little has been said about the high-level and lower-level APIs being used within those products, primarily the Xbox One. Direct3D 11.2 is a given but no one could point a finger at the low-level API. Microsoft has been forthcoming by saying that it isn't Mantle but the inclusion of native DirectX HLSL compatibility could go a long way towards making AMD’s cross-platform dreams come true.

In many ways, this approach reminds us of 3dfx’s Glide, another low-level application programming interface developed years ago but doomed to failure due to a lack of developer support and its parent company’s eventual demise.

KAVERI-APU-24.png

Mantle is particularly important for APUs like Kaveri since it offers a significant performance boost in supporting titles. It can take an A8-7600’s mediocre in-game framerates and bring them up to the mid-level discrete threshold, providing a playable experience at reasonable detail settings. For example, AMD claims they’ve realized a 45% jump in Battlefield 4 performance by switching to Mantle while other games can benefit even more.
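
For a sense of scale, the short Python sketch below applies AMD’s claimed 45% uplift to a few baseline framerates. The baselines are hypothetical numbers chosen for illustration; AMD did not provide the underlying framerates here, and the helper name is our own.

```python
def apply_uplift(base_fps, uplift_percent):
    """Return the framerate after a percentage performance increase."""
    if base_fps < 0:
        raise ValueError("base_fps must be non-negative")
    return base_fps * (1.0 + uplift_percent / 100.0)


CLAIMED_UPLIFT = 45.0  # AMD's claimed Battlefield 4 gain with Mantle, in percent

# Hypothetical baseline framerates, not taken from any benchmark.
for base in (20.0, 25.0, 30.0):
    boosted = apply_uplift(base, CLAIMED_UPLIFT)
    print(f"{base:.0f} fps baseline becomes roughly {boosted:.1f} fps")
```

Whether a jump of that size crosses the line into playable territory depends entirely on where the baseline sits, which is why the missing starting framerates matter.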
 

Benchmarks & Preliminary Thoughts


At this point in time the amount of information we have about mobile Kaveri’s performance is slim to none, with the only available benchmarks coming directly from AMD. That means they get to pick and choose scenarios where their architecture is able to shine, in other words those with increased GPU processing. Make no mistake about it though; many of today’s most popular programs are going the GPU / x86 core co-processing route so the increased compute horsepower AMD brings to the table will be a welcome addition.

1.png

The first set of benchmarks consists of 3DMark, PCMark 8 and the compute-focused BasemarkCL. Here we see AMD’s ultra low voltage lineup remaining consistently ahead of Intel’s comparable U-series Haswell chips. However, what this doesn’t show is comparable performance in CPU-related tasks like spreadsheet tabulation, image manipulation or media conversion using applications that don’t support GPU acceleration. In those situations, the Kaveri chips will likely be trailing by a significant margin.

While it may sound like we don’t have high expectations for Kaveri as a mobile platform, that couldn’t be further from the truth. What the results above point towards is an important ray of hope for AMD and their aspirations within the mobile market. The FX-7500 is able to compete against one of Intel’s best ultra low voltage processors and that’s something impressive for an architecture which was considered the underdog until not that long ago.

2.png

Since the commercial ULV APUs nearly mirror the specifications of their mass market siblings, performance is quite similar, as is the immediate competition. The only oddity here is the A6-7050B which is the lowest-end product in AMD’s mobile Kaveri lineup. It lines up somewhere between the Core i3-4010U and the entry level Pentium 3556U. Whether or not these metrics will find clients within their intended market is anyone’s guess at this point but Kaveri does seem to have potential.


Preliminary Thoughts


Kaveri’s entry into the mobile / notebook market has been a long time coming but the long and drawn out process shouldn’t dim anyone’s excitement. With a combination of fast-enough x86 performance, class-leading compute potential and some worthwhile new features, the new APUs could prove to be a hit.

We do however have some initial reservations about the future plans for this platform since many of its claimed benefits hinge on matters that are beyond AMD’s direct control. Mantle, TrueAudio, hUMA, hQ and a litany of other features depend upon developers to create programs which properly harness their dormant potential. AMD has adopted a “build it and they will come” mentality and thus far, software hasn’t quite kept pace. Without that key ingredient, the carefully built Heterogeneous System Architecture infrastructure will remain largely underutilized.

Unfortunately, as you probably already noticed, there were two glaring omissions from Kaveri’s information: hardly any talk about comparative battery life improvements versus the outgoing Richland architecture and a complete avoidance of the P-series’ performance. AMD should be flying the FX-7600P’s flag high but that didn’t happen. Instead, plenty of highlights were provided of the ULV parts.

There are quite a few rays of hope here as well. The reincarnation of AMD’s FX-series is a step in the right direction even though overclocking won’t be available on these particular SKUs. They represent a return to competitiveness for the APU lineup. There are even some preliminary design wins with the Lenovo Z-series and Acer Aspire E5-521 going the Kaveri route for at least some of their configurations.

With Beema and Mullins taking care of things on the slim and light front, Kaveri notebook APUs are poised for a successful launch into other segments. From all indications, AMD could have a winner on their hands provided enough developers are able to utilize the architecture’s HSA feature set. If that doesn’t happen Kaveri may end up relegated to the same low-end notebooks even though it should be able to achieve so much more. We’ll keep our fingers crossed.
 
