ECC Memory & AMD's Ryzen - A Deep Dive

MAC

Associate Review Editor
Joined
Nov 8, 2006
Messages
1,141
Location
Montreal
One of the more interesting aspects of the pre- and post-Ryzen launch was how many people were wondering about support for ECC memory, otherwise known as error-correcting code memory. Every forum had at least one thread and most articles with a comments section had people talking about this relatively rare feature. It all started when leaks and then online product pages for AM4 motherboards appeared, since their specifications lists showed support for both ECC and non-ECC DDR4 memory.

The reason that ECC is much more popular among AMD consumers is that it's a feature that the company has never blocked. While Intel artificially limits ECC support to pricey Xeon processors or low-end Pentium models that no one would ever use for serious work, AMD have always quietly supported ECC memory. It might not be advertised, and it's certainly not validated, but it's been there on AM1/AM2/AM3 platforms for the adventurous to make use of and there's a die-hard contingent that do.


While we are not going to be talking about the merits of ECC memory, nor going into great technical detail about how it works - Crucial has a concise article available - the primary thing that ECC memory protects you from is what is known as a single-bit error (one DRAM bit cell that is stuck or flipped), due to cosmic radiation, electromigration, physical contact issues, or just some type of hardware failure. Uncorrected, bit flips can cause programs to terminate, they can result in errors in output data, or they can have no impact at all if the flipped bit is in a fortunate place. Nowadays, you also have to worry about a row hammer attack, which is a rapid series of malicious memory accesses that can cause bit flips in adjacent rows and potentially give a skilled hacker read-write access to all physical memory. ECC can help protect against that.

There are perfectly valid reasons for home-users to be using ECC; some people are running a DIY NAS, others a home lab, some dabble with scientific tasks like LAMMPS, while others just do work that is personally important, and they are all willing to spend a small premium to ensure the reliability of their data and of their results. Those who run DIY NAS software - like FreeNAS - prefer to use ECC memory since it is one element that can help ensure file integrity. Some people believe that ECC is especially important when using file systems like ZFS or btrfs that perform 'scrubbing', which is a process that reads all of the data and metadata to see if it still matches the file system checksum. If errors have been introduced, and the checksums no longer match, things can go sideways in a hurry. It is exceedingly rare though, and similar data corruption can happen on any file system.

Now the whole Ryzen + ECC question reached peak hype levels when AMD did an AMA on Reddit the day of the Ryzen 7 launch, and three AMD bigshots weighed in on the ECC issue:






Dr. Lisa Su is the President and CEO of AMD, Robert Hallock is Head of Global Technical Marketing, and James Prior is a CPU Product Manager. Clearly, these were three people that knew what they were talking about when it comes to Ryzen.

Based on the above, it is pretty clear that AMD has not done any quality assurance testing, and they don't want to have to provide any official support for ECC on this mainstream consumer platform. However, we would be surprised to not see them validate it on the upcoming high-end desktop platform (HEDT), which will obviously have workstation roots. Instead of pulling an Intel move and simply disabling ECC support altogether, they have opted to allow motherboard manufacturers to implement the feature as they see fit.

While that Reddit AMA confirmed many people's hopes and dreams, it also created a bunch of other questions, and that's where we decided to step in. On a side note, you will want to check out the numerous hyperlinks that we embedded throughout this article, since most of them are zoomable images, while others are just useful links.

We reached out to Crucial - the consumer arm of semiconductor giant Micron Technology - to see if they could hook us up with some ECC memory, and look what we received in the mail:


These are two modules of Crucial CT8G4WFS824A unbuffered ECC memory. They are based on the ECC UDIMM form factor - otherwise known as EUDIMM - with memory speeds of DDR4-2400, timings of 17-17-17-39, and a default voltage of 1.20V. We specifically requested these modules because they were faster than regular DDR4-2133 ECC modules, and because they were single rank. While using dual rank modules would have worked as well, we just wanted to ensure guaranteed compatibility. Crucial does offer larger dual rank 16GB DDR4-2400 ECC modules, as well as faster dual rank 8GB DDR4-2666 ECC modules, though it seems unlikely that you could run four modules at that latter speed given Ryzen's current (but constantly improving) memory limitations.

As you can see, these modules are manufactured with Micron D9TBH ICs, which aren't exclusively used on ECC memory modules. You can actually find them in various conventional DDR4-2400 memory kits. However, what makes these Crucial ECC modules special is that the ICs are binned better, they have embedded thermal sensors that popular programs like AIDA64 and HWiNFO can read, and they feature one additional IC compared to the usual eight. The extra IC is what makes ECC possible. Whereas a standard memory module is 64 bits wide (8 x 8-bit), an ECC module is 72 bits wide (9 x 8-bit), and those extra 8 bits are used by the error detection and correction mechanisms to catch and correct the single-bit errors that we discussed above.
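The jump from 64 to 72 bits is not arbitrary. As a rough sketch, assuming the textbook Hamming-based SECDED scheme (single-error correction, double-error detection) rather than whatever code a given memory controller actually implements, the number of check bits can be counted as follows. The function name is ours and purely illustrative.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a textbook Hamming SECDED code over `data_bits` data bits."""
    r = 0
    # Hamming bound for single-error correction: 2**r >= data_bits + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit adds double-error detection.
    return r + 1


if __name__ == "__main__":
    for width in (8, 32, 64):
        extra = secded_check_bits(width)
        print(f"{width} data bits + {extra} check bits = {width + extra} total")
```

For 64 data bits this works out to 8 check bits, which lines up with the 72-bit width and the ninth x8 IC found on these modules.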


As you can see in the screenshots, these modules will boot up with 17-17-17-39-56-1T timings, or more accurately 17-17-17-17-39-56-1T timings since that's how timings are presented in the UEFI on the ASRock X370 Taichi motherboard that we used for this article. There are at least five JEDEC DDR4-2400 profiles loaded onto these modules to ensure maximum compatibility, so no matter what you will be able to run at that frequency.
 
Understanding Motherboard & UEFI Limits



Now that we have confirmed that the processors support ECC, it's time to talk a bit about the role that the motherboard and the UEFI play. The star of this page is going to be the ASRock X370 Taichi that we just recently reviewed.


ASRock X370 Taichi memory slot traces

As we explained on the previous page, AMD platforms have historically always supported ECC. The memory controllers had the feature enabled, and all that really mattered on the motherboard side was the presence of check bit traces between the memory slots and the processor that could route the ECC signals. Thankfully, the motherboards pretty much always were equipped with those traces, even on models that never advertised support for ECC memory. Based on the handful of AM4 motherboards that we have inspected, that once again appears to be the case.

ASRock X370 Taichi ECC-related settings - BIOS Version 1.55 Beta

Once you have an ECC-enabled memory controller, a motherboard with the right traces, and a few sticks of ECC memory, the next step is whether the BIOS/UEFI properly supports ECC. This is where things start getting a little bit iffy. AMD placed all the responsibility for ECC support on the motherboard manufacturers, and they aren't really willing to step up to the plate and assume that responsibility...you will find out why in the conclusion. As a result, while most motherboard manufacturers have now come to acknowledge that their motherboards are indeed ECC enabled, that is the extent of their involvement. Not one is offering an enable/disable option in the UEFI, and we haven't seen anyone but ASRock and ASUS have any ECC settings available at the moment.

Given the fact that they are ahead of most of the competition in this area, using the ASRock X370 Taichi for this article was an easy decision. While we did not have to manually adjust any of the three ECC-related settings (DRAM scrub time/Redirect scrubber control/Data Poisoning) for this article, their presence gives us an idea of what settings the other motherboard manufacturers could be unlocking.

This lack of settings severely hampers the overall ECC functionality, since a big part of it is that the motherboard should be able to log errors. Right now, no such logging capability exists. Thankfully, there is a possible software solution. The operating system - if it fully supports this new AM4 platform - should have the ability to log errors and corrections. If it does not, the hardware might be silently correcting single-bit errors and even detecting 'catastrophic' two-bit errors, but you will never know about it since there will be no log. That's what we are going to look into next.

To conclude this page, we strongly suspect that just about every AM4 motherboard has ECC enabled, or at the very least will in the future. Most motherboard manufacturers certainly aren't actively supporting it, or even unlocking any of the features that accompany it, but they don't appear to be maliciously disabling it either. At this point in time, they simply have far more important things on their plate, like improving memory support, overclocking, ensuring that IOMMU is functional, etc. Furthermore, we strongly suspect that they are presently unable to unlock all of the necessary settings without a newer CPU microcode from AMD.
 
Looking for Answers in Windows



With the presence of ECC on the processors confirmed, and the motherboard manufacturers not actively disabling it, it was time to see what Windows 10 Pro had to say about whether ECC was actually enabled on this new AM4 platform. As we established on the previous page, given the lack of motherboard-based ECC logging, we have to rely on the operating system to do so.

We started off with two popular commands, both of which can be run in command prompt:


The first command (wmic memphysical get memoryerrorcorrection) essentially queries Windows as to whether it detects any form of memory error-correcting code functionality. Regrettably, 3 signifies "none". The other options could have been 2 (unknown), 4 (parity), 5 (single-bit ECC), or 6 (multi-bit ECC).

The following command (wmic memorychip get datawidth, totalwidth) queries Windows as to what the per-channel memory interface is. If the TotalWidth value is larger than the DataWidth value, then ECC is enabled. Ideally, we would have liked to see 72 for the TotalWidth instead of 64, since that would have indicated that the OS detects a 72-bit memory bus width, which as we discussed on the first page is the norm for ECC memory. Once again, no such luck.
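For anyone scripting these checks, the interpretation of the two results can be sketched as below. This is only an illustration: the helper names are ours, the code table contains just the values listed above, and the functions take the numbers as arguments rather than calling wmic themselves.

```python
# MemoryErrorCorrection codes as listed in the article; anything else is
# reported as "unrecognized" rather than guessed at.
ERROR_CORRECTION_CODES = {
    2: "unknown",
    3: "none",
    4: "parity",
    5: "single-bit ECC",
    6: "multi-bit ECC",
}


def describe_error_correction(code: int) -> str:
    """Translate a MemoryErrorCorrection value into a readable label."""
    return ERROR_CORRECTION_CODES.get(code, f"unrecognized ({code})")


def extra_bits(data_width: int, total_width: int) -> int:
    """Bits reported beyond the data bus; 8 would suggest a 72-bit ECC bus."""
    return max(total_width - data_width, 0)


def reports_ecc(code: int, data_width: int, total_width: int) -> bool:
    """True if either of the two wmic results points to ECC."""
    return code in (5, 6) or extra_bits(data_width, total_width) > 0


if __name__ == "__main__":
    # The values we saw on the Ryzen system: code 3, 64-bit data, 64-bit total.
    print(describe_error_correction(3), reports_ecc(3, 64, 64))
```

A negative answer from this kind of check only means that Windows is not reporting ECC through these two queries, not that ECC is necessarily inactive.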

Now while that might seem like a terrible start, the truth of the matter is that despite being popular, both of the commands have proven themselves to be exceptionally unreliable on newer platforms. Similar results have been returned to those running Intel Xeon E3-1200 v4 and E3-1200 v5 workstations, despite the fact that those platforms were running ECC memory and absolutely had ECC fully enabled.

Therefore, it was time to see what some popular diagnostics/system information programs had to reveal:


Despite the fact that the SPD tab in CPU-Z hasn't properly reported correction in quite a while, this popular utility does have a handy system report feature that few people are aware of. We were hoping that it might provide us with some different type of information, but regrettably it also just showed that there was no memory correction enabled.

Next up was the very latest 5.47-3125 Beta version of HWiNFO64:



While HWiNFO did detect that our Crucial memory modules were in fact ECC capable, it still reported that the system did not have any type of memory correction enabled.

Another failure, so on to the always popular AIDA64, in this case AIDA64 Engineer version 5.80.4098 Beta:


As in the case of HWiNFO64, when we looked at the memory arrays, AIDA64 also reported that there was no error correction enabled.

However, we went through every possible category and found this in the chipset sub-menu:


Eureka! The first confirmation. Could it be wrong? Perhaps, so we installed a non-ECC memory kit and as expected AIDA64 reported ECC as disabled. Not only that, but if you pay attention to the DRAM Scrub Rate, the rate on the ECC memory is almost 500 times slower than on the non-ECC kit. Just to be clear, the scrub rate is not affected by differences in memory speeds or timings. Clearly, that gives us an indication that something is happening in the background. If you don't know what memory scrubbing is, there is a basic Wikipedia page available that explains it.

We also tried an even newer AIDA64 version (5.90.4200 stable), but the results were the same.

Our next attempt was just a shot in the dark:


Just for the hell of it, we loaded up Windows Server 2016 in the slight hope that an operating system that was designed to be used on exclusively ECC-enabled hardware might produce different results. We tried all of the same Windows commands, as well as all the same applications, and they all reported the same information as Windows 10 Pro.

Based on just the above information, it would not be prudent to say that ECC is enabled on this platform - at least in Windows - but clearly something is going on given the scrub rate. We also can't yet say whether Windows has the ability to log errors and corrections...but you will want to keep reading until the conclusion.

Since our efforts in Windows proved to be less than entirely fruitful, it was time to turn to Linux, which can be surprisingly quick at supporting new hardware due to its lack of stifling corporate bureaucracy.
 
Turning to Linux



The very latest iteration of the Linux kernel - version 4.10 - fully supports AMD's new Ryzen processors, and arguably the easiest way to take advantage of that kernel is by installing Ubuntu 17.04 (Zesty Zapus). We are also once again fortunate to be using an ASRock X370 Taichi since we were not able to load Ubuntu on a GIGABYTE AX370-Gaming 5.

Since we are interested in determining whether ECC is functional on this new AM4 platform, the next step was to install edac-util, which is an incredibly useful program that reads and reports error detection and correction (EDAC) information. Specifically, this program can tell you if ECC is enabled, and if it is it will report any corrected error (CE) or uncorrected error (UE).

Let's see what we can find:


Success! One memory controller with ECC functionality detected, and no errors to report. We can now confirm that ECC is enabled in Linux.
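The corrected and uncorrected error counters come from the kernel's EDAC subsystem, which also exposes them as plain files under sysfs. As a minimal sketch, assuming the EDAC driver for the platform is loaded, the same counts can be read directly. The function name and the returned dictionary layout are our own choices for illustration.

```python
from pathlib import Path

# Standard location of the kernel's EDAC memory-controller nodes in sysfs.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_MC_ROOT) -> dict:
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory found."""
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found")
    for name, c in found.items():
        print(f"{name}: {c['ce']} corrected, {c['ue']} uncorrected")
```

An empty result means that no EDAC memory controller was registered, which is what we would expect on a system where the kernel does not consider ECC to be enabled.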

For more information, we decided to use the dmesg (display message or driver message) command to see what the kernel and/or kernel modules had to say about ECC:


As you can see clear as day: "DRAM ECC enabled". While some of those RAM parameters are obviously not being read correctly, the mention of "x8 syndromes" is another confirmation since syndromes are the eight extra bits that are used for the error detection and correction process.

Just to be thorough, we installed non-ECC memory, ran the same command, and got an ECC disabled message.

We also tried the popular dmidecode utility (sudo dmidecode -t memory), but since it's not EDAC aware it did not detect ECC.

While this all looks incredibly positive, the next step is obviously to determine whether error detection and correction is actually working, and to what extent it is working. Will the operating system detect and log errors? Will the hardware function together to correct single-bit errors and ideally halt the system when there's a two-bit error? That is what we are going to find out next.
 
Success....Sort Of



Now everything seems to be lining up nicely with respect to ECC being enabled on this new AM4 platform, but without verifying that the operating systems are logging corrections and errors, we can't make a definitive claim. The hardware might be silently correcting single-bit errors and even detecting 'catastrophic' two-bit errors, but there's no way to know without a log.

With this in mind, we decided to intentionally cause some memory errors to see if the overall error detection and correction capabilities of this platform were actually functioning as intended. Since we don't have our own particle accelerator with which to bombard the memory modules and cause radiation-based errors, we settled on a simpler solution: overclocking. Instead of pushing up the memory frequency, we decided to cause instability by tightening the timings. Much to our surprise, the Crucial DDR4-2400 ECC modules proved to be quite remarkable. In order to cause instability we had to tighten the timings from 17-17-17-17-39 all the way down to 14-14-12-11-21. Sticking to Ubuntu for now, we ran the 'Stress' utility (sudo apt-get install stress) to stress the memory with 50 processes, each requiring 256MB, for a grand total of 12.8GB of heavy RAM usage.
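The exact invocation is not reproduced here, so treat the following as an assumption: a sketch of how a run with 50 memory workers of 256MB each could be expressed using the --vm and --vm-bytes options of stress, along with the arithmetic behind the total. The helper names are hypothetical, and the script only prints the command instead of running it.

```python
def stress_vm_command(workers: int, mb_per_worker: int) -> list:
    """Build a `stress` command line that spawns memory workers."""
    return ["stress", "--vm", str(workers), "--vm-bytes", f"{mb_per_worker}M"]


def total_footprint_mb(workers: int, mb_per_worker: int) -> int:
    """Combined allocation of all memory workers, in MB."""
    return workers * mb_per_worker


if __name__ == "__main__":
    workers, size_mb = 50, 256
    print(" ".join(stress_vm_command(workers, size_mb)))
    total = total_footprint_mb(workers, size_mb)
    print(f"{total} MB total (~{total / 1000:.1f} GB)")
```

Fifty workers at 256MB each add up to 12,800MB, which is the 12.8GB figure quoted above and fits within the 16GB provided by the two 8GB modules.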

What happened next? Well as terrible clickbait headline writers across the internet would say, the results will shock you:


Corrected error, corrected error, corrected error, ...UNCORRECTED ERROR.

So we have good news and we have bad news. The system had absolutely no problems detecting and correcting single-bit errors, otherwise known as soft errors. It corrected literally hundreds of single-bit soft errors without missing a beat. This is why people invest in an ECC-capable platform, so that corrected error (CE) events don't corrupt any data and they don't bring down the system. These single-bit errors are detected, corrected, and logged, usually by both the motherboard and the operating system, but right now only by the operating system.

HOWEVER, things are not quite perfect. On that last line you will notice "1 UE". That is an uncorrected error (UE), otherwise known as a two-bit error or a hard error. Two-bit errors cannot be corrected by ECC memory. What is supposed to happen when they occur is that they should be detected, logged and ideally the system should be immediately halted. These are considered fatal errors and they can easily cause data corruption if the system is not quickly halted and/or rebooted. Regrettably, only 2 of the 3 steps happened. The hard error was detected and it was logged, but the system kept running. The only reason that it's the last line on that image is because we immediately took a screenshot just in case the system would halt, but that never happened.
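To illustrate why a single flipped bit can be repaired while two flipped bits can only be flagged, here is a toy SECDED example. It is a sketch built on the textbook extended Hamming (8,4) code, which protects 4 data bits with 4 check bits. It is not the 72-bit code used with real ECC modules, nor a description of how Ryzen's memory controller works, and the function names are ours.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1-7 are Hamming positions.
    """
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits: list) -> tuple:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, and the syndrome names its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    for label, w in (("clean", word), ("1 flip", one_flip), ("2 flips", two_flips)):
        print(label, decode(w))
```

With one flipped bit the syndrome points at the bad position and the data is recovered, which corresponds to a CE. With two flipped bits the decoder knows that something is wrong but cannot tell which bits to fix, which is why a UE should ideally end with the system being halted.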

Since we have now definitely established that some form of ECC is functional, let's move back to Windows 10 Pro and see how that operating system handles memory instability. As you will remember, nothing but AIDA64 suggested that ECC was enabled in Windows. However, by opening up the Event Viewer, and keeping an eye out for any Windows Hardware Error Architecture (WHEA) errors, we might be able to determine whether ECC is actually working:



Possibly a two-bit error / hard error

Possibly many single-bit errors / soft errors

Clearly, Windows 10 is able to tap into Ryzen's error detection capabilities to report on Machine Check Exceptions (MCE). An MCE indicates that the processor has detected a hardware problem. However, they can often be triggered by ECC-enabled caches, and not necessarily by ECC memory. Since Ryzen processors have internal ECC detectors for the L1, L2 and L3 caches, and we can't separate the cache from the memory, it is hard to determine which component is actually causing the errors. Logic dictates that it is the memory since we are intentionally causing memory instability. If the MCE was caused by a two-bit error, ideally that should have caused an immediate BSOD with a "WHEA_UNCORRECTABLE_ERROR" written near the bottom of the screen. That did not happen, but as demonstrated in Linux, this AM4 platform doesn't appear to react as it should to those types of errors.

Regarding the possible single-bit errors, the error source is a "Corrected Machine Check" but that doesn't help us since that happens on non-ECC systems as well. While the error type might be listed as "Cache Hierarchy Error", clicking on the details tab reveals "ErrorType 9" which means that a memory hierarchy error has occurred.

The reason that this is all way more complicated than it should be is that Windows doesn't recognize ECC as being enabled, so there's no reason for it to present the Event Data using the tidPlatformMemoryError template that would actually give us proper ECC detection and correction information.

When we installed a non-ECC memory kit and induced extreme levels of instability, the Event Viewer logged zero new WHEA errors. Is that proof enough that Windows is indeed able to detect and log memory errors when ECC memory is installed? Perhaps. However, we still have no indication that even single-bit errors are actually being corrected, so the situation is much worse than in Linux.

ECC can function without the involvement or even the acknowledgement of the operating system to a certain extent, so we can plausibly assume that the ECC feature is performing identically in Windows as it did in Linux. However, obviously, we cannot confirm it with 100% certainty.


In conclusion, what is currently available on the AM4 platform is an incomplete implementation of ECC. This is very likely why motherboard manufacturers have been relatively hesitant about claiming that their products support ECC memory in ECC mode. Based on our findings, there is clearly some level of ECC functionality that is working right now, but it does not cover the full spectrum of memory error detection and correction. Having said that, the status quo is arguably better than nothing, especially since single-bit errors are much more likely than multi-bit errors (which are often caused by a failing memory module), so we suspect that many people will still want the extra protection that is available right now.

While actual ECC validation will likely never occur on this consumer platform, if public interest in this feature keeps growing we fully expect motherboard manufacturers to step up to the plate and improve their ECC support. However, we strongly suspect that AMD will first have to release an update to their CPU microcode to fully unlock all of the necessary settings. Furthermore, there definitely needs to be some work done at the operating system level to let users know when ECC is enabled and what it is doing, more so on the Windows side than the Linux one.


We would like to thank Crucial for providing the ECC memory kit, and @tekwendell for the assistance with Linux.

If you have any comments or questions about this article, we would love to hear them.

 