ECC Memory & AMD’s Ryzen – A Deep Dive
Success… Sort Of
Now everything seems to be lining up nicely with respect to ECC being enabled on this new AM4 platform, but without verifying that the operating systems are logging corrections and errors, we can’t make a definitive claim. The hardware might be silently correcting single-bit errors and even detecting ‘catastrophic’ two-bit errors, but there’s no way to know without a log.
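To make the difference between correcting a single-bit error and merely detecting a two-bit error concrete, here is a minimal sketch of a SECDED (single error correction, double error detection) code. It is a toy Hamming(8,4) code that protects 4 data bits with 4 check bits; it is an assumption for illustration only and not the wider code that actual ECC DIMMs apply to each 64-bit word. The function names (encode, decode, flip) are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1 bits).

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming positions (parity at 1, 2, 4 and data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of all 8 bits even
    return code


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions that hold a 1
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd parity means one bit flipped; the syndrome is its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))              # clean read
    print(decode(flip(word, 5)))     # one flipped bit
    print(decode(flip(word, 5, 6)))  # two flipped bits
```

With one flipped bit the decoder recovers the original value; with two flipped bits it can only report that the word is damaged, which is the situation in which a system is expected to log the event and halt.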
With this in mind, we decided to intentionally cause some memory errors to see if the overall error detection and correction capabilities of this platform were actually functioning as intended. Since we don’t have our own particle accelerator to bombard the memory modules with in order to cause radiation-based errors, we settled on a simpler solution: overclocking. Instead of pushing up the memory frequency, we decided to cause instability by tightening the timings. Much to our surprise, the Crucial DDR4-2400 ECC modules proved to be quite remarkable. In order to cause instability we had to tighten the timings from 17-17-17-17-39 all the way down to 14-14-12-11-21. Sticking to Ubuntu for now, we ran the ‘Stress’ utility (sudo apt-get install stress) to stress the memory with 50 processes, each requiring 256MB, for a grand total of 12.8GB of heavy RAM usage.
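For readers who want to reproduce the load, the stress run described above can be sketched as follows. This is a rough illustration rather than the exact command line we used: it assumes the stress utility's --vm, --vm-bytes and --timeout options, and the helper names and the optional timeout are hypothetical additions.

```python
import shlex


def build_stress_command(workers, mb_per_worker, timeout_s=None):
    """Build an argument list for the 'stress' utility's memory workers."""
    cmd = ["stress", "--vm", str(workers), "--vm-bytes", f"{mb_per_worker}M"]
    if timeout_s is not None:
        # Hypothetical convenience: stop the run after a fixed time.
        cmd += ["--timeout", f"{timeout_s}s"]
    return cmd


def total_allocation_mb(workers, mb_per_worker):
    """Total memory requested across all workers, in MB."""
    return workers * mb_per_worker


if __name__ == "__main__":
    # 50 workers at 256MB each, as in the test described above.
    print(shlex.join(build_stress_command(50, 256)))
    print(total_allocation_mb(50, 256), "MB requested in total")
```

The resulting argument list can be handed to subprocess.run on a machine with stress installed. Leave enough free memory for the operating system, otherwise the workers may be killed before any memory errors show up.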
What happened next? Well, as terrible clickbait headline writers across the internet would say, the results will shock you:
Corrected error, corrected error, corrected error, …UNCORRECTED ERROR.
So we have good news and we have bad news. The system had absolutely no problems detecting and correcting single-bit errors, which are often loosely referred to as soft errors. It corrected literally hundreds of them without missing a beat. This is why people invest in an ECC-capable platform: a corrected error (CE) event neither corrupts data nor brings down the system. These single-bit errors are detected, corrected, and logged. Usually the logging is handled by both the motherboard and the operating system, but right now it is handled only by the operating system.
HOWEVER, things are not quite perfect. On that last line you will notice “1 UE”. That is an uncorrected error (UE), in this case a two-bit error, which is often loosely referred to as a hard error. Two-bit errors cannot be corrected by ECC memory. When one occurs, it should be detected and logged, and ideally the system should be halted immediately. These are considered fatal errors and they can easily cause data corruption if the system is not quickly halted and/or rebooted. Regrettably, only 2 of the 3 steps happened. The error was detected and it was logged, but the system kept running. The only reason it is the last line in that image is that we immediately took a screenshot in case the system halted, but that never happened.
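The CE and UE tallies can also be read programmatically. The sketch below assumes that the Linux kernel's EDAC driver is loaded and exposes per-memory-controller counters as ce_count and ue_count files under /sys/devices/system/edac/mc; whether that directory is populated depends on the kernel and platform, and this is not necessarily how the log shown above was produced. The function name is hypothetical.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller counters in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mcN directory found."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not available (or not a Linux system)
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found")
    for name, c in counts.items():
        print(f"{name}: {c['ce']} corrected, {c['ue']} uncorrected")
```

Polling these counters while a stress test runs gives a simple way to watch corrected errors accumulate and to notice the moment an uncorrected error is recorded.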
Since we have now definitely established that some form of ECC is functional, let’s move back to Windows 10 Pro and see how that operating system handles memory instability. As you will remember, nothing but AIDA64 suggested that ECC was enabled in Windows. However, by opening up the Event Viewer, and keeping an eye out for any Windows Hardware Error Architecture (WHEA) errors, we might be able to determine whether ECC is actually working:
Clearly, Windows 10 is able to tap into Ryzen’s error detection capabilities to report on Machine Check Exceptions (MCE). An MCE indicates that the processor has detected a hardware problem. However, MCEs can often be triggered by ECC-enabled caches, and not necessarily by ECC memory. Since Ryzen processors have internal ECC detectors for the L1, L2 and L3 caches, and we can’t separate the cache from the memory, it is hard to determine which component is actually causing the errors. Logic dictates that it is the memory since we are intentionally causing memory instability. If the MCE had been caused by a two-bit error, ideally that should have caused an immediate BSOD with a “WHEA_UNCORRECTABLE_ERROR” written near the bottom of the screen. That did not happen, but as demonstrated in Linux, this AM4 platform doesn’t appear to react as it should to those types of errors.
Regarding the possible single-bit errors, the error source is a “Corrected Machine Check” but that doesn’t help us since that happens on non-ECC systems as well. While the error type might be listed as “Cache Hierarchy Error”, clicking on the details tab reveals “ErrorType 9” which means that a memory hierarchy error has occurred.
This is all far more complicated than it should be because Windows doesn’t recognize ECC as being enabled, so there’s no reason for it to present the Event Data using the tidPlatformMemoryError template that would actually give us proper ECC detection and correction information.
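The WHEA entries discussed above can also be pulled from the command line instead of clicking through the Event Viewer. The sketch below is an assumption-laden illustration: it builds a wevtutil query against the System log filtered on the Microsoft-Windows-WHEA-Logger provider, and the helper names and the default event count are hypothetical. It only lists the events; it does not decode the error type fields.

```python
import subprocess
import sys

# Provider name under which Windows logs WHEA events.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def build_whea_query(max_events=20):
    """Build a wevtutil argument list that lists the newest WHEA events."""
    xpath = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",       # XPath filter on the event provider
        "/f:text",           # human-readable output
        f"/c:{max_events}",  # limit the number of events returned
        "/rd:true",          # newest events first
    ]


def query_whea_events(max_events=20):
    """Run the query and return wevtutil's text output (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("wevtutil is only available on Windows")
    result = subprocess.run(
        build_whea_query(max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if sys.platform == "win32":
        print(query_whea_events())
    else:
        print(" ".join(build_whea_query()))
```

Running the query before and after a stress test makes it easy to see whether any new WHEA entries were logged during the run.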
When we installed a non-ECC memory kit and induced extreme levels of instability, the Event Viewer logged zero new WHEA errors. Is that proof enough that Windows is indeed able to detect and log memory errors when ECC memory is installed? Perhaps. However, we still have no indication that even single-bit errors are actually being corrected, so the situation is much worse than in Linux.
To a certain extent, ECC can function without the involvement or even the acknowledgement of the operating system, so we can plausibly assume that the ECC feature is performing the same way in Windows as it did in Linux. However, we cannot confirm that with 100% certainty.
In conclusion, what is currently available on the AM4 platform is an incomplete implementation of ECC. This is very likely why motherboard manufacturers have been relatively hesitant about claiming that their products support ECC memory in ECC mode. Based on our findings, some level of ECC functionality is clearly working right now: under Linux, single-bit errors are detected, corrected and logged, and two-bit errors are detected and logged. What is missing is the final step of halting the system when an uncorrectable error occurs, so the platform does not cover the full spectrum of memory error handling. Having said that, the status quo is arguably better than nothing, especially since single-bit errors are much more likely than multi-bit errors (which are often caused by a failing memory module), so we suspect that many people will still want the extra protection that is available right now.
While actual ECC validation will likely never occur on this consumer platform, if public interest in this feature keeps growing we fully expect motherboard manufacturers to step up to the plate and improve their ECC support. However, we strongly suspect that AMD will first have to release an update to their CPU microcode to fully unlock all of the necessary settings. Furthermore, there definitely needs to be some work done at the operating system level to let users know when ECC is enabled and what it is doing, more so on the Windows side than the Linux one.