ECC Memory & AMD's Ryzen - A Deep Dive Comment Thread

Ashram

New member
Joined
Apr 1, 2017
Messages
1
great article, which firmware revision?

Hi,

This article is great, especially the details regarding 1-bit correction and 2-bit detection.
The point about 2-bit detection not triggering the expected behaviour is also really great!

But I could not find any details about the firmware (BIOS/UEFI) revision used during this testing.
This would have been really valuable information!

I'm currently planning to build multiple rigs using the Ryzen 1700 (TDP 65W) + 4x Kingston KVR21E15D8/16 16GB DDR4 ECC modules, so this article is really helpful, as I was not (and am not yet) aware of mainboards which really/officially support ECC to some extent.

I hope we can see some information regarding compatibility of those KVR21E15D8/16 modules with the Taichi in the very near future, as this would be one of my favored combinations. (Does anyone have some information regarding compatibility?)

Indeed, 2-bit detection not triggering as expected is "not so nice", but working 1-bit correction is already great, as it is really better than nothing at all, which would be the case with non-ECC memory.
(I know, using a Xeon E3 would be an alternative for ECC support, but the number of concurrent threads we get for the money spent simply argues against those Intel CPUs, regardless of the performance offered per thread.)

Thank you very much for this article!
 

MAC

Associate Review Editor
Joined
Nov 8, 2006
Messages
1,086
Location
Montreal
Glad you liked it.

I was using the 1.55 Beta BIOS; they are now up to 1.94A Beta. I wouldn't expect any change on the ECC front yet, but memory compatibility should be better.

Those Kingston modules you listed are pretty much designed to be highly compatible, and looking at their specs there's nothing of concern that stands out to me.
 

chithanh

New member
Joined
Mar 31, 2017
Messages
4
nothing happened aside from the UEs being detected. So based on my understanding of ECC I made the caution-minded conclusion that nothing was being done behind the scenes with respect to correcting multi-bit errors.
Did you run the test again with mce=0 and still no kernel panic on uncorrectable error?
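
For anyone repeating the test, it helps to confirm which mce= option the machine was actually booted with. Here is a minimal Python sketch of that check, assuming a Linux system that exposes /proc/cmdline; the function name mce_boot_options is made up for illustration and is not part of any tool.

```python
from pathlib import Path


def mce_boot_options(cmdline):
    """Return the value of every mce= option on a kernel command line."""
    return [token.split("=", 1)[1]
            for token in cmdline.split()
            if token.startswith("mce=")]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # assumption: running on Linux
    if proc_cmdline.exists():
        options = mce_boot_options(proc_cmdline.read_text())
        print(options or "no mce= option on the kernel command line")
```

An empty result means no mce= option was passed, so the kernel's default handling applies.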
 

sor

New member
Joined
Apr 2, 2017
Messages
1
At least for Linux, the default behaviour is to terminate the affected process with SIGBUS if the uncorrectable error happens in userspace, and panic if the error happens in kernel space.
See also linux/Documentation/x86/x86_64/boot-options.txt and linux/arch/x86/kernel/cpu/mcheck/mce.c

You can boot with mce=0 kernel parameter to always cause a kernel panic on uncorrectable errors.

There's a better way to troubleshoot this than trial and error. The behavior is managed by the edac module, so we just need to check the module's parameters and their current values, then change them if necessary. The documentation is https://www.kernel.org/doc/html/latest/admin-guide/ras.html, and the module parameter in question is "panic_on_ue". At any rate, I wouldn't expect the motherboard to be responsible for the lack of an OS halt.
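
Checking the module parameters could look roughly like the Python sketch below. It assumes the EDAC core module exposes its parameters under /sys/module/edac_core/parameters, which is where I would expect them but may differ by kernel version. Since the exact file name may carry a prefix, it matches any parameter whose name contains "panic_on_ue" rather than hard-coding one. The function name read_panic_on_ue_settings is hypothetical.

```python
from pathlib import Path

# Assumption: default sysfs location of the EDAC core module parameters.
DEFAULT_PARAMS_DIR = "/sys/module/edac_core/parameters"


def read_panic_on_ue_settings(params_dir=DEFAULT_PARAMS_DIR):
    """Return {parameter name: value} for parameters matching 'panic_on_ue'."""
    root = Path(params_dir)
    settings = {}
    if not root.is_dir():
        return settings  # module not loaded, or a different sysfs layout
    for entry in sorted(root.iterdir()):
        if "panic_on_ue" in entry.name and entry.is_file():
            try:
                settings[entry.name] = entry.read_text().strip()
            except OSError:
                settings[entry.name] = None  # present but not readable
    return settings


if __name__ == "__main__":
    found = read_panic_on_ue_settings()
    if not found:
        print("no panic_on_ue parameter found (is the EDAC module loaded?)")
    for name, value in found.items():
        print(f"{name} = {value}")
```

If the parameter shows up with a value of 0, that alone would explain why an uncorrectable error is logged without the system halting.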
 

danielocdh

New member
Joined
Apr 4, 2017
Messages
2
Excellent article, although I'm a little sad there are no conclusive results as this is the only thing keeping me from building my new PC.
 

MarcT

New member
Joined
Apr 5, 2017
Messages
1
Interesting article - many thanks for putting this together!

I'm looking to replace my ancient AMD Phenom 9850 based workstation (which has ECC RAM) with a Ryzen with ECC RAM.
Comparing your EDAC output with mine (same kernel - 4.10.1), the memory controller in mine is actually showing the memory modules - which seems to be missing from your output:

root@anvil:~# dmesg | grep -i edac
[ 8.046849] EDAC MC: Ver: 3.0.0
[ 8.050061] EDAC amd64: DRAM ECC enabled.
[ 8.050069] EDAC amd64: F10h detected (node 0).
[ 8.050093] EDAC MC: DCT0 chip selects:
[ 8.050095] EDAC amd64: MC: 0: 1024MB 1: 1024MB
[ 8.050100] EDAC amd64: MC: 2: 2048MB 3: 0MB
[ 8.050104] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 8.050109] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 8.050113] EDAC MC: DCT1 chip selects:
[ 8.050114] EDAC amd64: MC: 0: 1024MB 1: 1024MB
[ 8.050119] EDAC amd64: MC: 2: 2048MB 3: 0MB
[ 8.050124] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 8.050128] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 8.050133] EDAC amd64: using x4 syndromes.
[ 8.050138] EDAC amd64: MCT channel count: 2
[ 8.050299] EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV 0000:00:18.3 (INTERRUPT)
[ 8.050323] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.2 (POLLED)
[ 8.050331] AMD64 EDAC driver v3.4.0

This machine has 8GB RAM made up of 2 x dual-rank 2GB DIMMs & 2 x single-rank 2GB DIMMs.
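
As a sanity check, the chip-select sizes in that output can be added up with a small Python sketch. It assumes the "EDAC amd64: MC: n: sizeMB" line format shown above, which may differ on other kernels or drivers; the function name total_chip_select_mb is made up for illustration.

```python
import re

# Matches "<chip select>: <size>MB" pairs such as "0: 1024MB".
CS_PAIR = re.compile(r"(\d+):\s*(\d+)MB")
MARKER = "EDAC amd64: MC:"


def total_chip_select_mb(dmesg_text):
    """Sum the chip-select sizes (in MB) reported by the amd64 EDAC driver."""
    total = 0
    for line in dmesg_text.splitlines():
        if MARKER not in line:
            continue  # skip headers and unrelated EDAC lines
        payload = line.split(MARKER, 1)[1]
        for _chip_select, size_mb in CS_PAIR.findall(payload):
            total += int(size_mb)
    return total


SAMPLE = """\
[ 8.050093] EDAC MC: DCT0 chip selects:
[ 8.050095] EDAC amd64: MC: 0: 1024MB 1: 1024MB
[ 8.050100] EDAC amd64: MC: 2: 2048MB 3: 0MB
[ 8.050113] EDAC MC: DCT1 chip selects:
[ 8.050114] EDAC amd64: MC: 0: 1024MB 1: 1024MB
[ 8.050119] EDAC amd64: MC: 2: 2048MB 3: 0MB
[ 8.050138] EDAC amd64: MCT channel count: 2
"""

print(total_chip_select_mb(SAMPLE))  # 8192
```

Each DCT reports 1024 + 1024 + 2048 = 4096MB, and the two together give 8192MB, which matches the 8GB installed.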

Anyway, the fact the Ryzen system reported CE and UE errors is great news.
I'm planning to go with the Taichi board and 32GB (2x16GB) of Crucial CT2K16G4WFD824A ECC RAM. I'll let you know how I get on.
 

MAC

Associate Review Editor
Joined
Nov 8, 2006
Messages
1,086
Location
Montreal
Glad you liked it!

Yeah, I suspect that EDAC will get an update that helps it better recognize the memory modules on this platform.

I'm interested in hearing how your build goes, since if 16GB modules work fine then workstations with 64GB of ECC RAM are a tantalizing possibility.
 

danielocdh

New member
Joined
Apr 4, 2017
Messages
2
Not one is offering an enable/disable option in the UEFI, and we haven't seen anyone but ASRock and ASUS have any ECC settings available at the moment.

What are the ASUS ECC settings, and which mobos have them?
Thanks
 

pepoluan

New member
Joined
Jun 14, 2017
Messages
1
Great article! :thumb:

I registered just to say a heartfelt THANK YOU for writing this!

Despite the sad status quo that no mobo producer is currently fully supporting ECC, the key is definitely the CPU. And by unlocking ECC support in the CPU (I always love AMD because they always unlock features across the board), it means that I can at least start using Ryzen with any Good Enough(tm) motherboard available... and swap out the motherboard when one of the manufacturers decides to bite the bullet and outright support ECC.

Ryzen, here I come! :clap:

Again, thanks for an informative -- and investigative -- article!
 

chithanh

New member
Joined
Mar 31, 2017
Messages
4
@MAC

Are you still planning to update the article with a correct description of how the Linux MCE handler behaves? It seems that your article is making the rounds, including the misinformation that an uncorrectable error should always halt the system.
 