
ECC Memory & AMD's Ryzen - A Deep Dive Comment Thread

JD

Moderator
Staff member
Joined
Jul 16, 2007
Messages
10,622
Location
Toronto, ON
Thanks, one of the migration issues it seems. OP has been edited.
 

Mastakilla

Member
Joined
Oct 22, 2019
Messages
13
After spending quite some time testing, I have an update on this great deep dive, but also some questions / issues with it.

Some background on why this is an update to the deep dive
I'm doing my testing on a server-grade X470 motherboard (ASRock Rack) with a Ryzen 3600 CPU, using a BIOS based on AGESA 1.0.0.3 ABBA (not officially released yet by ASRock Rack), on the latest Windows 10 1903 and, for Linux, the latest Fedora Rawhide.

Windows
1571955813388.png
For the first command, the article lists the possible values as: 2 (unknown), 3 (none), 4 (parity), 5 (single-bit ECC), or 6 (multi-bit ECC).
So that looks better than it used to!
For the second command: that also looks better (TotalWidth is larger than DataWidth), except that TotalWidth shows a doubled value (128) instead of the 72 that the article expected.
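
As a rough aid for reading these two outputs, here is a minimal Python sketch. The value table is the one quoted from the article above; the function names are hypothetical, and the DataWidth of 64 used in the example call is an assumption (only the TotalWidth of 128 is reported in this post).

```python
# Value table as quoted from the article for the first command.
ECC_TYPES = {
    2: "unknown",
    3: "none",
    4: "parity",
    5: "single-bit ECC",
    6: "multi-bit ECC",
}


def describe_ecc_type(code):
    """Return the article's label for a memory error correction code."""
    return ECC_TYPES.get(code, "not listed in the article")


def has_check_bits(total_width, data_width):
    """True when a module reports more total bits than data bits."""
    return total_width > data_width


if __name__ == "__main__":
    print(describe_ecc_type(6))
    # 128 is the TotalWidth reported above; 64 is an assumed DataWidth.
    print(has_check_bits(128, 64))
```

The width check only says that extra bits are reported; it says nothing about whether errors are actually corrected or logged.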

1571955987141.png
1571956039465.png
1571956058886.png
1571956065421.png
CPU-Z, HWiNFO64 and AIDA64 now also recognize the ECC RAM correctly, and AIDA64 additionally reports that it is enabled.

Linux
1571956328711.png
In Linux everything looks OK as well: 'DRAM ECC enabled' and 'using x16 syndromes'.
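
A small sketch of how one might check a saved kernel log for those two phrases. The two marker strings are the ones quoted above; the function name is hypothetical, and the sample text is only a stand-in, not real kernel output.

```python
# Phrases quoted in the post; anything else in the log is ignored.
MARKERS = ("DRAM ECC enabled", "using x16 syndromes")


def find_edac_markers(log_text):
    """Map each marker phrase to True/False depending on whether it appears."""
    return {marker: (marker in log_text) for marker in MARKERS}


if __name__ == "__main__":
    # Stand-in text for illustration only, not actual dmesg output.
    sample = "... DRAM ECC enabled ...\n... using x16 syndromes ...\n"
    for marker, found in find_edac_markers(sample).items():
        print(f"{marker}: {'found' if found else 'missing'}")
```

In practice the text would come from something like the output of `dmesg` saved to a file and read in.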

But then the actual testing
For this I've overclocked the memory from 1333 MHz to 1500 MHz, keeping all timings the same. At 1533 MHz or 1567 MHz the board no longer POSTs and requires a CMOS clear to recover.
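
To put those steps in perspective, a quick arithmetic sketch. It assumes the figures above are memory clock speeds (so the DDR transfer rate is twice the clock); the function names are hypothetical.

```python
def overclock_percent(base_mhz, target_mhz):
    """Relative increase of target over base, in percent."""
    return (target_mhz - base_mhz) / base_mhz * 100.0


def ddr_transfer_rate(clock_mhz):
    """DDR memory transfers data twice per clock cycle (assumes a clock figure)."""
    return 2 * clock_mhz


if __name__ == "__main__":
    base = 1333
    for target in (1500, 1533, 1567):
        print(
            f"{base} -> {target} MHz: "
            f"+{overclock_percent(base, target):.1f}% "
            f"({ddr_transfer_rate(target)} MT/s)"
        )
```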

These are my default settings (the memory is at the bottom right):
1571956612714.png
And these are my overclocked settings:
1571956641437.png

However, with the overclocked settings I'm failing to log any memory errors at all on either Windows or Linux... :(
1571956781058.png
1571956757111.png
memtester, MemTest86 and Prime95 Blend can all run for hours without error at this speed.

I suspect that ECC actually does work and corrects many errors, but that it doesn't report anything to either OS (because increasing the frequency just slightly further stops the board from POSTing at all).
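
On the Linux side, one way to see whether any corrected errors are being counted at all is to snapshot the EDAC counters before and after a stress run. Below is a minimal sketch, assuming the kernel exposes the usual `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc/`; the function names are hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver bound, or a non-Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc / name
            if counter.is_file():
                entry[key] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


def diff_counts(before, after):
    """Per-controller increase of each counter between two snapshots."""
    return {
        mc: {k: v - before.get(mc, {}).get(k, 0) for k, v in vals.items()}
        for mc, vals in after.items()
    }


if __name__ == "__main__":
    # Take one snapshot; a second one after the stress test can be diffed.
    print(read_edac_counts())
```

If the diff stays at zero while the memory is known to be marginal, that points at reporting rather than at correction, which is the suspicion described above.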

Please read further in the next post...
 

Mastakilla

Member
Joined
Oct 22, 2019
Messages
13
In the IPMI I'm also not finding any errors being reported:
1571958023961.png
1571958029648.png

I also tried disabling the ECC functionality to see whether I could make any of the stress-test programs crash, or whether the OS would then receive reports of uncorrected errors (this would at least prove that my memory is "unstable" at this frequency).
But that also failed. Even after disabling ECC, I get no errors in Linux or Windows (I haven't checked the IPMI in this scenario yet) and no crashes either.

I used the BIOS settings below for this (not sure whether this is correct / sufficient, though).

These settings are default (but show the BIOS maze I went through to get there ;) ):
1571957410023.png 1571957419019.png
1571957423231.png 1571957426834.png 1571957430748.png

I've tried to change 'DRAM ECC Enable' to 'Disabled' and after that also 'DRAM UECC Retry' to 'Disabled':
1571957447542.png 1571957450927.png

Can someone help me figure out how to fix my ECC error reporting to the OS (and/or the IPMI) or explain what I'm doing wrong?

Thanks!
 

Mr. Friendly

Well-known member
Joined
Nov 21, 2007
Messages
6,802
Location
British Columbia
this is some cool stuff. after reading the article again I can see it is in need of serious updating as it's 3 years old and the Ryzen platform has come a long way since then. a refresh could show the better overall picture now. :)
 

Entz

Well-known member
Joined
Jul 17, 2011
Messages
1,870
Location
Kelowna
Can someone help me figure out how to fix my ECC error reporting to the OS (and/or the IPMI) or explain what I'm doing wrong?
Are you 100% sure you are actually having ECC errors? There is no guarantee that a bit flip will occur during stress testing or with unstable memory. Are you messing with timings or just frequency? The article did it with frequency.
 

Mastakilla

Member
Joined
Oct 22, 2019
Messages
13
Hi,

No, I'm not 100% sure that I'm actually having ECC errors. But it would surprise me if I didn't...

Also, before trying it the above way, I accidentally left all memory timings set to auto, and when I started increasing the frequency, the board automatically loosened the timings, which gives a completely different scenario. With the loosened timings I could increase the frequency from 1333 MHz to 1967 MHz before it stopped booting. But in that scenario, too, no memory errors were reported at, for example, 1933 MHz.

If you can tell me which timing I should decrease while keeping all other settings default, I'm of course very happy to try whether it makes a difference...
 

nToxik

Well-known member
Joined
Apr 7, 2008
Messages
193
There was similar testing done on the Unraid forums on Reddit as well as the official Unraid forums.

https://www.reddit.com/r/Amd/comments/cqu49x
For Ryzen builds, ECC 'looks' like it is functioning but it really isn't. I'm not sure if this is motherboard/vendor specific or not.
 

Entz

Well-known member
Joined
Jul 17, 2011
Messages
1,870
Location
Kelowna
If you can tell which timing I should decrease while keeping all other settings default, I'm of course very happy to try if it makes a difference...
Yeah, I am not sure what it would take to simulate. You need to get the timings such that writes work perfectly fine and just a few reads fail. Too many errors, or too big of an error (unrecoverable), and the system will crash.

I have never overclocked ECC RAM, as that is kind of counterproductive, so I am not sure what it would take.

That is assuming it is even working at all. I would expect the errors to show up on the IPMI side rather than in the OS if it is a driver issue, and if that isn't working, it likely isn't catching them, or you're just extremely lucky, with writing 10 = reading 10 until you hit a speed where nothing works.
 

Mastakilla

Member
Joined
Oct 22, 2019
Messages
13
There was similar testing done on the Unraid forums on Reddit as well as the official Unraid forums.

https://www.reddit.com/r/Amd/comments/cqu49x
For Ryzen builds, ECC 'looks' like it is functioning but it really isn't. I'm not sure if this is motherboard/vendor specific or not.
I just read those links, then did some testing... Could it be that support has only been there since Linux kernel 5.4?

I don't know whether it works well (I couldn't confirm that yet with my testing, as you can read earlier).

Ubuntu 19.10 (Linux kernel 5.3)

root@nas:~# find /lib/modules/5.3.0-19-generic/ | grep -i -E 'edac'
/lib/modules/5.3.0-19-generic/kernel/drivers/edac
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i7core_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/skx_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/amd64_edac_mod.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5100_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i10nm_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/x38_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i3000_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/sb_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i3200_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i7300_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5400_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i82975x_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/edac_mce_amd.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/e752x_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/pnd2_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/ie31200_edac.ko
/lib/modules/5.3.0-19-generic/kernel/drivers/edac/i5000_edac.ko
root@nas:~# apt list edac-utils
Listing... Done
edac-utils/eoan,now 0.18-1build1 amd64 [installed]
edac-utils/eoan 0.18-1build1 i386
root@nas:~# edac-util -vs
edac-util: EDAC drivers loaded. No memory controllers found
root@nas:~# edac-util -v
edac-util: Error: No memory controller data found.
root@nas:~#


Fedora Rawhide (Linux kernel 5.4)

[root@localhost ~]# find /lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/ | grep -i -E 'edac'
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/amd64_edac_mod.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5000_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/sb_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/skx_edac.ko.xz
/lib/modules/5.4.0-0.rc3.git0.1.fc32.x86_64/kernel/drivers/edac/x38_edac.ko.xz
[root@localhost ~]# yum info edac-utils
Last metadata expiration check: 0:01:47 ago on Sun 27 Oct 2019 11:44:47 PM CET.
Installed Packages
Name : edac-utils
Version : 0.16
Release : 21.fc31
Architecture : x86_64
Size : 101 k
Source : edac-utils-0.16-21.fc31.src.rpm
Repository : @System
From repo : rawhide
Summary : Userspace helper for kernel EDAC drivers
URL : http://sourceforge.net/projects/edac-utils/
License : GPLv2+
Description : EDAC is the current set of drivers in the Linux kernel that handle
: detection of ECC errors from memory controllers for most chipsets
: on i386 and x86_64 architectures. This userspace component consists
: of an init script which makes sure EDAC drivers and DIMM labels
: are loaded at system startup, as well as a library and utility
: for reporting current error counts from the EDAC sysfs files.
[root@localhost ~]# edac-util -vs
edac-util: EDAC drivers are loaded. 1 MC detected:
mc0:F17h_M70h
[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[root@localhost ~]#
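
For repeated test runs, the `edac-util -v` output above can be tallied with a few lines of Python. This is a minimal sketch that only assumes the line format shown in the transcript; the function name is hypothetical, and the sample text is a shortened copy of the output above.

```python
import re

# Matches e.g. "0 Corrected Errors" or "0 Uncorrected Errors".
COUNT_RE = re.compile(r"(\d+) (Corrected|Uncorrected) Errors")


def tally_edac_util(output):
    """Sum the corrected and uncorrected counts over all lines of the output."""
    totals = {"Corrected": 0, "Uncorrected": 0}
    for line in output.splitlines():
        match = COUNT_RE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


if __name__ == "__main__":
    sample = (
        "mc0: 0 Uncorrected Errors with no DIMM info\n"
        "mc0: 0 Corrected Errors with no DIMM info\n"
        "mc0: csrow2: 0 Uncorrected Errors\n"
        "mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors\n"
    )
    print(tally_edac_util(sample))
```

The sum is only a quick "did anything change" indicator; it does not try to attribute errors to a particular DIMM.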
 
