
> The question was: which DIMM should we replace?

On server-class machines, ECC errors often also show up in the system event log, so one can run "ipmitool sel list" and inspect the most recent messages, and they often point to the failing DIMM in a nomenclature that corresponds to how the slots are labelled on the mainboard or in its manual.
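If the event log is long, a small filter helps surface the memory-related entries. This is only a rough sketch: the keywords in `MEMORY_PATTERN` are my own guess at what vendors print, the helper names are made up for illustration, and the exact SEL line format varies between BMCs, so adjust to whatever your machine actually emits.

```python
import re
import subprocess

# Assumed keywords; vendors word their SEL entries differently.
MEMORY_PATTERN = re.compile(r"memory|ecc|dimm", re.IGNORECASE)


def filter_memory_events(sel_text):
    """Return the SEL lines that look memory-related, in original order."""
    return [
        line.strip()
        for line in sel_text.splitlines()
        if MEMORY_PATTERN.search(line)
    ]


def read_sel():
    """Run `ipmitool sel list` and return its output as text.

    Needs ipmitool installed and access to the BMC (usually root).
    """
    result = subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Something like `print("\n".join(filter_memory_events(read_sel())[-10:]))` would then show the ten most recent matching entries.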

In this case, they are using a "gaming" mainboard, so this strategy probably doesn't work (no nice system event log).



System firmware can (but does not always) include a mapping between DIMM identifiers as exported by the Linux EDAC subsystem and the DIMM sockets on the mainboard. In the absence of such a mapping, you can provide one yourself via `edac-ctl --register-labels`. Of course, someone first has to figure out what that mapping actually is (but one can do that oneself, given a little patience) :)
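For what it's worth, the per-DIMM counters and labels can also be read straight from sysfs. A minimal sketch, assuming the kernel's EDAC driver for your memory controller is loaded and exposes the usual `mc*/dimm*` layout with `dimm_label`, `dimm_ce_count` and `dimm_ue_count`; the function names are just mine for illustration.

```python
from pathlib import Path

# Standard EDAC sysfs location; only populated if an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _to_int(value):
    """Parse a counter value, treating missing or odd content as 0."""
    return int(value) if value and value.isdigit() else 0


def dimm_error_counts(root=EDAC_ROOT):
    """Collect label and error counters for every DIMM that EDAC exposes."""
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        rows.append({
            "id": f"{dimm.parent.name}/{dimm.name}",
            "label": _read(dimm / "dimm_label") or "",
            "ce": _to_int(_read(dimm / "dimm_ce_count")),  # corrected errors
            "ue": _to_int(_read(dimm / "dimm_ue_count")),  # uncorrected errors
        })
    return rows


def suspects(rows):
    """DIMMs with any errors, worst first (uncorrected before corrected)."""
    bad = [r for r in rows if r["ce"] or r["ue"]]
    return sorted(bad, key=lambda r: (r["ue"], r["ce"]), reverse=True)
```

If the label comes back empty, that is the case where no mapping has been registered, and the `mcX/dimmY` id is all you have to go on.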


Most modern systems (since 2014-2016?) support WHEA, which allows the OS to be notified of hardware errors and write them to the OS system log.

Not sure if this would be seen in dmesg.



