Checking Out Machine Check Exception (MCE) Errors in Linux


I recently reviewed the ODROID-H2 with Ubuntu 19.04, and noticed some error messages in the kernel log of the Intel Celeron J4105 single board computer while running the SBC-Bench benchmark:
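Messages of this kind carry the kernel's `mce:` prefix and `[Hardware Error]` tag, so a minimal sketch for pulling them out of the kernel log could look like this. The helper name `filter_mce` is made up for illustration and is not part of any package:

```shell
# Keep only machine-check lines from kernel log text read on stdin.
# filter_mce is a hypothetical helper name.
filter_mce() {
    grep -E 'mce: \[Hardware Error\]'
}

# Typical use on a live system (reading the kernel log may need root):
#   sudo dmesg | filter_mce
```

Reading from stdin keeps the filter usable on a saved log file as well as on live `dmesg` output.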


I did not know what to make of those errors, but I was told I would get more details with mcelog, which can be installed as follows:


There’s just one little problem: it’s not in the Ubuntu 19.04 repository, and a bug report mentions that mcelog is now deprecated and has been removed from Ubuntu 18.04 Bionic onwards. Instead, we’re told that the functionality of the mcelog package has been replaced by rasdaemon.
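Whether a given release still ships a package can be checked before trying to install it. The sketch below assumes the usual `apt-cache policy` output, where an installable package has a `Candidate:` line with a version and a missing one shows `(none)` or nothing at all. `has_candidate` is a hypothetical helper name:

```shell
# Succeed (exit 0) if `apt-cache policy <pkg>` output on stdin lists an
# installable candidate version. has_candidate is a hypothetical helper name.
has_candidate() {
    awk '$1 == "Candidate:" && $2 != "(none)" { found = 1 } END { exit !found }'
}

# Typical use:
#   apt-cache policy mcelog | has_candidate || echo "mcelog not available, try rasdaemon"
```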

But before looking into the utilities, let’s find out what a Machine Check Exception (MCE) is all about, according to the Arch Linux Wiki:

A machine check exception (MCE) is an error generated by the CPU when the CPU detects that a hardware error or failure has occurred.

Machine check exceptions (MCEs) can occur for a variety of reasons ranging from undesired or out-of-spec voltages from the power supply, from cosmic radiation flipping bits in memory DIMMs or the CPU, or from other miscellaneous faults, including faulty software triggering hardware errors.

Hardware errors should probably be taken seriously. Let’s investigate how to run the tools. First, I tried to install the mcelog package from Ubuntu 16.04:


Oh good! It installed… Let’s run some commands:


Nothing interesting shows up here, but the file /var/log/mcelog is now up, and we can see details about the errors:


But let’s also try the recommended rasdaemon to see if we can get similar details.

Installation:


It looks like the service will not start automatically upon installation, so a reboot may be needed, or simply run the following command:
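Assuming the systemd unit is called `rasdaemon`, as it is in the Debian and Ubuntu package, that step can be sketched as follows. The wrapper name `start_rasdaemon` is made up for illustration:

```shell
# Start rasdaemon unless systemd already reports it as active.
# start_rasdaemon is a hypothetical wrapper; the unit name "rasdaemon" is assumed.
start_rasdaemon() {
    if systemctl is-active --quiet rasdaemon; then
        echo "rasdaemon already running"
    else
        sudo systemctl start rasdaemon && echo "rasdaemon started"
    fi
}
```

Checking `is-active` first makes the function safe to call repeatedly, for example from a provisioning script.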


I ran a few commands and at first, it looked like some driver may be needed:


This should be related to EDAC drivers that are used for ECC memory according to a thread on Grokbase. Gemini Lake processors do not support ECC memory, so I probably don’t need it.
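To see whether any EDAC memory controller driver is registered at all, one can look at sysfs directly. This is only a sketch: it assumes the EDAC core exposes controllers as `mc0`, `mc1`, ... under `/sys/devices/system/edac/mc`, and `list_edac_controllers` is a hypothetical helper name:

```shell
# List EDAC memory controllers registered in sysfs, if any.
# The default path is an assumption about where the EDAC core exposes them;
# list_edac_controllers is a hypothetical helper name.
list_edac_controllers() {
    local root="${1:-/sys/devices/system/edac/mc}"
    local mc found=1
    for mc in "$root"/mc[0-9]*; do
        [ -d "$mc" ] || continue
        basename "$mc"
        found=0
    done
    [ "$found" -eq 0 ] || echo "no EDAC memory controller registered"
    return "$found"
}
```

On a platform without ECC support, such as the Gemini Lake board discussed here, the function would be expected to report that no controller is registered.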

Running one more command to show the summary of errors, and we’re getting somewhere:


12 corrected errors related to the L2 cache. We can get the full details with the appropriate command:
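With rasdaemon, both views most likely come from its `ras-mc-ctl` helper: `--summary` for the per-type counts and `--errors` for the individual records. A sketch that runs the two back to back, with `mce_report` as a made-up wrapper name:

```shell
# Print rasdaemon's error summary followed by the full list of records.
# mce_report is a hypothetical wrapper around the ras-mc-ctl tool.
mce_report() {
    echo "== summary =="
    sudo ras-mc-ctl --summary
    echo "== full records =="
    sudo ras-mc-ctl --errors
}
```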


The status is green, which means everything still works, but the utility reports a “large number of corrected cache errors”, and the “system (is) operating, but might lead to uncorrected errors soon” (See source code). It happens only a few times a day, and I’m not sure what can be done about it, since the cache is embedded in the processor and cannot be replaced. Maybe it’s just an issue with the particular processor I’m running. If you have an ODROID-H2 running, it may be useful to check the kernel log with dmesg to see if you’ve got the same errors. If you do, please also indicate whether you have a board from the first batch (November 2018) or one of the new ODROID-H2 Rev B boards.


tkaiser

> the file /var/log/mcelog is now up

That’s how we monitor x86 commodity hardware: installing the mcelog package, then defining a primitive rule watching the size of /var/log/mcelog. Once the size exceeds 0 you have a problem.
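The rule described above can be sketched as a small shell check. It assumes the log lives at `/var/log/mcelog` as mentioned in the article, and `mcelog_has_events` is a hypothetical name for the check:

```shell
# Succeed (exit 0) when the mcelog log exists and is larger than 0 bytes,
# i.e. at least one machine check event has been recorded.
# mcelog_has_events is a hypothetical helper name.
mcelog_has_events() {
    local log="${1:-/var/log/mcelog}"
    [ -s "$log" ]
}

# Example use from a cron job or monitoring agent:
#   mcelog_has_events && echo "MCE events logged on $(hostname)"
```

The `-s` test is true only for an existing, non-empty file, so a missing log is treated the same as an empty one.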

As for the occurrences of these 2nd level cache errors: most probably they only occur once data is pumped through the CPU cores (e.g. running 7-zip or cpuminer as part of sbc-bench for example).

theguyuk

It might also be worth posting on the ODROID-H2 forum, or raising an issue with Ubuntu, to increase the number of board users reading this.

Eversor

Hi there,

This is not normal. One error once in a couple months due to cosmic rays, that is normal.

You are being warned because at that rate there is something wrong with the hardware or too much EMI from another device.

tkaiser

> at that rate there is something wrong with the hardware

So far only ECCed (corrected) single bit flips in L2 cache. This slightly harms performance and will only be a real issue once two bits flip at the same time. While I wouldn’t trust such a CPU that much, for the average use case of a Gemini Lake box (media center, desktop) this shouldn’t matter that much.

David Willmore

Physics is physics. You’re going to get bit flips in caches. Beta particles from C-14 decay. Beta particles from the Si itself, etc. Then there are cosmic rays that you just can’t avoid as they’re everywhere. As transistors get smaller and smaller, Johnson–Nyquist noise becomes a bigger issue. The error may even have occurred due to a transmission error on the chip: the right value may have been stored, but it got corrupted between the storage element and the ECC block.

As long as the ECC is catching single bit errors at a relatively low rate, there’s nothing to worry about. If it’s seeing double bit errors, then it’s time to be concerned.

willy

Sometimes there are 4 in the same second, this is far too much, something is busted in this machine, and its ability to recover from all the events you enumerated is affected by this existing one. The probability of two-bit *unrelated* errors remains very low. But if the hardware is defective, the probability of two-bit errors in the same cache line cannot be dismissed.
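A quick way to check how tightly errors cluster is to count events per timestamp. The sketch below assumes each logged event carries one `TIME <epoch-seconds>` field, as the kernel's mce lines do, and `busiest_second` is a made-up helper name:

```shell
# Print "<count> <epoch>" for the second with the most MCE records on stdin.
# Assumes one "TIME <epoch-seconds>" field per event; busiest_second is a
# hypothetical helper name. Ties are resolved arbitrarily.
busiest_second() {
    grep -oE 'TIME [0-9]+' |
        awk '{ n[$2]++ }
             END {
                 for (t in n) if (n[t] > max) { max = n[t]; at = t }
                 if (max) print max, at
             }'
}

# Typical use:
#   sudo dmesg | busiest_second
```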

theguyuk

Have Odroid replied yet?

Avra

Thank you for reporting the MCE problem on Gemini Lake. It helped me decide not to buy it. I already have MCE errors on an Apollo Lake ASRock J3455-ITX and thought the problem would not show up on Gemini Lake. Unfortunately, that does not seem to be the case…

avra@falcon:~$ uname -a
Linux falcon 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1 (2019-02-07) x86_64 GNU/Linux
avra@falcon:~$ dmesg | grep microcode
dmesg: read kernel buffer failed: Operation not permitted
avra@falcon:~$ sudo dmesg | grep microcode
[sudo] password for avra:
[ 0.961770] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1564253320 SOCKET 0 APIC 0 microcode 1e
[ 3.731218] microcode: sig=0x506c9, pf=0x1, revision=0x1e
[ 3.731754] microcode: Microcode Update Driver: v2.2.
avra@falcon:~$ sudo dmesg | grep mce
[ 0.935565] mce: CPU supports 7 MCE banks
[ 0.961475] mce: [Hardware Error]: Machine check events logged
[ 0.961569] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[ 0.961673] mce: [Hardware Error]: TSC 0 ADDR fef135c0
[ 0.961770] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1564253320 SOCKET 0 APIC 0 microcode 1e
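The 64-bit value printed after the bank number is the raw machine check status register, and its top bits can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS layout documented by Intel (bits 63 to 57 as flags, bits 15:0 as the MCA error code); `decode_mci_status` is a hypothetical helper name, and model-specific fields are not interpreted:

```shell
# Decode the architectural flag bits of a machine check status value given as
# 16 hex digits. Assumed layout: bit 63 VAL, 62 OVER, 61 UC (uncorrected),
# 60 EN, 59 MISCV, 58 ADDRV, 57 PCC (processor context corrupt).
# decode_mci_status is a hypothetical helper name.
decode_mci_status() {
    local hex hi lo flags pair mask name
    hex="${1#0x}"
    case "$hex" in
        ""|*[!0-9a-fA-F]*) echo "expected 16 hex digits" >&2; return 1 ;;
    esac
    [ "${#hex}" -eq 16 ] || { echo "expected 16 hex digits" >&2; return 1; }
    # Split into 32-bit halves so nothing overflows signed shell arithmetic.
    hi=$(printf '%s' "$hex" | cut -c1-8)
    lo=$(printf '%s' "$hex" | cut -c9-16)
    flags=""
    for pair in 0x80000000:VAL 0x40000000:OVER 0x20000000:UC 0x10000000:EN \
                0x08000000:MISCV 0x04000000:ADDRV 0x02000000:PCC; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( 0x$hi & $mask )) -ne 0 ]; then
            flags="$flags $name"
        fi
    done
    echo "flags:$flags"
    printf 'mca_code: 0x%04x\n' $(( 0x$lo & 0xffff ))
}
```

Calling `decode_mci_status a600000000020408` on the Bank 4 value from the log above reports the flags VAL, UC, ADDRV and PCC with MCA code 0x0408, which, if the assumed layout applies to this processor, would mean the logged event was an uncorrected one rather than a corrected cache error.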