Checking Out Machine Check Exception (MCE) Errors in Linux

I recently reviewed the ODROID-H2 with Ubuntu 19.04, and noticed some error messages in the kernel log of the Intel Celeron J4105 single board computer while running the sbc-bench benchmark:


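The messages can be pulled back out of the kernel ring buffer at any time with dmesg; since the kernel prefixes them with "mce:", a simple filter is enough to spot them:

sudo dmesg | grep -i mce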
I did not know what to make of those errors, but I was told I would get more details with mcelog, which can be installed as follows:


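On releases that still ship the package, installation would normally be a single apt command:

sudo apt install mcelog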
There’s just one little problem: it’s not in the Ubuntu 19.04 repository, and a bug report mentions mcelog is now deprecated and has been removed from Ubuntu 18.04 Bionic onwards. Instead, we’re told the mcelog package’s functionality has been replaced by rasdaemon.

But before looking into the utilities, let’s find out what a Machine Check Exception (MCE) is all about, from the ArchLinux Wiki:

A machine check exception (MCE) is an error generated by the CPU when the CPU detects that a hardware error or failure has occurred.

Machine check exceptions (MCEs) can occur for a variety of reasons ranging from undesired or out-of-spec voltages from the power supply, from cosmic radiation flipping bits in memory DIMMs or the CPU, or from other miscellaneous faults, including faulty software triggering hardware errors.

Hardware errors should probably be taken seriously. Let’s investigate how to run the tools. First, I tried to install mcelog from Ubuntu 16.04:


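Since the package is absent from the 19.04 repository, that means downloading the .deb from the Ubuntu 16.04 (Xenial) archive and installing it by hand; a rough sketch, with the exact filename depending on whatever version the archive carries:

# download the mcelog .deb for Xenial (see https://packages.ubuntu.com/xenial/mcelog), then:
sudo dpkg -i mcelog_<version>_amd64.deb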
Oh good! It installed… Let’s run some commands:


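For instance, assuming the packaged systemd unit and the standard mcelog options:

sudo systemctl status mcelog   # check the daemon started
sudo mcelog --client           # query the running daemon for logged errors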
Nothing interesting shows up here, but the file /var/log/mcelog is now up, and we can see details about the errors:


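Reading it just requires root:

sudo tail /var/log/mcelog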
But let’s also try the recommended rasdaemon to see if we can get similar details.

Installation:


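rasdaemon is available from the regular Ubuntu repositories:

sudo apt install rasdaemon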
It looks like the service will not start automatically upon installation, so a reboot may be needed, or simply run the following command:


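Something along these lines, assuming the packaged service is named rasdaemon:

sudo systemctl enable --now rasdaemon   # or just: sudo systemctl start rasdaemon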
I ran a few commands, and at first it looked like some driver might be needed:


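The message comes from ras-mc-ctl, the query tool shipped with rasdaemon; for example:

sudo ras-mc-ctl --status
# may report "ras-mc-ctl: drivers are not loaded." if no EDAC module is found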
This should be related to EDAC drivers that are used for ECC memory according to a thread on Grokbase. Gemini Lake processors do not support ECC memory, so I probably don’t need it.

Running one more command to show the summary of errors, and we’re getting somewhere:


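Presumably ras-mc-ctl again:

sudo ras-mc-ctl --summary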
12 corrected errors related to the L2 cache. We can get the full details with the appropriate command:


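Again a sketch, assuming ras-mc-ctl:

sudo ras-mc-ctl --errors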
The status is green, which means everything still works, but the utility reports a “large number of corrected cache errors”, and the “system (is) operating, but might lead to uncorrected errors soon” (see source code). It happens only a few times a day, and I’m not sure what can be done about the cache, since it’s not something that can be changed as it’s embedded into the processor; maybe it’s just an issue with the particular processor I’m running. If somebody has an ODROID-H2 running, it may be useful to check out the kernel log with dmesg to see if you’ve got the same errors. If you do, please also indicate whether you have a board from the first batch (November 2018) or one of the new ODROID-H2 Rev B boards.


39 Replies to “Checking Out Machine Check Exception (MCE) Errors in Linux”

  1. > the file /var/log/mcelog is now up

    That’s how we monitor x86 commodity hardware: installing the mcelog package, then defining a primitive rule watching the size of /var/log/mcelog. Once the size exceeds 0 you have a problem.
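    A minimal sketch of such a rule, assuming it runs from cron or a monitoring agent:

    [ -s /var/log/mcelog ] && echo "MCE events logged on $(hostname)"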

    As for the occurrences of these 2nd level cache errors: most probably they only occur once data is pumped through the CPU cores (e.g. running 7-zip or cpuminer as part of sbc-bench).

    1. Could someone provide me with some help?

      I use a small firewall with IPFire and get mce errors. How is this to be interpreted?

  2. Might be worth also posting on the ODROID-H2 forum or Ubuntu issue trackers, to maybe raise the number of board users reading this?

    1. I’m not sure it’s important yet. After all, the board works just fine. I’ll contact Hardkernel.

      1. There seem to be posts about incompatible memory on the ODROID forum. Could it be memory, or hardware/chip driver issues?

        1. I’m using the DDR4 modules provided by Hardkernel, and there aren’t any memory errors, only L2 cache errors.
          What the problem could potentially be is that the latest batch of Gemini Lake processors has some cache issues (TBC), which may explain why I have somewhat lower performance compared to earlier boards.

          1. Cache issues in CPUs do happen. I’ve been hit many times with those in UltraSparc processors for example (and my U5 is sick again due to this for the second time by the way). I think modern CPUs use parity to better resist issues caused by heat and other random events. It’s “simple” enough to drop a line and re-read it when reading inconsistent data (it’s more difficult when changes are present but nothing prevents from re-reading the faulty word if it was not modified). In case that’s what you’re facing, there could be a corner case of locally-modified data which has changed before re-reading it and in this case it’s guaranteed data corruption. Better ask HK about it. If your board is the only one with this problem they might prefer to exchange it. I do have the exact same CPU on another board (asrock) and am not facing any such issues. I’ve just installed rasdaemon which noticed nothing either.

          2. Some Intel chips go beyond parity checking and have ECC on L2. But I don’t know if this chip supports it; Intel doesn’t seem to document that openly.

            Found a command to check that: run sudo dmidecode and look for “Error Correction Type”. On my i7-8650U it shows L1D is parity protected, L2 is one-bit ECC, and L3 is multi-bit ECC.
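            A narrower query, as a sketch (the cache information lives in the DMI cache records):

            sudo dmidecode -t cache | grep -E 'Socket Designation|Error Correction Type'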

          3. Yes, I’m pretty sure that high-end chips do have ECC, but here we’re talking about an Atom, basically. Well, marketing-wise it’s a Celeron 🙂

          4. willy, you’re correct for DRAM ECC, here we’re talking about the internal caches 😉

          5. Laurent, I was also speaking about internal caches 🙂 I’m sure I’ve seen it mentioned a few times in the past for Xeons and have some memories of seeing things like “L2 cache ECC” in some server BIOSes in the past. Note that it might have been quite old since my memory associates this to L2 not L3. Also IBM’s POWER8 and above definitely use ECC for L2 & L3.

          6. The link doesn’t work here, it displays “privatebin is a minimalist …”.

          7. Tried already, and again, but same result. It’s no big deal though, don’t worry 🙂

  3. Hi there,

    This is not normal. One error once every couple of months due to cosmic rays, that is normal.

    You are being warned because at that rate there is something wrong with the hardware or too much EMI from another device.

    1. > at that rate there is something wrong with the hardware

      So far only ECCed (corrected) single-bit flips in the L2 cache. It slightly harms performance and will only become a real issue once two bits flip at the same time. While I wouldn’t trust such a CPU that much, for the average use case a Gemini Lake box is used for (media center, desktop) this shouldn’t matter that much.

      1. Physics is physics. You’re going to get bit flips in caches. Beta particles from C-14 decomposition. Beta particles from the Si itself, etc. Then there’s cosmic rays that you just can’t avoid as they’re everywhere. As transistors get smaller and smaller, Johnson–Nyquist noise becomes a bigger issue. The error may have even occurred due to a transmission error on the chip–the right value may have been stored, but it got corrupted between the storage element and the ECC block.

        As long as the ECC is catching single bit errors at a relatively low rate, there’s nothing to worry about. If it’s seeing double bit errors, then it’s time to be concerned.

        1. Sometimes there are 4 in the same second; this is far too much. Something is busted in this machine, and its ability to recover from all the events you enumerated is affected by this existing one. The probability of two-bit *unrelated* errors remains very low. But if the hardware is defective, the probability of two-bit errors in the same cache line cannot be dismissed.

    2. My SSD makes some noise. Maybe that’s the source of the problem. I’ll remove it to check out what happens.

      1. That is usually coil whine and is not the issue.

        I’d try setting a lower maximum clock and see if the problem goes away. I’d want to check if power regulation/supply isn’t deficient.

        If the problem is the hardware, a torture test like Prime 95 will also make it much worse.

        1. I don’t use an x86 MB/memory combo without first doing at least 24 hours of memtest86 and then Prime95 on torture test. That helps validate memory, processor, power delivery, etc.

          1. I used to run memtest86 for 24 hours, too. But then I got burned by broken RAM that memtest86 was not able to find no matter how long I ran it. I didn’t know about Prime95 at that time, but running sha1sum over all files in the filesystem repeatedly returned different hashes for the same files. And the problem went away after removing 2 of the 4 memory modules.

            As such, I think running memtest86 is just a waste of time. Prime95 / mprime -t is a much better option. However, you need to manually adjust the RAM usage to cover all chips. If you want to test your whole system, run max load *at the same time* (e.g. if you have a high-end GPU, run 3D benchmarks, and run fio random reads and writes to all storage devices).
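            For the storage part, a hedged fio example (it writes to a scratch file rather than a raw device; adjust size and runtime as needed):

            fio --name=stress --filename=$HOME/fio-stress --size=4G --rw=randrw --bs=4k --ioengine=libaio --iodepth=32 --time_based --runtime=600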

        2. I’ve disconnected all SATA data and power cables, and the errors are still there. It seems even worse than the first time I ran sbc-bench.sh:

          Before I ran sbc-bench.sh I had 924 errors logged, and after:

          I’ll try to change the max frequency and see what happens.
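          One way to cap the maximum frequency from Linux rather than the BIOS would be cpupower (a sketch, assuming the cpufreq driver honours it):

          sudo cpupower frequency-set -u 2.0GHz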

          1. I’ve gone to the BIOS and changed two settings:

            ran sbc-bench again, and all MCE errors are gone:

          2. > I’ve gone to the BIOS and changed two settings

            And you lost around 1/3 of the CPU performance 🙂

            Smells a bit like unstable high DVFS OPP…

          3. Yeah, none of those modes should be the cause of this unless they’re bugged. I agree with Kaiser, it’s probably a DVFS bug. Could also be something in the hardware design – like capacitors – which can’t handle higher current swings.

      2. BTW, I once had an issue with a motherboard BIOS where it would have correctable and uncorrectable L1/L2 errors when doing light loads.

        The board was initially K8 only but eventually supported Phenom 1 and 2.

        Somewhere something broke, as the motherboard induced errors if I enabled frequency scaling on the K8 but worked fine with a Phenom II – which used a different version of C&Q.

        K8 CPU worked fine if set at any fixed speed, just couldn’t change on demand. Prob was changing clocks too fast, without letting the VRM stabilize at the higher voltages first.

    1. Yes:

      I couldn’t see any L2 cache error correction yet since I booted my H2 around 30 hours ago.

      I think there should be no critical issue probably because any single bit error in the Cache memory was corrected automatically.

      1. We could have hoped for better, like “given that yours shows problems we cannot reproduce, it definitely indicates a hardware issue and a risk of accelerated aging, we’re going to replace it”. Their response is a bit disappointing.

        1. For full context, it’s a review sample, and I did not pay for it. Hardkernel just sent the kit to me free of charge.

          1. OK, that’s understandable then. Still, they’d better make some statement like “oh, we know our early samples were not perfect” than let doubt linger about their hardware.

          2. I agree with Willy – most manufacturers would want the board back for further testing. Doesn’t give much confidence that they will do proper Q/A.

  4. Thank you for reporting the MCE problem on Gemini Lake. It helped me decide not to buy it. I already have MCE errors on an Apollo Lake ASRock J3455-ITX and thought that the problem would not show on Gemini Lake. Unfortunately, it does not seem to be the case…

    avra@falcon:~$ uname -a
    Linux falcon 4.19.0-0.bpo.2-amd64 #1 SMP Debian 4.19.16-1~bpo9+1 (2019-02-07) x86_64 GNU/Linux
    avra@falcon:~$ dmesg | grep microcode
    dmesg: read kernel buffer failed: Operation not permitted
    avra@falcon:~$ sudo dmesg | grep microcode
    [sudo] password for avra:
    [ 0.961770] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1564253320 SOCKET 0 APIC 0 microcode 1e
    [ 3.731218] microcode: sig=0x506c9, pf=0x1, revision=0x1e
    [ 3.731754] microcode: Microcode Update Driver: v2.2.
    avra@falcon:~$ sudo dmesg | grep mce
    [ 0.935565] mce: CPU supports 7 MCE banks
    [ 0.961475] mce: [Hardware Error]: Machine check events logged
    [ 0.961569] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
    [ 0.961673] mce: [Hardware Error]: TSC 0 ADDR fef135c0
    [ 0.961770] mce: [Hardware Error]: PROCESSOR 0:506c9 TIME 1564253320 SOCKET 0 APIC 0 microcode 1e

    1. I think it’s a matter of luck. Most processors won’t have this issue, but some will.
      Your problem looks a bit different. Does it cause you trouble, or does it just log errors without actual user-facing issues?

  5. This is an error in the processor. It could be induced by bad power or inappropriate clock speed. Other than that, it is likely a busted processor chip.

    If you are just using the machine for benchmarking, who cares.
    Otherwise, I would discard the board.
    (I’m assuming that the processor is not socketed.)

    The right way of thinking about ECC is as a life preserver. Your boat is sinking, but you can float long enough to get to another boat. You don’t use it to try to keep the old boat floating.

  6. The “Drivers not loaded” message is a bit spurious. It just looks in /proc/modules for anything having edac in the name and assumes that is good, regardless of whether it is the right driver for your machine. If it doesn’t see anything, for example if the correct edac driver is built in to the kernel, then it reports that the driver isn’t loaded. So the message has nothing to do with MCE and may or may not be reporting the status of EDAC monitoring.
