October 17, 2009

mcelog - monitor hardware issues

It is all too common to get a call in the middle of the night, or to walk into the office, and find that a server has failed because of a hardware problem.

What is mcelog?

From the project site description:

mcelog decodes machine check events (hardware errors) on x86-64 machines running a 64-bit Linux kernel. It should be run regularly as a cron job on any x86-64 Linux system (if it is not in the default packages on your x86-64 distribution, please complain to your distributor). It can also decode machine check panic messages from console logs.

Before we go on, it is important to understand what an MCE is.

MCE stands for Machine Check Exception, a feature of AMD and Intel x86 CPUs that reports hardware problems, including unrecoverable ones, such as communication errors between the motherboard and the CPU, CPU cache errors, and memory ECC errors.

A typical MCE log entry looks like this:

CPU 0: Machine Check Exception: 0000000400000000<0> fault: 0000 CPU: 0 EIP: 0010:[mcheck_fault+225336] EFLAGS: 00010246 eax: 00000115 ebx: 72000000 ecx: 00000405 edx: 72000000 esi: 00000004 edi: 00000003 ebp: 00000115 esp: c3187f94 ds: 0018 es: 0018 ss: 0018

A program like syslogd will write the message to the console or to the kernel log; if the machine crashes, the message appears only on the console.

mcelog therefore decodes those machine check events, which the kernel makes available through the special device /dev/mcelog.
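The project description also mentions decoding panic messages from console logs. Here is a minimal sketch of how that might look, assuming the console output was saved to a file (the path in the example is hypothetical) and that your mcelog build supports the --ascii option, which reads log text from standard input. Check the man page before relying on it.

```shell
#!/bin/sh
# Decode machine check lines from a saved console log.
# Assumptions: the file name is hypothetical, and the installed mcelog
# supports --ascii (decode log text from standard input).
decode_console_log() {
    file=$1
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    mcelog --ascii < "$file"
}

# Example: decode_console_log /tmp/console.log
```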

Work with mcelog

mcelog should be run as a cron job:

/usr/sbin/mcelog --generic --ignorenodev --filter >> /var/log/mcelog
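As a rough sketch, an entry in root's crontab (edited with crontab -e) could run that command once an hour. The hourly schedule is my assumption, so pick whatever interval suits your machines.

```crontab
# m h dom mon dow  command
# Hourly schedule is an assumption; adjust as needed.
0 * * * * /usr/sbin/mcelog --generic --ignorenodev --filter >> /var/log/mcelog
```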

Make sure to check the man page of mcelog for all the options.

I would recommend setting up a script to email you in case of alerts, or even piping your mcelog output through a socket.
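Such an email script could look roughly like the sketch below. It remembers how many lines of the log it has already seen and mails only the new ones. The state file location, the recipient, the --send switch and the availability of a mail command are all assumptions of mine, not something mcelog provides.

```shell
#!/bin/sh
# Sketch: mail any mcelog output that appeared since the previous run.
# Assumptions: log path as written by the cron command, a hypothetical
# state file, recipient "root", and a working mail(1) on the host.
LOG="${MCELOG_FILE:-/var/log/mcelog}"
STATE="${MCELOG_STATE:-/var/tmp/mcelog.lines}"
RCPT="${MCELOG_RCPT:-root}"

# Print the lines of log $1 added since the line count stored in
# state file $2, then store the current line count.
new_lines() {
    log=$1
    state=$2
    if [ ! -f "$log" ]; then
        return 0
    fi
    seen=0
    if [ -f "$state" ]; then
        seen=$(cat "$state")
    fi
    total=$(wc -l < "$log" | tr -d ' ')
    # Log was truncated or rotated: start again from the top.
    if [ "$total" -lt "$seen" ]; then
        seen=0
    fi
    if [ "$total" -gt "$seen" ]; then
        tail -n +"$((seen + 1))" "$log"
    fi
    echo "$total" > "$state"
}

# Only send mail when called with --send, so the function above can
# be tried out on its own.
if [ "${1:-}" = "--send" ]; then
    out=$(new_lines "$LOG" "$STATE")
    if [ -n "$out" ]; then
        printf '%s\n' "$out" | mail -s "mcelog: hardware events on $(uname -n)" "$RCPT"
    fi
fi
```

Run it from cron a few minutes after the mcelog job, with --send as its only argument.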

That’s it. Hopefully from now on you can catch hardware errors before they turn into a kernel panic :-)