mcelog – monitor hardware issues
It is all too common to get a call in the middle of the night, or to walk into the office, and find that a server has failed because of a hardware problem.
What is mcelog?
From the project site description:
mcelog decodes machine check events (hardware errors) on x86-64 machines running a 64-bit Linux kernel. It should be run regularly as a cron job on any x86-64 Linux system (if it is not in the default packages on your x86-64 distribution, please complain to your distributor). It can also decode machine check panic messages from console logs.
Now, before we go on, it is important to understand what MCE is.
MCE stands for Machine Check Exception, a feature of AMD and Intel processors that lets the CPU detect and report hardware problems such as “communication errors between the motherboard and the CPU, CPU cache errors, memory ECC errors, etc.”
A common MCE log error would look like this:
CPU 0: Machine Check Exception: 0000000400000000<0>
fault: 0000
CPU: 0
EIP: 0010:[mcheck_fault+225/336]
EFLAGS: 00010246
eax: 00000115 ebx: 72000000 ecx: 00000405 edx: 72000000
esi: 00000004 edi: 00000003 ebp: 00000115 esp: c3187f94
ds: 0018 es: 0018 ss: 0018
A logging daemon such as syslogd writes the message to the console or to the kernel log; if the machine crashes, it reaches only the console.
mcelog “decodes” these machine check events, which the kernel keeps in a buffer exposed through the special device /dev/mcelog.
Work with mcelog
mcelog should be run as a cron job:
/usr/sbin/mcelog --generic --ignorenodev --filter >> /var/log/mcelog
Make sure to check the man page of mcelog for all the options.
I would recommend setting up a script that emails you when new errors show up, or even, why not, to “pipe your mcelog through a socket”.
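As a rough illustration of the email idea, here is a minimal Python sketch. It is not part of mcelog. It assumes the cron job appends to /var/log/mcelog as shown above and that a mail server accepts messages on localhost. The state file path, the sender and the recipient are made-up placeholders, so adjust them to your setup.

```python
#!/usr/bin/env python3
"""Email whatever the mcelog cron job appended to its log since the last run."""
import os
import smtplib
from email.message import EmailMessage

LOG_PATH = "/var/log/mcelog"           # file the cron job appends to
STATE_PATH = "/var/tmp/mcelog.offset"  # hypothetical: remembers how far we read
MAIL_TO = "root@localhost"             # hypothetical recipient
MAIL_FROM = "mcelog@localhost"         # hypothetical sender


def read_offset(state_path):
    """Return the byte offset saved by the previous run, or 0 if there is none."""
    try:
        with open(state_path) as fh:
            return int(fh.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0


def read_new_text(log_path, offset):
    """Return (text appended after offset, new offset)."""
    if os.path.getsize(log_path) < offset:
        offset = 0  # log was rotated or truncated, so start from the top
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    return data.decode("utf-8", errors="replace"), offset + len(data)


def build_message(text, hostname):
    """Wrap the new log lines in a plain-text email."""
    msg = EmailMessage()
    msg["Subject"] = f"mcelog: new machine check events on {hostname}"
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content(text)
    return msg


def main():
    if not os.path.exists(LOG_PATH):
        return  # nothing has been logged yet
    text, new_offset = read_new_text(LOG_PATH, read_offset(STATE_PATH))
    if text.strip():
        # Assumption: an SMTP server is listening on localhost, port 25.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(build_message(text, os.uname().nodename))
    # Save the offset only after a successful send, so a failed run is retried.
    with open(STATE_PATH, "w") as fh:
        fh.write(str(new_offset))


if __name__ == "__main__":
    main()
```

Run it from cron a minute or so after the mcelog job. Because it remembers a byte offset, each event is mailed once, and the offset resets to zero if the log gets rotated.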
That’s it. Hopefully, from now on you can catch hardware errors before they end in a kernel panic.
Cheers,