
ID #1061

What is an MCE failure? Why is it running alongside the disk burn test?

What is an MCE failure?

 

MCE stands for Machine Check Exception, an error that the CPU detects in the hardware and reports to the kernel. MCEs should not be ignored.

 

If the kernel reports MCEs, it is highly likely that the hardware it is running on is not functioning properly and that the vendor needs to fix something.

 

Most commonly, MCEs are uncovered during Bright's burn (stress test) of the cluster.

 

Why is it running alongside the disk burn test?

Quite often the underlying problem is memory-related. The mce_check burn test constantly monitors the kernel for MCE reports, which is why it runs in parallel with the disk burn test, as well as with almost all other tests. In some cases, stressing the disks also triggers an MCE. The exact MCEs are logged to a file in the node's burn spool.
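
To give a rough idea of what "monitoring the kernel for MCE reports" can look like, here is a minimal Python sketch that filters kernel log lines for MCE-like messages. It is not how mce_check is implemented. The function name, the match pattern, and the sample log lines are all assumptions made for illustration.

```python
import re

# Assumed pattern: kernel MCE messages usually contain "mce:" or
# "Machine Check". Adjust it to the messages your kernel actually emits.
MCE_PATTERN = re.compile(r"\bmce:|machine check", re.IGNORECASE)


def find_mce_lines(log_lines):
    """Return the log lines that look like MCE reports (hypothetical helper)."""
    return [line.rstrip("\n") for line in log_lines if MCE_PATTERN.search(line)]


# Illustrative sample lines, made up for this example (not from a real node).
SAMPLE_LOG = [
    "[   12.345678] EXT4-fs (sda1): mounted filesystem\n",
    "[  982.114201] mce: [Hardware Error]: Machine check events logged\n",
    "[  982.114310] CPU 3: Machine Check Exception: 5 Bank 4\n",
    "[ 1001.000000] usb 1-1: new high-speed USB device\n",
]

if __name__ == "__main__":
    # Print only the lines that match the assumed MCE pattern.
    for hit in find_mce_lines(SAMPLE_LOG):
        print(hit)
```

On a real node, the same function could be fed the lines of the kernel log (for example the output of dmesg, which may require root) instead of the sample list.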

 

For more on doing burns in general, see the appendix on "Burning Nodes".
