
Help with MCE Error Analysis

idata
Employee

Our server is logging several machine check exceptions (MCEs) that I cannot analyse further because I cannot find the meaning of the error codes. After further research, it was suggested that I ask the chip vendor for help. I would be very thankful for any ideas, opinions or clarifications any of you could give me.

The board supports Linux. The system is:

Proxmox Version: 3.0-23/957f0862

Kernel: 2.6.32-20-pve

The following logs are attached:

kern.log.zip - The zip-compressed kernel log covering 2 days, showing MCE errors very soon after reboot

mcelog.zip - The zip compressed log from the mce error logging tool "mcelog"

lspci - The configuration of the system as shown by lspci -vv

lscpu gives the following information:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               2194.845
BogoMIPS:              4388.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

Thank you very much

6 Replies
Silvia_L_Intel1
Employee

Hello SimonBusinessEngineering, please let me look into this so that I can help you.

In the meantime, would you please provide the model numbers of the server board and the processor you are using?
idata
Employee

Hello Silvia,

Thank you for your time. The model number of the server board is S2400SC and the processor is an Intel Xeon E5-2430 (2.2 GHz, 7.2 GT/s, 15 MB cache, 6 cores).

Unfortunately I cannot find the sSpec of the processor because the distributor did not attach a sticker anywhere.
Silvia_L_Intel1
Employee

SimonBusinessEngineering, MCE errors like these usually indicate a failing DIMM or a faulty DIMM socket, or possibly a defective CPU.

The errors you are getting appear to be related to memory channel 3.
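
For readers who want to check the channel themselves, the architectural part of a machine-check status word can be decoded by hand. The following is a minimal Python sketch, assuming the bank reports a memory controller error in the compound form described in the machine-check chapter of the Intel SDM (000F 0000 1MMM CCCC in the low 16 bits of IA32_MCi_STATUS). The function name decode_mci_status and the sample status value are made up for illustration; they are not taken from the attached logs.

```python
# Sketch: decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions follow the Intel SDM machine-check architecture chapter.
# The names below are illustrative, not part of mcelog or the kernel.

MEM_TXN = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Return the flag bits and, for memory controller errors, the channel."""
    info = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "mca_code": status & 0xFFFF,              # bits 15:0
    }
    code = info["mca_code"]
    # Memory controller errors use the form 000F 0000 1MMM CCCC.
    # The mask 0xEF80 ignores F (bit 12), MMM and CCCC when matching.
    if (code & 0xEF80) == 0x0080:
        info["kind"] = "memory controller error"
        info["transaction"] = MEM_TXN.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # CCCC == 1111 means the channel is not specified.
        info["channel"] = None if channel == 0xF else channel
    else:
        info["kind"] = "other"
    return info


# Hypothetical status word: VAL and ADDRV set, MCA error code 0x0093.
sample = (1 << 63) | (1 << 58) | 0x0093
result = decode_mci_status(sample)
print(result["kind"], result["transaction"], result["channel"])
```

With a status whose low 16 bits are 0x0093, this reports a memory read error on channel 3. How that logical channel maps to a physical DIMM slot depends on the board, so the board documentation is still needed to pick the DIMM to pull first.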

 

Would you please let me know if this is a new build? If not, was the system running well before it started to fail?

I suggest removing as many DIMMs as possible to determine whether the problem is DIMM-related.

If you find that there is a problem with a DIMM, I recommend contacting our Warranty department so they can help you replace the server board.

http://www.intel.com/p/en_US/support/contactsupport
idata
Employee

Thank you for the information. I will try to find out whether it is a DIMM-related problem.

To answer your question: the server had an uptime of almost a year and I was very satisfied with it until, without any change in usage, it started to fail. Reproducing the problem with less memory will probably take the rest of the week. I will write as soon as I have more information or have been able to fix or isolate the problem. Thank you again for your time.
Silvia_L_Intel1
Employee

Any time, SimonBusinessEngineering!
idata
Employee

After checking every DIMM, I found one with which the server would not even start. Thank you very much. We are going to contact the vendor now for a replacement.