Diagnosing source of mcelog errors under Linux for Pentium D 950

idata
Employee

We had an event at the start of this month in which the office ambient temperature went out of spec (normally under 77 °F, but for a few days it spiked into at least the mid 80s). Unfortunately, after that was addressed we began to see mcelog errors on a Linux workstation, and the errors have become ever more frequent.

We suspect processor damage. The 950 is already quite sensitive to temperature, since it runs at 3.4 GHz and sits near an older high-end NVIDIA graphics card, but we would like to confirm this before buying and installing a replacement processor, rather than swapping out memory or the motherboard.

Note that, to test, we tried to change the processor frequency, but the Pentium D doesn't support this, so we had to adjust the ACPI T-states (throttling states) instead. Interestingly, with a T-state of 4, which ensures that only one core has load at any one time, the mcelog errors go away and the system seems completely stable, although very, very slow. I would assume this reinforces the theory that it is a processor-related issue and not motherboard/RAM.

The mcelog errors generally follow this pattern:

Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Instruction CACHE Level-3 Instruction-Fetch Error
Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS 9000000000000153 MCGSTATUS 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
Feb 19 00:10:03 hyperion.professionalsysadmin.com HARDWARE ERROR. This is *NOT* a software problem!
Feb 19 00:10:03 hyperion.professionalsysadmin.com Please contact your hardware vendor
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCE 30
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MISC 140002d0002a0 ADDR 1b83041c0
Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com Error overflow
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi_MISC register valid
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi_ADDR register valid
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Generic CACHE Level-1 Snoop Error
Feb 19 00:10:03 hyperion.professionalsysadmin.com Corrected events: 255
Feb 19 00:10:03 hyperion.professionalsysadmin.com MISC format 0 value 140002d0002a0
Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS cc0000ff20040189 MCGSTATUS 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
Feb 19 00:10:03 hyperion.professionalsysadmin.com HARDWARE ERROR. This is *NOT* a software problem!
Feb 19 00:10:03 hyperion.professionalsysadmin.com Please contact your hardware vendor
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCE 31
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 1
Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Data CACHE Level-1 Data-Read Error
Feb 19 00:10:03 hyperion.professionalsysadmin.com Corrected events: 200
Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS 800008c800000135 MCGSTATUS 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6

Note that these all seem to be cache errors. I'm assuming the caches must be shared somehow between the cores, and that the damage might be interfering with their ability to lock changes, etc.
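To double-check that reading, the STATUS values in the log can be decoded by hand. The following is only a rough sketch, not mcelog's own decoder: the flag bits and the low 16-bit cache error code follow the architectural IA32_MCi_STATUS layout, while the function name `decode_status` and the `corrected_count_guess` field are my own. The corrected-event count position (bits 39:32) is an assumption inferred from the log itself, where it happens to match the "Corrected events" lines. mcelog also appears to print level code 3 as "Level-3", which the sketch labels "generic".

```python
# Sketch of an IA32_MCi_STATUS decoder for the cache-hierarchy error codes
# seen in the log. Names here are illustrative, not part of mcelog.

REQUEST = {
    0: "generic error", 1: "generic read", 2: "generic write",
    3: "data read", 4: "data write", 5: "instruction fetch",
    6: "prefetch", 7: "eviction", 8: "snoop",
}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its flag bits and error code."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "enabled": bool((status >> 60) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "mca_code": code,
        # Cache-hierarchy codes have the form 0000 0001 RRRR TTLL.
        "is_cache_error": (code >> 8) == 0x01,
        # ASSUMPTION: bits 39:32 hold the corrected-event count on this CPU;
        # inferred from the log, not taken from documentation.
        "corrected_count_guess": (status >> 32) & 0xFF,
    }
    if fields["is_cache_error"]:
        fields["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        fields["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "unknown")
        fields["level"] = LEVEL[code & 0x3]
    return fields


if __name__ == "__main__":
    # The three STATUS values from the log excerpt.
    for raw in (0x9000000000000153, 0xCC0000FF20040189, 0x800008C800000135):
        d = decode_status(raw)
        print(f"{raw:016x}: {d['transaction']} cache, level {d['level']}, "
              f"{d['request']}, uncorrected={d['uncorrected']}, "
              f"overflow={d['overflow']}")
```

All three values decode as corrected (not uncorrected) cache-hierarchy errors, which agrees with what mcelog printed.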

Can we be confident that this is a processor damage issue?

I've checked the computer interior, cleaned it out, and verified that the system fan is running properly and that the temperatures and voltages look correct:

With the ACPI T-state at 4:

processor temp is ~48-50 °C

motherboard temp seems stable at ~35 °C

video card ambient temp is 39 °C

video card core temp is about 51 °C

Initially the errors occurred just a few times a day, but by yesterday they were occurring roughly every 15 minutes, until the T-state was switched to 4.
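For anyone who wants to track the error rate more precisely than by eye, here is a small sketch that counts mcelog records per hour. It assumes each record carries exactly one "TIME <epoch>" line, as in the excerpt, and buckets by UTC hour; the function name `mce_counts_per_hour` and the log path mentioned in the comment are hypothetical and will vary by distribution.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Each mcelog record in the excerpt has one "TIME <epoch> ..." line.
TIME_RE = re.compile(r"\bTIME (\d+)\b")


def mce_counts_per_hour(lines):
    """Count mcelog records per UTC hour, keyed as 'YYYY-MM-DD HH:00'."""
    counts = Counter()
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the log above. In practice the lines would
    # come from the syslog file mcelog writes to, for example a hypothetical
    # open("/var/log/messages").
    sample = [
        "Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012",
        "Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 1",
        "Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012",
    ]
    for hour, count in sorted(mce_counts_per_hour(sample).items()):
        print(hour, count)
```

Comparing the hourly counts before and after a change such as the T-state switch gives a firmer basis for saying the errors stopped.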

Adolfo_S_Intel2
Employee

The only way to be sure that the processor is the defective component is to test the processor on a second motherboard to see whether it causes the same behavior, or to test another processor in your system.
