Linux reaction on AMD Phenom x4 overheating

I’ll beat you with your spinal cord
Split your skull in two
I’ll feast on your intestines
There’s nothing I can’t do

(Iced Earth – Violator)

The next day my SSH console greeted me with some nice and unexpected output:

Message from syslogd@mon at Mar  7 20:51:28 ...
 kernel:[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c5c40e1011c011b

Message from syslogd@mon at Mar  7 20:51:28 ...
 kernel:[Hardware Error]: Northbridge Error (node 0, core 3): L3 ECC data cache error.

Message from syslogd@mon at Mar  7 20:51:28 ...
 kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

L3 is a CPU cache, so I guess my CPU went berserk, or at least one of its cores did. It’s a 3GHz AMD Phenom(tm) II X4 945 processor with 4 cores. I had heard stories about 3-core CPUs being 4-core parts with one core locked because it showed some instability during manufacturing. So I thought maybe this CPU had been on the borderline of being marked as a 3-core? But to my surprise, a new error appeared, this time saying “node0, core0”.

This is how the error looks in /var/log/messages:

Mar  7 06:51:28 node kernel: [Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc5c40e0011c017b
Mar  7 06:51:28 node kernel: [Hardware Error]: Northbridge Error (node 0): L3 ECC data cache error.
Mar  7 06:51:28 node kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: EV
Mar  7 06:51:28 node kernel: [Hardware Error]: Machine check events logged
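
For the curious, the flags in the square brackets can be recovered from the hex value itself. Below is a minimal Python sketch, not the kernel's actual code: the positions of bits 63 to 57 follow the generic x86 machine-check status layout, while treating bits 46:45 as the ECC indicator is my assumption for this AMD family. The function names are made up for illustration.

```python
# Bit positions in an x86 machine-check status register (MCi_STATUS).
FLAG_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten before being read
    "UC": 61,     # uncorrected error (clear means corrected, shown as "CE")
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # the MISC register holds additional information
    "AddrV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mc_status(status):
    """Return the status flags of a 64-bit MCi_STATUS value as a dict."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    # Assumption: bits 46:45 mark ECC errors on this AMD family,
    # with the value 2 meaning a corrected ECC error.
    ecc = (status >> 45) & 0x3
    if ecc == 0:
        flags["ECC"] = None
    else:
        flags["ECC"] = "CECC" if ecc == 2 else "UECC"
    return flags


def format_flags(status):
    """Format the flags in the same order as the bracketed list in the log."""
    f = decode_mc_status(status)
    parts = [
        "Over" if f["Over"] else "-",
        "UE" if f["UC"] else "CE",
        "MiscV" if f["MiscV"] else "-",
        "PCC" if f["PCC"] else "-",
        "AddrV" if f["AddrV"] else "-",
    ]
    if f["ECC"]:
        parts.append(f["ECC"])
    return "|".join(parts)


if __name__ == "__main__":
    for value in (0x9C5C40E1011C011B, 0xDC5C40E0011C017B):
        print(hex(value), format_flags(value))
```

Fed the two values from the logs, this reproduces both flag lists. The only difference between them is the “Over” bit: by the time the second error was logged, an earlier one had already been overwritten. In both cases the error was a corrected one (“CE”).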

A friend of mine had an idea: maybe everything is OK with the hardware, but it’s overheating?! Well, nice idea, let’s check:

fan1:       3139 RPM  (min =    0 RPM)
fan2:          0 RPM  (min =    0 RPM)
fan3:          0 RPM  (min =    0 RPM)
fan5:          0 RPM  (min =    0 RPM)
temp1:       +32.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:       +77.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:       +68.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
cpu0_vid:   +1.050 V

So off we went to clean out the dust. After vacuuming, the temperatures dropped sharply, as did the RPMs of the CPU fan:

fan1:       2410 RPM  (min =    0 RPM)
fan2:          0 RPM  (min =    0 RPM)
fan3:          0 RPM  (min =    0 RPM)
fan5:          0 RPM  (min =    0 RPM)
temp1:       +30.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:       +46.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:       +43.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
cpu0_vid:   +1.050 V
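
To put a number on the difference, here is a small Python sketch that parses the two readings above and computes the drop per sensor. It assumes the text is in the lm-sensors style shown here; the 70°C warning limit is a value I picked for illustration, not a vendor specification, and the function names are hypothetical.

```python
import re

# Matches lines such as "temp2:       +77.0°C  (low  = ...".
TEMP_LINE = re.compile(r"^(temp\d+):\s*([+-]?\d+(?:\.\d+)?)°C")

BEFORE = """\
temp1:       +32.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:       +77.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:       +68.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
"""

AFTER = """\
temp1:       +30.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:       +46.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
temp3:       +43.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
"""


def parse_temps(text):
    """Map each sensor name to its current reading in degrees Celsius."""
    temps = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            temps[match.group(1)] = float(match.group(2))
    return temps


def temp_drop(before, after):
    """Return how many degrees each sensor dropped between two readings."""
    return {name: before[name] - after[name] for name in before if name in after}


def hot_sensors(temps, limit):
    """List sensors at or above a warning limit (the limit is an assumption)."""
    return sorted(name for name, value in temps.items() if value >= limit)


if __name__ == "__main__":
    before = parse_temps(BEFORE)
    after = parse_temps(AFTER)
    print("drop:", temp_drop(before, after))
    print("hot before cleaning:", hot_sensors(before, 70.0))
    print("hot after cleaning:", hot_sensors(after, 70.0))
```

The same parsing could be pointed at live output and run from cron to send a warning before the machine check errors start showing up again.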

I guess that’s it: 4 days and no more errors from the CPU…

  1. October 17, 2014 at 5:02 pm

    You have no idea how much this helped. It’s unbelievable how temp-sensitive these things are. 77°C does not seem extreme: my servers have an alarm threshold at 97°C.
