Latest Tweets

Pattern-finding in the maelstrom: log-analysis.

Odd behaviour

A node running Suse Linux was acting oddly, indeed. According to the last command, there were different reboots at a very weird times, like midnight or so. Not only that: we found  a hypothetical console login from the superuser past midnight! Below, what we got running the last command:

reboot   system boot  2.6.21.1-default Fri Jan 21 00:25         (11+12:35)
root     tty1                          Fri Jan 21 00:22 – 00:25  (00:02)
reboot   system boot  2.6.21.1-default Fri Jan 21 00:21          (00:04)
reboot   system boot  2.6.21.1-default Thu Jan 20 23:33          (00:00)
reboot   system boot  2.6.21.1-default Thu Jan 20 23:29          (00:04)
reboot   system boot  2.6.21.1-default Thu Jan 20 23:26          (00:01)
reboot   system boot  2.6.21.1-default Thu Jan 20 23:23          (00:01)
reboot   system boot  2.6.21.1-default Thu Jan 20 23:20          (00:03)
reboot   system boot  2.6.21.1-default Thu Jan 20 23:17          (00:07)

To put it plain, it looked like we had been hacked. What aroused my curiosity this time was, above all, the idea of having a real user rebooting the computing node in the flesh, according to last output, at a such odd time of day. Indeed, what kind of hacker could be interested in acting like that? It seemed completely foolish. So, I decided to analyse the logs, as usual, in order to determine the real issue. Had we been hacked, after all? Was it because of an unknown BUG? Was it related to some sort of a CPU-peek process anomaly crashing the entire system down? Was it a CPU-temperature issue?

Coming to a hypothesis.

The computing node was powered off for almost a whole month due to motherboard and processor problems. Thus, I replaced them and put the node to work for two days. It seemed it was working so far so good, it had not been for this odd rebooting behaviour. So, the node was working fine, even when we detected these reboots. There was no CPU temperature issue involved, as suggested, and the system was apparently stable. The reboots occurring, according to last output, every three minutes or so, showed me some sort of a pattern. It is true that just before replacing the processor unit, we got a CPU-temperature error. It was correctly reported through  syslog, as shown next:

Jan 20 23:33:22 node108 kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 1)

But I did replace it, so the theory of having odd reboots as a consequence of excessive CPU temperature was totally out of the question. Then, the six consecutive reboots had nothing to do with the processor behaviour. At the same time, the superuser login on /dev/tty1 – that is, the actual console -, was difficult to explain. Was it some user login in as a root in order to reboot the computing node direclty? Why? It is perfecly feasible to reboot a computing node, any computing node as a matter of fact, remotely. I knew I logged myself in during my computing node tests right after replacing the processor and I did reboot the node from the console after the exact same time reported by the last command.

Thus, I tried to find a valid explanation for the six reboots in a row. When I replaced the motherboard, I used the same processor at first. Then, I got that awful temperature above threshold message, as shown earlier. I remembered the time spent for the system to boot completely and offering me a tty login screen was about three minutes. Then, I detected the cpu temperature messages and rebooted the computing node.

So, I could explain pretty well what those reboots were about, even that superuser login in the console and posterior reboot was about. It seemed there was only one hypothesis to be produced.

Hypothesis: the system’s clock is set badly

In order to make this hypothesis true, I had to prove two different things. Firstly, there had to be messages in the log files reporting date and time errors. So, I looked for them and I found them:

34 ntp:20 Jan 20:28:17 ntpd[3319]: synchronized to 10.0.1.1, stratum 3
35 ntp:20 Jan 20:54:01 ntpd[3319]: time reset -0.210387 s

Secondly, if that was true, the system log files had to be written chaotically, that is, in a no ordered manner. For instance, annotating future entries in previous file row positions. Let t be a reasonable date-time entry in any system log file, that meant we could find another t’, t”’, t”’ … date time annotations log-entries proving that t>>t’>>t”>>t”’>>… and being placed all along any system log file this way:

t , t”, t, t”’, t, t”, t”’, t …,

To sum up, the log file had to be erroneously annotated.

To prove that, I read the /var/log/messages file lineally using vim:

Jan 21 00:26:12 node108 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jan 21 00:26:12 node108 kernel: input: Power Button (FF) as /class/input/input3
Jan 21 00:26:12 node108 kernel: ACPI: Power Button (FF) [PWRF]
Jan 21 00:26:12 node108 kernel: input: Power Button (CM) as /class/input/input4
Jan 21 00:26:12 node108 kernel: ACPI: Power Button (CM) [PWRB]
Jan 20 16:27:07 node108 syslog-ng[2426]: Changing permissions on special file /dev/xconsole
Jan 20 16:27:07 node108 syslog-ng[2426]: Changing permissions on special file /dev/tty10
Jan 21 00:26:12 node108 kernel: speedstep-centrino with X86_SPEEDSTEP_CENTRINO_ACPI config is deprecated.
Jan 21 00:26:12 node108 kernel:  Use X86_ACPI_CPUFREQ (acpi-cpufreq) instead.
(…)

I was right. The log entries were clearly in a non ordered manner, and the date and time for those ones showing us reboots and a root login using the actual computing node console were, in fact, completely normal.

Conclusions

Understanding log files and how to analyse and process them is completely essential to face weird problems which, at first sight, seem really make-believe. I guess there are out there not many Systems Managers doing so, which is, I have to say, a real shame. To all of those not used to analysing logs, I encourage them to start doing that right away.