Linux Uptime Reset
dman: Anyone ever been on a Unix box and see uptime reset without a reboot?
SmooveB: dman: yes
David: dman: yes
David: dman: my personal favorite is the rollover or the negative rollover
SmooveB: dman: linux in particular
dman: LINUX, you say?
David: SmooveB: fancy that, that’s where I’ve seen it too
It seems that this issue is near and dear to those involved in “the uptime project” where, I guess, geeks sit around comparing uptime size. Which is kind of juvenile. However, this bug gave me a hearty laugh:
Steps to Reproduce:
1. Boot Linux system
2. Go away for 497 days
3. Check uptime
That would be a bitch to QA!
For the pedantic, the cause is:
The Linux kernel (at least up to 2.4.2) has a flaw: It computes the result of the “uptime” based on the internal “jiffies” counter, which counts the time since boot, in units of 10 milliseconds. This is typecast as an “unsigned long” – on the Intel boxes, that’s an unsigned 32-bit number.
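To see why the reported uptime "resets", here is a minimal sketch that models the counter described above. It assumes a 32-bit unsigned tick counter and 100 ticks per second (10 ms units); the function name and constants are mine for illustration, not anything from the kernel source.

```python
HZ = 100                 # assumed: one tick every 10 ms
COUNTER_BITS = 32        # assumed: 32-bit unsigned long, as on 32-bit Intel
SECONDS_PER_DAY = 86400

def reported_uptime_seconds(true_seconds, hz=HZ, bits=COUNTER_BITS):
    """Uptime as derived from a tick counter that silently wraps at 2**bits."""
    ticks = (true_seconds * hz) % (2 ** bits)  # the wrap discards the high bits
    return ticks // hz

# Before the wrap, reported uptime matches real uptime.
print(reported_uptime_seconds(496 * SECONDS_PER_DAY) // SECONDS_PER_DAY)  # 496

# After 498 real days the counter has wrapped once, so uptime looks fresh.
print(reported_uptime_seconds(498 * SECONDS_PER_DAY))  # 77527 (about 21.5 hours)
```

The machine never rebooted; the counter simply ran out of bits and started again from zero.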
I have been researching whether this will cause us other problems, since the monitoring system graphed a load spike on the system this morning that I can’t correlate with what I saw while tailing the uptime. Anyway, from a good overview of jiffies and pre-emption timing:
The second issue is jiffie wraparound: the system uptime clock wraps around at about 497 days (32-bits holding 100 jiffies per second, 60 seconds in a minute, …). If we go to HZ=1000, the uptime clock wraps in 49.7 days. We can fix this by backporting the 64-bit jiffies we have in 2.5 to 2.4.
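The 497-day and 49.7-day figures fall straight out of the arithmetic. A quick sketch to check them, assuming a tick rate of HZ=100 (10 ms ticks) for the first case; wrap_days is a hypothetical helper name:

```python
SECONDS_PER_DAY = 86400

def wrap_days(hz, bits=32):
    """Days until an unsigned counter of `bits` bits, ticking hz times a second, wraps."""
    return (2 ** bits) / hz / SECONDS_PER_DAY

print(round(wrap_days(100), 1))    # 497.1
print(round(wrap_days(1000), 1))   # 49.7

# With 64-bit jiffies the wrap is pushed out to hundreds of millions of years.
print(wrap_days(1000, bits=64) / 365.25)
```

So going to HZ=1000 makes the 32-bit wrap ten times more frequent, while a 64-bit counter makes it a non-issue in practice.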
So, are there other foul things that can happen to us after jiffy wrap-around? (We don’t want to reboot an important box that’s been running for more than a year if we can avoid it.) There’s a discussion on the need for 64-bit jiffies and whether it might just be better to account for jiffy wrap-around in code. Denis Vlasenko suggested in 2003 that the kernel trigger a jiffy wrap five minutes after boot, to expose bugs in driver code:
“There were reports of machines hanging on jiffy wrap. This is typically a result of incorrect jiffy use in some driver.” [Denis] suggested wrapping the jiffy counter within the first five minutes of uptime. Driver writers would be bitten by their bugs, and fix them right away.
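The kind of driver bug being described is usually a plain comparison of the counter against a deadline. Here is a rough sketch of the failure and of the wrap-safe alternative, emulating 32-bit unsigned arithmetic in Python. The safe version follows the same idea as the kernel's time_after() macro (compare the wrapped difference, not the raw values), but the function names, the 400-second timeout, and HZ=100 are my own assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF       # emulate a 32-bit unsigned counter
HALF_RANGE = 0x80000000   # differences below this count as "not in the past"
HZ = 100                  # assumed tick rate

def naive_expired(now, deadline):
    """Buggy: compares raw counter values, so it breaks across a wrap."""
    return now >= deadline

def wrap_safe_expired(now, deadline):
    """Compares the wrapped difference; valid for intervals under 2**31 ticks."""
    return ((now - deadline) & MASK32) < HALF_RANGE

def initial_ticks(seconds_before_wrap=300, hz=HZ):
    """Start the counter five minutes short of the wrap, per Denis's suggestion."""
    return (-seconds_before_wrap * hz) & MASK32

now = initial_ticks()
deadline = (now + 400 * HZ) & MASK32     # 400 s timeout lands past the wrap

print(naive_expired(now, deadline))      # True: the timeout fires immediately
print(wrap_safe_expired(now, deadline))  # False: still 400 s to go
print(wrap_safe_expired(deadline, deadline))  # True: fires on time
```

Starting the counter just short of the wrap means code like naive_expired misbehaves five minutes after boot, on the developer's desk, instead of 497 days later in production.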
Fortunately, I’m not finding many jiffy wrap-around bugs in my Google searches. But then, who knows? This is where disaster recovery and best practices come in to account for the occasional fallibility of the Unix OS, and that’s how guys like me get paid.