dannyman.toldme.com


Linux, Technical

Linux Uptime Reset

dman: Anyone ever been on a Unix box and see uptime reset without a reboot?
SmooveB: dman: yes
David: dman: yes
***dman smiles
David: dman: my personal favorite is the rollover or the negative rollover
SmooveB: dman: linux in particular
dman: LINUX, you say?
David: SmooveB: fancy that, that’s where I’ve seen it too

It seems that this issue is near and dear to those involved in “the uptime project” where, I guess, geeks sit around comparing uptime size. Which is kind of juvenile. However, this bug gave me a hearty laugh:

How reproducible:
ALWAYS

Steps to Reproduce:
1. Boot Linux system
2. Go away for 497 days
3. check uptime

That would be a bitch to QA!

For the pedantic, the cause is:

The Linux kernel (at least up to 2.4.2) has a flaw: It computes the result of the “uptime” based on the internal “jiffies” counter, which counts the time since boot, in units of 10 milliseconds. This is typecast as an “unsigned long” – on the Intel boxes, that’s an unsigned 32-bit number.
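To make that concrete, here is a minimal Python sketch of the arithmetic: a 32-bit counter ticking once every 10 milliseconds, and the uptime it would imply once it has wrapped. The function name `reported_uptime_seconds` and the constant `COUNTER_BITS` are made up for illustration; this models only the overflow, not the kernel’s actual uptime code.

```python
HZ = 100            # ticks per second, i.e. one jiffy every 10 ms (from the quote above)
COUNTER_BITS = 32   # assumed width of "unsigned long" on a 32-bit Intel box
DAY = 24 * 60 * 60  # seconds in a day

def reported_uptime_seconds(true_uptime_seconds: int) -> float:
    """Uptime implied by a tick counter that silently wraps at 2**COUNTER_BITS."""
    ticks = (true_uptime_seconds * HZ) % (2 ** COUNTER_BITS)
    return ticks / HZ

# Compare real days of uptime with what the wrapped counter would show.
for real_days in (1, 496, 498, 500):
    shown_days = reported_uptime_seconds(real_days * DAY) / DAY
    print(f"real: {real_days:3d} days   shown: {shown_days:.2f} days")
```

Under these assumptions, a box that has really been up for 500 days shows an uptime of just under three days.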

I have been researching to ensure that this won’t cause us other problems, since the monitoring system graphed a load spike on the system this morning, but I can’t correlate that with my own experience tailing the uptime. Anyway, from a good overview of jiffies and pre-emption timing:

The second issue is jiffie wraparound: the system uptime clock wraps around at about 497 days (32 bits holding [100] jiffies per second, 60 seconds in a minute, …). If we go to HZ=1000, the uptime clock wraps in 49.7 days. We can fix this by backporting the 64-bit jiffies we have in 2.5 to 2.4.
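The 497-day and 49.7-day figures are easy to check. The sketch below assumes an unsigned counter of a given width ticking at a fixed rate; note that 497 days only comes out if the rate is 100 ticks per second (one jiffy every 10 milliseconds). The helper name `days_until_wrap` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def days_until_wrap(hz: int, bits: int = 32) -> float:
    """Days before an unsigned counter of the given width overflows at hz ticks per second."""
    return 2 ** bits / hz / SECONDS_PER_DAY

print(f"HZ=100,  32-bit: {days_until_wrap(100):.1f} days")   # 497.1 days
print(f"HZ=1000, 32-bit: {days_until_wrap(1000):.1f} days")  # 49.7 days

# With a 64-bit counter the wrap is no longer a practical concern.
years = days_until_wrap(1000, bits=64) / 365.25
print(f"HZ=1000, 64-bit: about {years:.2e} years")
```

Even at HZ=1000, a 64-bit counter takes hundreds of millions of years to wrap, which is why backporting it makes the uptime problem go away.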

So, are there other foul things that can happen to us after jiffie wrap-around? (We don’t want to reboot an important box that’s been running for more than a year if we can avoid it.) There’s a discussion on the need for 64-bit jiffies and whether it might be better simply to account for jiffie wrap-around in code. Denis Vlasenko suggested in 2003 that the kernel trigger a jiffie wrap five minutes after boot, to expose bugs in driver code:

“There were reports of machines hanging on jiffy wrap. This is typically a result of incorrect jiffy use in some driver.” [Denis] suggested wrapping the jiffy counter within the first five minutes of uptime. Driver writers would be bitten by their bugs, and fix them right away.
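Here is a rough Python model of the kind of driver bug an early wrap would flush out, and of what “accounting for wrap-around in code” can look like. The signed-difference comparison is the same idea as the kernel’s `time_after()` macro, but everything here is an illustrative sketch: the 10-minute timeout, `naive_expired`, and the 100 Hz tick rate are assumptions, not code from any real driver.

```python
MASK32 = 0xFFFFFFFF  # emulate a 32-bit unsigned counter in Python
HZ = 100             # assumed tick rate: 100 ticks per second

def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed two's-complement number."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def time_after(a: int, b: int) -> bool:
    """True if tick a is later than tick b, even across a wrap.

    Only valid while the two ticks are less than 2**31 apart.
    """
    return to_signed32(b - a) < 0

def naive_expired(now: int, deadline: int) -> bool:
    """The buggy pattern: compare raw counter values directly."""
    return now >= deadline

# Start the counter five minutes before it wraps, as in the suggestion above.
start = (-300 * HZ) & MASK32
# A hypothetical driver sets a 10-minute timeout, so the deadline lands past the wrap.
deadline = (start + 600 * HZ) & MASK32

one_minute_in = (start + 60 * HZ) & MASK32
eleven_minutes_in = (start + 660 * HZ) & MASK32

print(naive_expired(one_minute_in, deadline))   # True: fires nine minutes early
print(time_after(one_minute_in, deadline))      # False: still waiting
print(time_after(eleven_minutes_in, deadline))  # True: really expired
```

The naive comparison misfires as soon as the deadline crosses the wrap, while the signed difference keeps working as long as the two values are less than half the counter range apart.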

Fortunately, I’m not finding many jiffie wrap-around bugs in my Google searches. But then, who knows? This is where disaster recovery and best practices come in to account for the occasional fallibility of the Unix OS, and that’s how guys like me get paid.


Responses

November 20th, 2007

Dawid van der Merwe

Hi Danny,

I have a similar issue to what you have described in “Linux Uptime Reset” (http://dannyman.toldme.com/2006/08/16/linux-uptime-reset/).

My problem however is that I use snmp to plot graphs for Uptime and those graphs are not correlating to the actual uptime when checked on the physical host. I was wondering if it is possible that the snmp could be resetting every week and what would be needed to identify it as such or to be able to work around it?

Thanks,
Dawid vd Merwe

April 15th, 2008

uptime wraps again « Don’t fear the penguin

[…] on an old system, the ‘uptime-reset-without-reboot’ phenomenon showed up again (it happens to me every 500 […]
