February 10th, 2004


Oh fsck!

More days like this, and I'll have a heart attack before I turn 30. To begin with, it looks like one of the multi-processor remote cluster machines had suffered a kernel panic (the error message read, and I quote: "<0> Kernel panic: aiee, killing interrupt handler!" Glad that the people writing this stuff at least had a sense of humor) and died sometime last night. OK, I reboot it, run fsck (File System ChecK utility. Does pretty much the same thing as chkdsk in Windows), fix a few bad inodes, and get it back up and running.

Then, about an hour later, just as I finish patting myself on the back, about three different people converge on my office saying that their email/printing/local web browsing suddenly stopped working. With a sinking feeling I try pinging Condor (the central hub Linux server that the rest of our systems depend on), and confirm that it's down. Run to the server room, lug the only keyboard/monitor set in the room from the recently resurrected cluster machine to the recently deceased Condor, and get confronted with a screenful of Linux error messages which roughly translate to "it's worse than that, I'm dead, Jim". Run to our business manager's office, get my predecessor's phone number, call him and get confirmation that it is in fact OK to reboot the hub by itself and that there aren't any steps that have to be taken upon successful rebooting. Run back to the server room, reboot the system, cross every imaginable extremity, and watch with horror as the main SCSI hard drive is proclaimed to be too error-filled to mount. Run back to my office, grab my Unix book, run back to the server room, reboot the computer and start it in the Linux equivalent of Safe Mode. Then I spent the next 30 minutes running fsck on the main hard drive while reading the chapter on recovering filesystems. Luckily, due to the drive being already unmounted, I did not cause any massive damage to the system, and in fact managed to recover most of the drive, dozens of filesystem errors notwithstanding. Then another reboot, and praise be to the gods, it boots all the way to login! Run back to my office, ssh as root into condor, give permission for non-root users to log in, run through the corridors screaming "Condor's back up!" to wild cheers, and then crawl back to my office and exhale.

On one hand, bloody hell that was close. On the other hand, I managed to recover from a major emergency without screwing anything up (I hope) and have managed to maintain an appearance of competency. ^^ Now to make sure nothing important got fscked up.
  • Current Mood
    productive