Sleep deprived, that's how I'd best describe my current condition, but with a feeling of deep satisfaction and relief. Add to that my renewed faith in Murphy's Law and I think you get the picture: I had a rough night, but all's well that ends well. We've had little mishaps in the labs before, but I wasn't really prepared for the obstacle we came across yesterday. As you may have noticed, our new webserver has had a few hiccups over the past month or so. We had been up and running without a single problem since we brought it online
in November of last year, but last month we started to get NMI errors (reason 21) that would eventually result in a server crash. We obviously went over the error logs and tried to determine the cause. We diagnosed it as either an ECC memory error or some NIC glitch, neither of which can easily be fixed remotely, as a bad DIMM simply needs to be replaced and the NIC is integrated on the motherboard. Unfortunately you can't simply take a live webserver offline and disassemble it without some downtime, so yesterday I decided to drive up to the datacenter
to see whether some sort of odd hardware failure was causing these crashes.
Fig 1. The dead webserver with dual Tualatins at 1.26GHz and four WDC1200JBs in RAID10.
Having arrived there I found the webserver non-responsive; I couldn't get in from either the console or through ssh. I was basically staring at a login screen that wasn't registering any of my keystrokes. I powered the webserver down by simply flipping the power switch, and with that took the last bit of life out of it, as I never managed to power it back on again. I decided to pack it up and take it with me for further analysis and, hopefully, a quick revival, so I could put it back in the rack and get Hardware Analysis back online. But I quickly became convinced that Murphy's Law was at work here and the powers that be had decided otherwise for our trustworthy webserver. I swapped out the memory and the processors and tried everything possible to get it to POST, but all it did was turn on the NIC lights and spin its fans at me, and that was it.
So here I was, looking at a dead webserver containing all of our website data on its IDE RAID10 array. We have backups of course, but the last-minute user comments and other content would not be included in them, so worst case we'd be setting things back by at least a week. I wanted to see if I could salvage at least part of the content, and if possible the entire RAID array, so we could be off to a quick start once I copied the data over to the new webserver we'd already been working on. That part of the process actually went smoothly, smoother than I expected even: after installing RH9 on a new machine, connecting the disks to a similar IDE RAID controller, and mounting the array, we could simply restore the array and copy all the important data over to a safe location with little effort.
Fig 2. Recovering the RAID10 array; this went smoothly with very little effort.
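For those curious what a recovery like this boils down to, the sketch below lists the kind of steps involved. The device name, mount point, and paths are assumptions for illustration only, not the exact ones on our rescue machine, and the script merely prints each command as a dry run rather than executing anything:

```python
# Dry-run sketch of recovering data from a relocated RAID array.
# All names below (device node, mount point, destinations) are
# hypothetical; adapt them to the actual controller and layout.

ARRAY_DEV = "/dev/ataraid/d0p1"   # array as exposed by the IDE RAID driver (assumed name)
MOUNT_POINT = "/mnt/oldserver"
DEST = "/backup/site-data"

RECOVERY_STEPS = [
    # Mount read-only so nothing on the salvaged array can be modified.
    f"mount -o ro {ARRAY_DEV} {MOUNT_POINT}",
    # Copy the web documents, preserving ownership and permissions.
    f"cp -a {MOUNT_POINT}/var/www {DEST}/",
    # Copy the database files; the old server is down, so they are consistent.
    f"cp -a {MOUNT_POINT}/var/lib/mysql {DEST}/",
    f"umount {MOUNT_POINT}",
]

def dry_run(steps):
    """Print each step instead of running it, so the plan can be reviewed."""
    return ["+ " + step for step in steps]

for line in dry_run(RECOVERY_STEPS):
    print(line)
```

The key point is the read-only mount: when the disks are your only copy of the data, you want the first pass over them to be incapable of writing anything.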
Then we proceeded to set up the new server with RH9 and configure it similarly to the old one. Vitaliy and I worked for most of the night restoring all of Hardware Analysis from the RAID array while installing and optimizing the new server at the same time. The result is obvious, as otherwise you wouldn't be reading this page, or any other page on Hardware Analysis. Most of today was spent fixing bugs and making sure everything was up and running and configured properly. After we were confident that everything was alright, I turned my attention to finding the cause of the server crash. It quickly became apparent that the power supply had been acting up and had gradually deteriorated, ending in one last power surge that wiped out the motherboard, memory, processors and everything else but the RAID array and the CDROM.
Fig 3. The new webserver, featuring dual 2.8GHz Xeon processors and four Seagate Barracudas in RAID10.
Needless to say I'll be calling the manufacturer this week to see whether they're willing to pay for any of the damages, as both the case and the power supply are still under warranty and this is something you don't expect from a reputable manufacturer. What we've learned from this is simply that picking your parts carefully is important, and having recent backups is essential. If we had used a single disk instead of a RAID10 array, chances are much greater we'd have lost most, if not all, of the content and would have needed to fall back on the backup. In this case we had a RAID10 array of four drives, comprising two mirrored sets of disks with the same data, which could be extracted and restored to a fully working site, current to the very moment the system stopped working.
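The layout that saved us is easy to picture: a four-drive RAID10 stripes data across two mirror pairs, so every block lives on exactly two drives, and as long as one drive in each pair survives, everything is still readable. A toy simulation (just an illustration of the redundancy, not how any real controller works):

```python
# Toy model of a four-drive RAID10: even-numbered blocks go to mirror
# pair (0,1), odd-numbered blocks to pair (2,3), each written twice.
def write_array(blocks):
    """Return four 'drives' (dicts of stripe -> block) holding the data."""
    drives = [dict() for _ in range(4)]
    for i, block in enumerate(blocks):
        pair = (i % 2) * 2                # which mirror pair this block stripes to
        stripe = i // 2                   # position within that pair
        drives[pair][stripe] = block      # primary copy
        drives[pair + 1][stripe] = block  # mirror copy
    return drives

def read_array(drives, n_blocks, failed=()):
    """Reassemble the data, falling back to the mirror when a drive is dead."""
    out = []
    for i in range(n_blocks):
        pair, stripe = (i % 2) * 2, i // 2
        for d in (pair, pair + 1):
            if d not in failed:
                out.append(drives[d][stripe])
                break
        else:
            raise IOError(f"block {i} lost: both drives of pair {pair} failed")
    return out

data = ["user-comments", "articles", "forum-posts", "config"]
drives = write_array(data)
# Lose one drive from each mirror pair and the array is still fully readable:
assert read_array(drives, len(data), failed={0, 3}) == data
```

Lose both drives of the same pair, though, and half the stripes are gone for good, which is why RAID is a complement to backups, not a replacement for them.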