Our current server, a dual-Xeon machine with a RAID10 array of four 15,000-rpm SCSI disks, has been up for almost a year. Unfortunately we started to see random lockups a few weeks ago, killing our uptime as we had to reboot to keep it running; you might even have noticed that we were offline for a few minutes here and there. Naturally we examined the logs, only to find that the problem couldn’t be easily diagnosed: it could have been software, but a hardware defect was just as likely. I drove up to the ISP last Friday to examine the server up close and couldn’t find anything wrong with it until I powered it off and back on again. Suddenly neither the four SCSI disks nor the Adaptec SCSI RAID controller showed up during boot.
Fig 1. Our web server, featuring dual Intel Xeons, 2GB of memory and a RAID10 array.
At moments like these you just stare at the screen and think "they were working just a minute ago, gimme a break here", but Murphy’s Law is unforgiving as always, and neither the controller nor the disks would budge. The server was built from new parts and painstakingly assembled, with no part placed or handled without an anti-static wristband, and it had been running for just over a year, so we were probably looking at a manufacturing defect. Getting a replacement for the SCSI controller wasn’t going to be easy though, as we don’t keep spares, and finding a new one on a Friday would be no small task either. So what do you do? You try to make it work again. I took the controller out of its slot, cleaned the PCI connector with a cotton cloth and some alcohol, and did the same to the PCI riser card it uses to connect to the motherboard.
After putting it all back together I powered the server up, fingers crossed and thinking "c’mon baby, you can do it", and fortunately the SCSI controller booted its kernel, found the disks and proceeded to boot the Linux OS. Naturally you need a backup plan in case something like this happens. We could have switched to another server at another location, but that would have meant many hours of downtime while DNS changes propagated around the world. So to prepare for the worst we’ve now built a backup server and will be hooking it up to a fail-over network switch, letting us switch between the two servers remotely without having to change the DNS. We’ll have more details on the construction of that new server and the switching between the two in an upcoming article, to be posted after our Computex coverage.