The last Friday in July (today!) is the 16th annual system administrator appreciation day, an obscure celebration started in 2000 (by a system administrator, of course) as a response to an H-P ad showing users expressing gratitude to their sysadmin for installing the advertiser’s latest printer. To my knowledge, none of us have ever gotten flowers or even donuts on “our” day, but it does remind us in the profession that our job is to keep the users happy, mostly by keeping the machines happy, but also by attending to their needs in a prompt and professional manner.
I was reminded of the event not only by notices in the discussion forums and IT email lists, but by the fact that today, the replacement memory module for our network server came, and I installed it. A simple procedure, but one that takes a fair portion of the sysadmin’s bag of tricks and tools to accomplish. Bigger shops might have a service contract with the hardware vendor, but in many cases, the sysadmin is also the hardware mechanic.
For a few months, the server, a Dell T110, has been crashing every few weeks, fortunately not while we were on our two-month grand tour, but of concern, naturally, especially because it is a virtual machine host and often has a half-dozen virtual machines running on it, which means that when the server goes down, half of our network goes with it. Virtualization is a great way to run different versions or distributions of operating systems when developing and testing software, so not too many of the virtual machines have production roles in the network, but it is still an inconvenience to have to restart all of them in the event of a crash.
A red light appeared on the front panel of the server, indicating an internal hardware condition, so it was time to check it out. First, hardware designed for use as servers (the T110 is aimed at small offices like mine) is a lot more robust than the average tower workstation you might have on your desk. Note above the heavy-duty CPU heat sink (air baffles have been removed for access to the memory modules–the four horizontal strips above the CPU fins). In addition, big computers have little computers inside that keep track of the status of the various components, like memory, CPU, fans, and disk drives, and turn on the light on the panel that indicates the machine needs service. Server memory has error-correction circuitry, as do most server-quality disk arrays, but this typically covers only a single fault at a time: a second bad bit in the same memory word, or a second failed drive in the array, will bring the system down.
System administrators depend on these self-correcting circuits and error indications to schedule orderly shutdowns for maintenance, so that the machine doesn’t crash in the middle of the workday. For most offices, this means late evening or weekend work. For 24-hour operations, like web sites, it means shifting the load to one or more redundant systems while the ailing one is repaired, so no data is lost. Companies like Dell supply monitoring software to notify sysadmins of impending problems, which is vital to operations where there is a room full of noisy servers and the admins are in a nice quiet office in the back room. In our case, with just one server, we don’t use the monitoring software regularly, but it is useful for telling us which component the red light is for; then we can look up the location in the service manual and order the right part, and hope the system doesn’t crash before it arrives.
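For the curious, here is a rough sketch of how a Linux machine can report which memory module is logging errors, without any vendor software. It is not Dell's monitoring tool. It assumes the kernel's EDAC driver is loaded for the server's chipset and exposes per-module counters under /sys/devices/system/edac/mc; the function names are my own, and the labels it prints depend on what the firmware reports.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs-style file, or 0 if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return 0


def dimm_error_counts(edac_root="/sys/devices/system/edac/mc"):
    """Collect (label, corrected, uncorrected) error counts per memory module.

    Assumes the EDAC layout mc<N>/dimm<M>/ with the files dimm_label,
    dimm_ce_count, and dimm_ue_count. Returns an empty list if the
    directory does not exist (no EDAC driver loaded).
    """
    results = []
    for dimm in sorted(Path(edac_root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        try:
            label = (dimm / "dimm_label").read_text().strip()
        except OSError:
            label = ""
        # Fall back to the directory names when no label is available.
        label = label or f"{dimm.parent.name}/{dimm.name}"
        corrected = read_int(dimm / "dimm_ce_count")
        uncorrected = read_int(dimm / "dimm_ue_count")
        results.append((label, corrected, uncorrected))
    return results


def suspect_dimms(counts):
    """Return the labels of modules that have logged any error at all."""
    return [label for label, ce, ue in counts if ce > 0 or ue > 0]


if __name__ == "__main__":
    counts = dimm_error_counts()
    for label, ce, ue in counts:
        print(f"{label}: corrected={ce} uncorrected={ue}")
    print("Modules to check:", suspect_dimms(counts) or "none")
```

A module whose corrected count keeps climbing is the one to look up in the service manual before it produces an error the circuitry cannot fix.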
Normally, businesses that are thriving and need to keep competitive in the market replace their machines at least every three years. Others, like ours, that operate on a shoestring and buy whatever resources we need for a project when we need it, tend to run machines five years or more, sometimes until repair parts are no longer available: since we run Linux, we have machines eight years or older that are still useful for running some network services.
Our server is almost five years old, so when I order replacement parts, they don’t always look like the ones we took out, or have the same specifications. For that reason, I usually take replacement as an opportunity to upgrade, replacing all of a group of components with a new set, which I did when a disk drive failed a couple of years ago. However, this time, since I’m semi-retired and don’t have a steady cash flow, I only ordered one memory module, to replace the failing one. Memory comes in pairs, so having slightly different configurations in a pair causes the machine to complain on startup, but it still runs. The “upgrade” alternative would have replaced all four modules, or at least the two paired ones, with the larger size, at a cost of $150 to $300 instead of just replacing a $50 module and putting up with having to manually restart the machine on reboot.
So, the sysadmin needs not only to keep the machines running, but to keep them running within budget, and to make sure the operating systems and hardware capabilities can support the software users need to do their jobs. If he or she is doing the job right, there won’t be any red lights in the server room, and the sysadmin will look like they aren’t doing anything…