Good, fast, or cheap: pick any two. A few years ago, I decided to build a webcam rather than buy one; they were about $100, plus whatever monthly service charge for hosting the link on the cloud. I’m not sure I beat the cost, quality, or speed, but it’s kept me actively managing the system. Instead of a plug-n-play wifi-enabled little module, I have a rats-nest of wires, USB hubs, USB external disk drives, Raspberry Pi with external camera on a ribbon cable, and, now, extension cords and 50-foot CAT5e cable. About once a year, I wear out the flash drive that the system runs from, so there is some on-going cost. Plus, much coding in Python and Bash, a distributed network system to process the video, cron jobs, an API key and code to get weather information and sunrise/sunset times to turn the camera on and off.
Meanwhile, the landscaping has grown up around the office window, so the camera sees mostly flowers and bees (left view). So, I moved it to the office closet, which was not so simple. 1) Being “cheap and fast,” the software wasn’t very “good,” so I had to modify the Python code to provide a way to restart the system during the day without losing all the footage: the system keeps a week’s worth of data, and erases last week’s when starting a new day. This also entailed generating images with a timestamp, rather than a simple index, as the camera software libraries start indexing at 1 each instance.
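A minimal sketch of the restart-safe behavior described above, not the actual webcam code: it assumes one folder per weekday, a hypothetical `.date` marker file recording which date the folder holds, and a hypothetical `img_YYYYMMDD_HHMMSS.jpg` naming scheme. A restart on the same day leaves the folder alone; only a folder left over from last week is emptied.

```python
from datetime import date, datetime
from pathlib import Path


def frame_name(now: datetime) -> str:
    """Name a photo by its timestamp, so a restart never reuses an index."""
    return now.strftime("img_%Y%m%d_%H%M%S.jpg")


def prepare_day_dir(root: Path, today: date) -> Path:
    """Return today's weekday folder, erasing it only if it holds last week's photos."""
    day_dir = root / f"day{today.weekday()}"   # day0 = Monday ... day6 = Sunday
    marker = day_dir / ".date"                 # hypothetical marker: date the folder belongs to
    stamp = today.isoformat()
    if day_dir.exists() and (not marker.exists() or marker.read_text() != stamp):
        # Folder is left over from a previous week: clear the old photos.
        for old in day_dir.glob("*.jpg"):
            old.unlink()
    day_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text(stamp)
    return day_dir
```

Called at every startup, `prepare_day_dir` makes a mid-day restart harmless, and `frame_name` keeps new photos from overwriting earlier ones.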
OK, that’s done, and the system retested, bugs fixed, etc., which ended up losing most of a couple of days’ surveillance: “cheap” means not having a second system for development and test, and “fast” means not doing a proper code review before testing, which leaves “good” out of the equation.
Of course, nothing ever goes smoothly: after moving the computer/camera, the USB hub and disks into the closet, we weren’t getting communication with the processor. So, we dragged everything out next to the desk so we could hook up a console (keyboard, mouse, and monitor) to the computer, then retested with the original ethernet cable, then with the long one. Everything worked, inexplicably, since nothing really changed except having the console hooked up. We unhooked the console, moved everything, still running, back into the closet, then adjusted the camera view, and we were done–except for resetting the key agent so the computer could talk to the video processing computer.
Our program takes a photo every 10 seconds, updated to the web server, then assembles a timelapse video once an hour, showing one hour in 30 seconds. After letting the revised program run for a couple hours, we checked the logs and directories: still showing last week’s video. Aha. The video compositor program needs a numerical sequence for the images in order to assemble a video: the timestamp doesn’t meet specification. So, back to the drawing board, rewrite the Bash script on the video processing computer to renumber the files in a format the video assembly utility understands. Success at last. The system is now fully functional, but made a bit more complex by the simple addition of a restart ability.
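The renumbering step can be sketched as follows. This is an illustration in Python, not the Bash script that actually runs on the video processing computer, and the file names (`img_*.jpg` in, `frame_00001.jpg` out) are assumptions. It also works out the frame rate implied by the numbers above: 360 photos per hour shown in 30 seconds is 12 frames per second.

```python
import shutil
from pathlib import Path

SECONDS_PER_PHOTO = 10
FRAMES_PER_HOUR = 3600 // SECONDS_PER_PHOTO   # 360 photos in an hour
FRAME_RATE = FRAMES_PER_HOUR / 30             # one hour shown in 30 seconds: 12 fps


def renumber_frames(src: Path, dst: Path) -> int:
    """Copy timestamp-named photos into dst as frame_00001.jpg, frame_00002.jpg, ..."""
    dst.mkdir(parents=True, exist_ok=True)
    for stale in dst.glob("frame_*.jpg"):     # drop frames left from the previous run
        stale.unlink()
    # Zero-padded timestamps sort chronologically as plain strings.
    frames = sorted(src.glob("img_*.jpg"))
    for index, frame in enumerate(frames, start=1):
        shutil.copy2(frame, dst / f"frame_{index:05d}.jpg")
    return len(frames)
```

Copying doubles the disk traffic for each hour's batch; symbolic links would avoid that if the video utility follows them.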
So, not fast, not good, and not cheap, when you consider the effort put into a custom, one-of-a-kind system. But, it keeps me in practice coding and designing. And, because it runs on Linux, I can keep the security patches current: many purchased plug-and-play “appliances” have their code burned in at time of manufacture, and may be designed around already obsolete and buggy software. My little system has undergone several major upgrades of the Debian Linux distribution core system (Linux kernel 4.9.35, patched 30 June 2017: latest release is 4.12) and gets regular security patches and bug fixes. That’s even newer than my primary laptop (Kernel 3.13.0, patched 26 June 2017). Considering all the little Raspberry Pi machines scattered around the house, it may be prudent to work on configuring them for diskless boot, in order to preserve the flash memory chips on-board.
As the resident computer “guru emeritus” in our family, I often get questions from family members about computers, particularly computer security. I’m not a Windows expert by any means, though I was briefly a Windows NT sysadmin in the mid 1990s and the Unix and GNU/Linux systems for which I was responsible had to coexist with, but remain independent from, Windows Server Active Directory domains throughout the first decade of this century. As the latest hacker disaster to befall the Windows world sweeps across the planet, I got this request from a cousin:
I was wondering whether you had any advice for us Microsoft PC users and the cyber attack which they predict is rolling our way. We don’t do online banking or bill-paying. We do have a lot of pictures and documents. Most of the pictures I have on a flash drive. Do you think they will only hit the institutions? Sounds like Europe was not prepared and was operating on an old system. Hopefully our country has a “heads up” to protect our government institutions, airports and banks.
We haven’t fired up our two Windows 10 instances since the news (one is Judy’s new laptop, which runs Linux from a thumb drive “all the time,” the other is a refurbished desktop we only use for TurboTax). But, when we do, the first thing will be to grab the security patches from M$FT.
1) Always install Microsoft updates as soon as they are released.
2) Any machine that is directly connected to the Internet (i.e., plugged into your DSL or cable modem instead of wifi or a router) is in immediate danger. So is any machine for which the router firewall is turned off or for which port forwarding is turned on for vulnerable ports. The “bad guys” use bots that scan the entire Internet looking for open ports to penetrate. The machine that handles our webcam has port 8080 (redirect to 80 internally) and 22 (secure login for me to access our systems remotely) open: the logs show hundreds of break-in attempts every day. Naturally, we limit access to accounts that present known secret encryption keys, and don’t write web applications vulnerable to code injection. Once an attack has gained access to an internal network through any machine, all the machines behind the firewall are vulnerable. We got hacked last year because I reinstalled the system and didn’t disable the default accounts before putting it back on the network. It only needs a few minutes exposure to be compromised, with the observed rate of attacks.
3) Downloaded programs, including mislabeled email attachments or web links, can deliver malware that will corrupt your machine: the ransomware currently in the news can get in through an open port without any help from the user, but also through “Trojans” (files that look like something you want or look innocent but aren’t). A firewall won’t help if you invite them in. The most common attacks are notices that appear to be from your bank or credit card company or utility provider that require you to open an attachment or click on a link to see the notice or respond. Since modern email apps and web browsers tend to hide the full header or complex URL it is very difficult to tell which ones are fake–misspellings and vague, non-explicit wording in the text are tell-tale, but the safe way to address these is to login to your account through the browser instead of the link in the message to check if it is legitimate.
4) Linux, OS/X, and iOS are much less vulnerable, as they are inherently more secure and a minority target (except for servers and routers, which is why our Linux gateway gets attacked so much). Security upgrades are much more promptly distributed, as well. Android devices, which are Linux-based, but tend not to be updated regularly, have become vulnerable. Older routers may also be vulnerable: make sure that external login/configuration is disabled. Newer routers may be configured for automatic upgrades, but still should not allow external login.
5) As always, good passwords are essential. Don’t use non-HTTPS web sites from a public wifi access or one that uses a web-page login rather than a wifi connection password. Anything that is convenient or intuitive is probably not safe. [See #9 below for more detail]
6) If you must use Windows, do keep up your virus protection subscriptions, even though the worst attacks may be undetectable.
7) If you don’t already do so, buy a USB hard drive larger than your computer hard drive and back up your computer regularly, or subscribe to a cloud service for your important files–photos and documents. Even if you don’t get hacked, hard drives have a half-life of about 3-5 years and fail with alarming frequency. Fans die and fry your machine, too: even if the hard drive is still OK, professional file recovery is expensive (an external drive dock compatible with your hard drives is a good investment if you know how to use it). Keep in mind that laptop hard drives are probably encrypted, so can’t be recovered easily if removed from the computer.
8) Just say “no” to Microsoft… I know, almost impossible. We use iOS (iPad, iPhone) and Linux exclusively for Internet use, but still need to fire up Windows now and then and put them on the Internet for Microsoft and other vendor updates, and file taxes, so we share the same dread as everyone else, plus the other burdens of keeping servers and web apps secure.
More…
9) As the WannaCry ransomware plague becomes better revealed, it appears that the primary attack is through the file-sharing protocol used by Microsoft, SMB, or Server Message Block. If you have enabled file sharing between computers or inadvertently have the service running even if you don’t connect with other computers on your network, you are vulnerable until patched. Even if your network is secure, i.e., you connect through a router and the firewall is turned on, using a laptop at a public access site can expose you. Needless to say, your own WiFi router needs to have a strong WPA2 password. If you have old equipment that uses WEP or no security, upgrade or reconfigure your network now. Even if guest networks (motels, restaurants, coffee shops, businesses, etc) have WPA2, you may be exposed to attack by other users (or compromised equipment) on the network. If in doubt, use your smartphone’s data plan on the cellular network instead of your laptop or wifi on your hand-held.
10) The latest information on computer exploits, although technical, is always available on http://www.US-CERT.gov, the United States Computer Emergency Readiness Team, a branch of Homeland Security. This site will have information on severity, what systems are affected, and links to security fixes.
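For the technically inclined, item 9 can be checked directly: SMB file sharing listens on TCP port 445, so a machine that accepts connections on that port is running the service. A minimal sketch using only the Python standard library; the function name is made up for illustration, and an open port means only that the service is reachable, not that it is unpatched.

```python
import socket

SMB_PORT = 445  # TCP port used by Microsoft file sharing (SMB)


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "open" if port_is_open("127.0.0.1", SMB_PORT) else "closed"
    print(f"SMB port {SMB_PORT} on this machine is {state}")
```

Only run a check like this against machines you own.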
Lastly, if you are hacked, the only recourse is to wipe the disk, reformat, and reinstall the operating system and restore your backed-up data files. In the event you don’t have a backup, it may be possible for a file recovery service technician to boot your machine into a safe operating system (like Linux) from an external USB drive, mount the drive as data only and recover your data files (if the drive is not corrupted or encrypted by the attack), but it is generally not possible to reliably remove the attacker’s files and restore the operating system without a complete wipe/reinstall. If the attack is ransomware, the data is not recoverable without the attacker’s decryption key. Even if you pay the ransom, you may recover your data, but the disk needs to be wiped and reformatted and not placed back on a network until the security fixes have been applied.
Afterword:
If you are curious about the concept of ransomware, hacking in general, and enjoy a good read, check out Neal Stephenson’s novel “REAMDE,” a techno-thriller about ransomware that attacks users of an on-line multi-user game. The characters include a credit-card thief (briefly), the game designer, Russian mafia, the Chinese hacker, and a Polish white-hat hacker, and the action flows from Seattle to China, Canada, and Montana. Warning: heavy on computer and gaming cultural references. Neal knows his stuff–it’s all realistic tech, if fantastic and wacky.
A while back, we wrote about rebuilding a crashed Raspberry Pi system. In the course of reinstalling the system (on a new chip–the old SD card that contains the operating system had “worn out”), we had made a fatal slip. This system happens to be our gateway system, i.e., connected directly to the Internet to provide us access to our files and some web services while out of the office. Unfortunately, this also provides the opportunity for the world-wide hacker community to try to break in.
Normally, we have safeguards in place, like restricting which network ports are open to the outside and which machines and accounts are allowed login access. However, in our haste to get the new system up and running as quickly as possible, we connected the device to the Internet to download upgrades before the configuration was complete, meaning the system was exposed without full protection for several hours to several days.
One hour’s worth of break-in attempts by hackers. Note that attempts to access system accounts (root, pi) are denied because we don’t allow external logins for these accounts. The accounts that are allowed are restricted to public-key authentication (basically, a 1700-character random password). Attempts on this one-hour snapshot come from four different sources: Porto, Portugal; Shanghai and Baoding in China; and Tokyo, Japan (possibly hacked machine, as the name is spoofed).
Now, our security logs record, on a normal day, hundreds of break-in attempts (see screenshot above). We aren’t the Democratic National Committee or Sony, just a small one-man semi-retired consulting business. But, the hackers use automation: they don’t just seek out high-value targets, they scan the entire Internet, looking for any machine that isn’t fully protected. If they can’t steal data or personal information, they will use your machine to hack other machines. If you use Microsoft Windows, you are undoubtedly familiar with all sorts of malware, as there are many tens of thousands of viruses, trojan horses, adware, ransomware, and other malevolent software that invades, corrupts, and otherwise takes over or cripples your machine. Unix systems are less susceptible to these common attacks, but, if an account can be compromised, or a bug in the login process exploited, eventually a persistent attacker can gain system privileges and install a ‘rootkit,’ a software package that replaces the common monitoring and logging software, redirecting calls through the rootkit, which hides its existence and activities from the reporting tools, even the directory listing utilities.
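A rough way to tally such attempts from the login log, sketched in Python. The sample lines below are invented for illustration (the addresses come from the ranges reserved for documentation), and the pattern assumes OpenSSH-style "Failed password" and "Invalid user" messages; the exact wording varies with the SSH server version and configuration.

```python
import re
from collections import Counter

# Assumed OpenSSH-style messages; adjust the pattern to match your own logs.
ATTEMPT = re.compile(
    r"(?:Failed password for (?:invalid user )?|Invalid user )(\S+) from (\S+)"
)


def tally_attempts(lines):
    """Count rejected login attempts by source address and by account name."""
    by_source, by_user = Counter(), Counter()
    for line in lines:
        match = ATTEMPT.search(line)
        if match:
            user, source = match.groups()
            by_user[user] += 1
            by_source[source] += 1
    return by_source, by_user


# Invented sample lines, not taken from the real logs.
SAMPLE = [
    "Jul  1 14:00:03 gateway sshd[2101]: Failed password for root from 203.0.113.7 port 40022 ssh2",
    "Jul  1 14:00:09 gateway sshd[2103]: Invalid user admin from 198.51.100.23",
    "Jul  1 14:00:15 gateway sshd[2107]: Failed password for invalid user pi from 203.0.113.7 port 40031 ssh2",
    "Jul  1 14:01:40 gateway sshd[2111]: Accepted publickey for alice from 192.0.2.10 port 51514 ssh2",
]

if __name__ == "__main__":
    sources, users = tally_attempts(SAMPLE)
    for source, count in sources.most_common():
        print(f"{source}: {count} attempts")
```

On a real system the lines would come from the authentication log file rather than a list, but the counting is the same.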
Once an attacker takes over a Unix or Linux machine, there is no limit to the damage they can do on the Internet, as Unix/Linux is the basis of most of the servers on the Internet, and the machine can become a SPAM server, web-spoofer, or hacker-bot itself. (Microsoft Windows Server has the rest, nearly half, and they are even easier to break into.) I began to suspect this might have happened when normal system functions failed to terminate or run correctly. We have a lot of custom software built on this machine, which runs on a scheduler. The machine got slower and slower, and it was apparent that the jobs run by the scheduler were never exiting, filling up the process table with jobs that weren’t doing anything, except taking up space, which is always at a premium. Clearing out all the jobs, restarting the machine, and starting the processes manually worked–until about 2:00pm. Very suspicious: a rootkit, once installed, can repair or re-install itself even if the administrator restores many of the co-opted command files by normal upgrades or by a conscious attempt to recover from the intrusion.
The main problem seemed to be with /bin/sh, the system shell, which is actually /bin/dash, a shared object. Cron, the scheduler, uses dash to run the jobs, where the normal user login shell uses /bin/bash, a non-linkable executable shell with similar functionality. A rootkit is generally constructed as a filter, wrapped around the co-opted commands, so it would be easier to link to the *real* /bin/dash in an undetectable manner from the filter program than it would to wrap /bin/bash. In this case, assuming an intrusion was the cause, something went wrong, rendering dash non-functional. Perhaps the intrusion was not compiled for the ARM processor used by Pi, though most of a rootkit would be scripted to be portable among different CPU architectures and Unix/Linux versions.
An analogy to the problem would be like finding out who let the horses out: it is easy to identify wolf or horse-thief tracks outside the barn when the door is barred, but, if you left it open and the horses have bolted, it is more difficult to find out what happened–the traces are covered. I did install some intrusion-detection software, but running it after the tracks are covered over is usually a futile effort. However, there were enough questionable traces to warrant taking corrective action. Besides, even if the problem had been caused by some inadvertent misconfiguration on my part (unlikely, considering the fact that the machine could be made to run for several hours before the problem reasserted itself), the solution was clear: reinstall everything.
The first step is to back up the data, including configurations. Now, this is not just an ordinary computer: Raspbian, the Debian-based operating system distribution designed for the Raspberry Pi computer, comes with a simple desktop intended to introduce new users to Linux. But, this machine doesn’t use the desktop and is not even connected to a monitor most of the time: it is an internet gateway, web server, and custom webcam driver, so has a lot of “extras,” both loaded from software repositories and written especially for this installation. Backups are important, since much of the software only exists on this machine. And, since we only have one camera, fail-over isn’t possible without physically moving the camera from one machine to another, not a trivial exercise, as the connection is on the motherboard rather than an external connector.
Now comes the glitch: Since the introduction of the Raspbian operating system, it has been based on Debian 7; but, since Debian 8 was recently released, a new version of Raspbian is also available. So, the machine was rebuilt with Raspbian “Jessie”, replacing Raspbian “Wheezy” (the releases named after Toy Story characters rather than just the numbers–as with Apple OS/X, Debian releases tend to have names in addition to release numbers). Installation on Raspberry Pi is not like other computers. Since there is no external boot device, the operating system “live” image is loaded onto the SD card that serves as the boot device and operating system storage. Initial configuration is best done without a network connection, since the startup password is preset and well-known.
So, avoiding that mistake (booting on a network with the default passwords, the single most preventable source of hacker intrusions), we booted with the network cable disconnected and a monitor and keyboard attached, changed the password and expanded the system to fill the SD card, set up the other user accounts, then shut down the system, removed the SD card, and mounted it on the backup server to finish transferring vital data, like the security keys and system security configurations. In larger systems with a permanent internal boot drive, such “hacker-proof” installation is done on an isolated network, but, since the boot drive on a Pi is removable, it is easy enough to edit the configuration files on another system.
So, with the system fairly well hardened by securing the system accounts and user accounts, it was rebooted attached to the network and the system upgrades and extra software packages (like the web server) installed. So far, so good. But, since we upgraded the operating system, the server packages were also upgraded, most notably moving from the webserver, Apache, version 2.2 to version 2.4. Apache has been the predominant web server software on the Internet for 20 years, so it is in a constant state of upgrade, for security and feature enhancements. Between version 2.2 and 2.4, many changes to the structure of the configuration files were made, so that not only did the site configuration need to be restored manually, but there was a fairly steep learning curve to identify the proper sequence and methodology by which to apply the changes.
Then, of course, there were the additional Python modules that needed to be installed to support the custom software, which involved downloading and compiling the latest versions of those, since Python 2 also upgraded from version 2.7.3 to 2.7.9 (we haven’t yet ported the applications to Python 3, which moved from version 3.2.3 to 3.4.2 between Debian 7 and Debian 8). Finally, there were other tweaks, like comparing system configuration files to update group memberships for access to the camera hardware, loading the camera drivers, and setting file ownership and permissions for data and program files.
We could have saved most of this by sticking with Raspbian Wheezy, but eventually, support for older systems goes away, and the newer systems are usually more robust and faster: open source software evolves rapidly, with new minor releases every six months and new major releases every two years for most distributions, and an average life span of five years for maintenance of major releases and a year for minor releases. As we said before, Linux is free, if your time is worth nothing. The price of keeping current is constant maintenance. Patch releases occur as they are available, with maintenance upgrades almost daily.
Finally, after a week of tweaking and fiddling, the webcam service is back up and running. And, the security logs show break-in attempts every few minutes, from multiple sites all over the world (one from Portugal recently, others from unassigned addresses–ones with assigned addresses undoubtedly come from computers that have been compromised and used as hacking robots, as hackers don’t want to be traced back to their own computers, ever).
So, the moral of this post is: don’t ever expose a stock, unmodified computer system directly on the Internet (which is difficult to do, when all upgrades, new software, etc is available only through download from the Internet–which should be accomplished only from behind a proven firewall). But, you can set passwords and change system accounts before joining a network. And, if your computer is hacked, take it to a professional, and don’t grumble about the cost or time it takes to restore it. Pay for malware scanning software and keep your subscription up to date, as well as scheduling upgrades on a regular basis. And, if you are a professional, don’t take shortcuts (i.e., install and configure off-net or behind a firewall), keep good backups, install intrusion-detection software early, and check for security upgrades daily. Change the default passwords immediately, and create a new, weirdly named administrative user, and deny external logins for all administrative users. Use two-factor authentication and public-key encryption on all authorized user accounts. They are out there, and they are coming for your computer, even if you don’t have data worth stealing: they can use your computer to spread SPAM or steal data from someone else.
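Some of that checklist can be verified before the network cable goes in. The sketch below reads the text of an SSH server configuration and flags three of the points above, using the standard OpenSSH directives `PermitRootLogin`, `PasswordAuthentication`, and `AllowUsers`. The function names and the wording of the findings are made up for illustration, and the parser is deliberately simple: it ignores `Match` blocks and included files.

```python
def parse_sshd_config(text: str) -> dict:
    """Collect 'Keyword value' pairs from sshd_config-style text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0].lower()                   # keywords are case-insensitive
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(key, value)          # sshd uses the first value it finds
    return settings


def audit(settings: dict) -> list:
    """Return a list of findings; an empty list means all three checks passed."""
    findings = []
    if settings.get("permitrootlogin", "").lower() != "no":
        findings.append("root login is not explicitly disabled")
    if settings.get("passwordauthentication", "").lower() != "no":
        findings.append("password logins are not explicitly disabled")
    if "allowusers" not in settings:
        findings.append("no AllowUsers list restricts which accounts may log in")
    return findings


if __name__ == "__main__":
    # Path is the usual Debian/Raspbian location; adjust for other systems.
    with open("/etc/ssh/sshd_config") as config:
        for finding in audit(parse_sshd_config(config.read())):
            print("WARNING:", finding)
```

Since the Pi's SD card can be mounted on another machine, the same check can be pointed at the configuration file on the card before the system is ever booted on a network.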
The new, $9 CHIP computer, which comes complete with WiFi and installed Linux OS (mouse, keyboard, power supply, and monitor not included, of course): running “headless” as just another network appliance, along with the five nearly as small Raspberry Pi computers and numerous virtual machines.
The last few months, our articles have focused on our bicycle adventures, notably, the preparation for, launch of, and, ultimately, termination of our planned four-month expedition from Florida up the east coast. We arrived home just less than two months after departing, and just in time to perform some much-needed maintenance on the Chaos Central computer network.
As the title of this blog indicates, we have, for the last 25 years or so, depended on Unix, Solaris, and Linux for both our livelihood and, of course, to operate our in-house network. The majority of our systems run GNU/Linux, in various distributions: Ubuntu and Mint Linux on desktops and laptops, CentOS on the server and virtual machines, and Raspbian on the collection of Raspberry Pi micro-machines (and the above CHIP nano-computer) that are rapidly becoming the backbone of the home network.
GNU/Linux is very stable: we have, in the past, run systems for up to two years without a reboot–and then only because we suffered a major power outage. But, with a collection of systems, something is bound to go wrong. First, less than a month after we left on our trip, a power surge that made it through the power conditioner/battery on the server took out the virtual machine that renders the timelapse videos from our driveway surveillance system. We actually didn’t notice this until I went to review the current day’s timelapse progress and found the video was an hour out of date. Ah, this was because I had programmed a failover plan into the system: the videos were now being rendered by the Raspberry Pi cluster in the basement, much, much more slowly on a 32-bit ARM single-core processor with 512 MB RAM than on the Intel Xeon quad-core processor with 8GB RAM in the virtual machine host.
The server rebooted without incident when we got home: it actually didn’t reboot when the power hit came, but had an error that locked up the processor, an unusual condition. Had we not been headed for home at the time we discovered it, we could have instructed our house-sitter in how to cycle power and bring up the system.
Then, a couple of weeks after we returned, the surveillance system, which is also the remote login gateway, simply stopped, which would have been a show-stopper had we not been at home. We happened to have a spare Raspberry Pi, one that had seen duty as a print server and scanner server before we got a new WiFi-enabled printer/scanner. It took a couple of hours to add the necessary software packages to run the camera and web server and configure the machine to perform all of the necessary duties of the old one, including limiting access to specific machines and login accounts, and we were back in business–for a while. A few days later, the external disk drive that we use to store the camera output had an unrecoverable error. The files affected could not be erased due to the error, but renaming the folder and creating a new one kept the system running until we could get a new disk and copy the rest of the files onto it. The old disk had been re-purposed from use as a portable backup for travel, and is about six years old, so it’s time to replace it, anyway.
After taking care of the disk issues, I revisited the Raspberry Pi failure: it turned out that the SD card that the Pi uses as the internal system drive had simply expired of natural causes. SD flash memory chips have a finite lifetime, and can be rewritten only so many times before becoming useless. The culprit here was the surveillance system software (which I wrote, so I only have myself to blame)–even though the camera photos, taken every 10 seconds, are written to the external hard drive, my program copied the latest one to the system disk, in the web space. Every 10 seconds, 8 to 18 hours a day, for a year and a half. That’s about 2 million writes, all in the same location, in addition to logging system activity. So, a simple fix to preserve the new system: put the web file on the external drive.
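The "about 2 million" figure can be checked with a few lines of arithmetic. The sketch below assumes 547 days for a year and a half and treats each copy of the latest photo as one write; the constant and function names are only for illustration.

```python
SECONDS_BETWEEN_PHOTOS = 10
DAYS = 547                      # roughly a year and a half


def writes(hours_per_day: int) -> int:
    """Copies of the latest photo written to the SD card over the whole period."""
    photos_per_day = hours_per_day * 3600 // SECONDS_BETWEEN_PHOTOS
    return photos_per_day * DAYS


if __name__ == "__main__":
    for hours in (8, 13, 18):   # shortest day, midpoint, longest day
        print(f"{hours:2d} hours/day: {writes(hours):,} writes")
```

This gives about 1.6 million writes at 8 hours a day, 2.6 million at 13, and 3.5 million at 18, so the estimate is the right order of magnitude.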
The discovery of the worn-out SD card meant that the old Raspberry Pi was still OK, it just needed a new system drive. About this time, I replaced my 3-year-old Android phone with an iPhone. I had installed an SD card in the old phone for photos, so I removed that, backed up the files, reformatted it, and built a new Raspbian “Jessie” operating system on it (the rest run the older “Wheezy” version), and booted up the once-dead Pi. Yeah!
This uses up nearly the last of the 8GB cards around the house, though I still have a 2GB card in an old Kodak camera that I use to document our Warm Showers bicycle visits. We have a few 16GB cards yet: the smallest cards on the general market (Costco, Best Buy, etc) are 32GB. I have one 64GB card, installed in a GoPro camera, which required installing an additional set of packages to handle the exFAT (no, not skinny: it’s an acronym for EXtended File Allocation Table) file system when copying files to Linux systems. New purchases tend to be the micro-SD footprint, since most new devices, plus phones and POV cameras, take those, and the older devices use adapters that are supplied (for now) in the package. Speed is important for high-resolution cameras and video devices. But, when cost is a factor, we still look for the lowest capacity and speed, as older devices have a size limit, and won’t operate with the new cards. In the age of mass-production, the devices themselves become obsolete while still functional because the supply of suitable storage media dries up.
So it goes–it has been said that Linux is free, but only if your time is worth nothing. It takes a lot of time to build a custom system, but the flexibility is enormous. Each machine takes on a personality of its own, as it develops different capabilities, selecting from among the many different distributions available and the thousands of software packages downloadable for free in addition to the basic system. Plus, the machines acquire a collection of custom scripts over time, that don’t exist anywhere else. As a software and web developer, having instant and free access to database engines, web servers, and many different programming systems is priceless.
When I need to run several different software systems or distributions, I can use virtual machines, running all versions at the same time, on the same physical machine. And, there are choices, with no buyer’s remorse penalties with free software. I’ve tried three different non-linear video editors, and stick with an older version of one (the new version isn’t compatible with the old project files…). The stock desktop systems come with web browsers and office productivity software, and several different graphical desktop systems, which can be chosen at login time.
The latest addition to our Linux/Unix obsession is the CHIP computer, by Next Thing, which I pre-ordered for $8 back in January and which arrived direct from the factory (in China) a couple of days ago. The CHIP is a bit smaller than the Rpi, with no HDMI (TV output), only one USB port, but a built-in 4GB flash drive, WiFi, and Bluetooth, which are all not included in the Rpi. The CHIP is low-power, has a battery connector (3.7v rechargeable battery not included), and can be programmed via the micro-USB power cable if connected to a laptop. This device is more suitable to mobile (read: robotic) applications, as it, like the Pi, also includes a number of digital/analog input/output circuits. And, being a full-featured Linux computer, it is more versatile than the Arduino micro-controller popular for hobby embedded applications. Unlike tablets and phones, which are powerful miniature computers in their own right, and microcontroller-based devices like thermostats and security systems, these small experimenter’s devices are completely programmable and physically extensible, becoming whatever tool your imagination can envision.
So it goes: in our 21st-century cottage (built in the early 20th), computing devices are as ubiquitous as light bulbs, with Windows becoming as irrelevant and obsolete as incandescent lights. But, “some assembly required” becomes “a lot of assembly, some compiling, and a bit of fabrication essential.” And, you may have to write your own documentation, operations manual, and maintenance plan, as well as some software.
The last Friday in July (today!) is the 16th annual system administrator appreciation day, an obscure celebration started in 2000 (by a system administrator, of course) as a response to an H-P ad showing users expressing gratitude to their sysadmin for installing the advertiser’s latest printer. To my knowledge, none of us have ever gotten flowers or even donuts on “our” day, but it does remind us in the profession that our job is to keep the users happy, mostly by keeping the machines happy, but also by attending to their needs in a prompt and professional manner.
I was reminded of the event not only by notices in the discussion forums and IT email lists, but by the fact that today, the replacement memory module for our network server came, and I installed it. A simple procedure, but one that takes a fair portion of the sysadmin’s bag of tricks and tools to accomplish. Bigger shops might have a service contract with the hardware vendor, but in many cases, the sysadmin is also the hardware mechanic.
For a few months, the server, a Dell T110, has been crashing every few weeks, fortunately not while we were on our two-month grand tour, but of concern, naturally, especially because it is a virtual machine host, and often has a half-dozen virtual machines running on it, which means, when the server goes down, half of our network goes with it. Virtualization is a great way to run different versions or distributions of operating systems when developing and testing software, so not too many have production roles in the network, but it is still an inconvenience to have to restart all of them in the event of a crash.
A red light appeared on the front panel of the server, indicating an internal hardware condition, so it was time to check it out. First, hardware designed for use as servers (the T110 is aimed at small offices like mine) is a lot more robust than the average tower workstation you might have on your desk. Note above the heavy-duty CPU heat sink (air baffles have been removed for access to the memory modules–the four horizontal strips above the CPU fins). In addition, big computers have little computers inside that keep track of the status of the various components, like memory, CPU, fans, and disk drives, and turn on the light on the panel that indicates the machine needs service. Server memory has error-correction circuitry, as do most server-quality disk arrays, but this is limited to one error–the next one will bring the system down.
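To make the "one error" limit concrete, here is a toy sketch in Python of the general idea behind error-correcting memory: a Hamming code with an extra overall parity bit, which can fix a single flipped bit and can only flag a second one. This is an illustration, not how the T110's memory controller is actually implemented; real server modules typically protect 64 data bits with 8 check bits, and the function names here (`encode`, `decode`) are made up for the example.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit word: 7 Hamming bits plus overall parity."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # checks positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit         # extra parity bit over the whole 7-bit word
    return word + [overall]


def decode(word):
    """Return (data_bits, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # position (1..7) of a single bad bit, or 0
    parity = 0
    for bit in w:
        parity ^= bit                  # 0 when the overall parity still checks out
    if syndrome == 0 and parity == 0:
        status = "ok"
    elif parity == 1:
        # Exactly one bit flipped: fix it (syndrome 0 means the parity bit itself).
        if syndrome != 0:
            w[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Syndrome is nonzero but overall parity is fine: two bits flipped.
        status = "uncorrectable"
    return [w[2], w[4], w[5], w[6]], status


if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    stored[4] ^= 1                     # first bad bit: silently repaired
    print(decode(stored))
    stored[1] ^= 1                     # second bad bit: detected, not repairable
    print(decode(stored))
```

The first flipped bit is repaired on the fly, which is the stage where the red light comes on while the machine keeps running. The second can only be detected, and a real server halts at that point rather than carry on with bad data.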
System administrators depend on these self-correcting circuits and error indications to schedule orderly shutdowns for maintenance, so that the machine doesn’t crash in the middle of the workday. For most offices, this means late evening or weekend work. For 24-hour operations, like web sites, it means shifting the load to one or more redundant systems while the ailing one is repaired, so no data is lost. Companies like Dell supply monitoring software to notify sysadmins of impending problems, which is vital to operations where there is a room full of noisy servers and the admins are in a nice quiet office in the back room. In our case, with just one server, we don’t use the monitoring software regularly, but it is useful for telling us which component the red light is for; then we can look up the location in the service manual and order the right part, and hope the system doesn’t crash before it arrives.
Normally, businesses that are thriving and need to stay competitive in the market replace their machines at least every three years. Others, like ours, that operate on a shoestring and buy whatever resources we need for a project when we need them, tend to run machines five years or more, sometimes until repair parts are no longer available: since we run Linux, we have machines eight years or older that are still useful for running some network services.
Our server is almost five years old, so when I order replacement parts, they don’t always look like the ones we took out, or have the same specifications. For that reason, I usually take replacement as an opportunity to upgrade, replacing all of a group of components with a new set, which I did when a disk drive failed a couple of years ago. However, this time, since I’m semi-retired and don’t have a steady cash flow, I only ordered one memory module, to replace the failing one. Memory comes in pairs, so having slightly different configurations in a pair causes the machine to complain on startup, but it still runs. The “upgrade” alternative would have replaced all four modules, or at least the two paired ones, with the larger size, at a cost of $150 to $300 instead of just replacing a $50 module and putting up with having to manually restart the machine on reboot.
So, the sysadmin not only needs to keep the machines running, but to keep them running within budget, and to make sure the operating systems and hardware capabilities can support the software users need to do their jobs. If he or she is doing their job right, there won’t be any red lights in the server room, and the sysadmin will look like they aren’t doing anything…