Category Archives: All things Unix

SUNset: The End of an Era

The last of our SUN SPARC machines heads for the recycle center

August 15, 2012 — The last of our SUN SPARC machines headed for the recycle center today. Woodrow, the Ultra 30, had been “cold iron” since we moved to Chaos Central three years ago, but Xavier, the Sun Blade 100, had been running until recently, when we extracted the disk drives, one of which lives on as an external drive on Zara, our main Linux desktop workstation. The two machines, along with Xavier’s CRT monitor (also still operable, but with a resolution only supported by Solaris), were accompanied by five defunct UPS units, each deemed either too obsolete to be worth new batteries or suffering from failed electronics.

This weekend, the Unix Curmudgeon is scheduled to migrate the last Solaris SPARC production system to a new Linux virtual machine at our major client site, which will effectively end 22 years of computing on the venerable RISC platform. While the SPARC line of CPUs lives on under Oracle’s mantle, and the Solaris operating system (Intel version) still lurks as a client on one or more virtual machine hosts, SUN Microsystems is now only a footnote in the history of computing. Our association with SUN–which was one of the first Silicon Valley startups, emerging from the Stanford University Network (SUN) in 1982–began in the spring of 1990, with a SPARCstation 2 running SunOS 4, in a project room at Seattle University. The machine was part of the development of the Proteus project, an early compute cluster, for which our Master of Software Engineering project team was building the operating environment.

After our portion of the Proteus project was completed (a spectacular demonstration of parallel programming in a simulation of the hypercube-connected 1000-CPU cluster), my next encounter with SunOS and SPARC, and my introduction to Solaris, came when I took a contract assignment as system administrator for the U.S. Army Corps of Engineers, managing a SPARCstation 20 running SunOS 4, and Sun Netra and Ultra 2 systems running Solaris 5.5.1. When I moved on to Concurrent Technologies, which had similar hardware, including a Solaris 6 system, I picked up a refurbished SPARCstation 20 for my home office and loaded Solaris 7 on it. At the end of my assignment with CTC, I installed a pair of Ultra 10 workstations as servers in a co-location site in San Diego. At the University of Montana, I installed a Sun Enterprise 250, then moved on to the NIH’s Rocky Mountain Laboratories, which had a pair of Sun Enterprise 450 servers, which were in turn supplanted by a Sun Enterprise 880V, which I later replaced with a Sun Ultra Enterprise T5220, the machine now being taken out of production, having been displaced by Linux clusters and VMs.

Meanwhile, in my home office, I had added the Sun Blade 100 and replaced the SPARCstation 20 with the Ultra 30.  At “the Lab,” I also had used a Blade 100 as a workstation and replaced it with an Ultra 20 (new generation, not to be confused with the old Ultra line), and had tended a group of servers and workstations, an Enterprise 220R and several Ultra 2000 and 3000 workstations, that had been “donated” from projects at the main NIH campus in Bethesda.  Most of the SPARC machines were simply “retired,” as they were replaced with larger systems or passed beyond their supported life spans and replaced with newer models, or simply outlived their projects.

The S20 and U30s in my own office were the only ones that succumbed to “natural causes,” having expired from running too long in un-air-conditioned spaces (S20) or being powered down too long (U30). It was hard to let any of these faithful workhorses go, but eventually one must make room for new technology and move on. It is doubly sad that the iconic SUN logo will no longer grace any of the systems in the stable, either at Chaos Central or the client sites. After 22 years of seeing the familiar blue-and-beige boxes at home and work, we will miss them.

Drivers Wanted — Call 1-888-GOT-UNIX*

*not a real phone number

The English language has been characterized as one that not so much borrows from other languages as outright steals, remakes, and repurposes words to suit one or many purposes. So, the “Drivers Wanted” signs on the back of 18-wheeler trailers have a different meaning to Unix system administrators. But, just as the long-haul trucker operates the tractor that hauls your load to where it is needed, a Unix device driver is a piece of software that gets your data to and from an external device, like a hard disk, DVD drive, or printer.

Recently, Chaos Central experienced several close encounters of the frustrating kind with device drivers. Unix device drivers have always been a challenge, but the growth of open source software has made them even more so. Device drivers have traditionally been provided by the manufacturers of computer equipment, either as source code or operational specifications released to commercial Unix vendors (like Oracle and Apple) under iron-clad non-disclosure agreements, or in binary form in the case of Windows computers. The problem arises with Linux: in order to be distributed under the GNU open-source licensing, all components must be open source and freely redistributable, and many device drivers are not. At best, drivers come as object files and header files that can be linked against the kernel at its current patch level.

Device manufacturers develop and deliver driver software in binary form for Microsoft Windows, because they sell most of their products for use with computers running Windows. However, since the hardware is proprietary and the software required to interface with it reveals much of the inner workings and design, vendors are reluctant to release the source code to Linux developers. And, because the Linux market is relatively small, they are also reluctant to produce binary versions compatible with Linux. In cases where binary drivers for Linux do exist, they are specific to a kernel revision. Since the Linux kernel is frequently patched and has at least a minor revision release every six months, keeping a complete configuration inventory is problematic, even for systems that are supported, which brings us to the first pothole in the information highway… And then, there is the problem of keeping drivers current with firmware changes in the hardware. The devices themselves are often programmable circuitry, with flash memory. Changes to firmware usually also require changes in the host driver.

Fiber Optic/Multipath Disk Drivers

It is no secret by now that the majority of Internet servers and almost all high-performance computing systems run Linux. High-end server hardware is most often attached to high-end storage systems as well, most of which achieve high throughput by using fiber optic connections. So, most Fibre Channel Host Bus Adapters do have device driver support for Linux, either available directly from the vendor or, in the case of supported Linux distributions, as part of a paid service package from the Linux provider. But, these choices also present what is basically a fork in the configuration management of the drivers: for instance, the Red Hat version does not always match the QLogic version. The reason for this is complex, but usually the hardware vendor wants to keep the drivers and firmware as up-to-date as possible to correct problems and support newer platforms and storage devices, while the operating system vendor wants to keep older systems running smoothly as well as supporting newer ones. Both need to balance the cost of maintenance against the number of installed systems. Inevitably, combinations of hardware and software exist that are not supportable, or that require some additional work by the system administrator to maintain a working system.

Thus, it came to pass that, during a routine patch-level upgrade to bring a pair of systems to the same Red Hat 5.8 level, a warning popped out on the console, announcing that components of the fiber-optic driver were missing. Fortunately, Linux systems are built in such a way that the system software libraries maintain compatibility with all patch levels of a particular kernel release: Red Hat does not change the kernel base version within a major distribution release, so it is possible to update the system software without updating the kernel patch level, and the drivers do not need to be rebuilt. It was a simple matter of editing the grub.conf file to reboot using the old kernel version (which is why Linux retains the older kernels in the Grub menu). Even though most device drivers are now loadable at run time, any driver needed during the boot process must be integrated into the boot image. In the case of these machines, the boot disk appears to be local, but the driver is needed to mount the data drives for the server application. Being able to build a boot image with the driver included is usually a good indication that the driver will load when needed. Also, many high-end systems use multiple paths between the host bus adapter and the storage array, both for fail-over and to increase throughput. Multipathing needs to know which devices support it, so even if the system will boot without the device driver, multipathing may not work. In most Linux systems, the device mapping differs between multipath and non-multipath configurations, so the correct devices will not be available if the drivers are loaded in the wrong order.
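The fall-back itself is a one-line change. A minimal sketch of a RHEL 5-era grub.conf, with the kernel versions and disk paths as assumed examples:

```
# /boot/grub/grub.conf (illustrative fragment; kernel versions are examples)
default=1        # entries are numbered from 0, so this boots the SECOND
                 # title below -- the older kernel with a working boot image
timeout=5

title Red Hat Enterprise Linux Server (2.6.18-308.el5)   # new kernel, driver missing
        root (hd0,0)
        kernel /vmlinuz-2.6.18-308.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-308.el5.img

title Red Hat Enterprise Linux Server (2.6.18-274.el5)   # previous, known-good kernel
        root (hd0,0)
        kernel /vmlinuz-2.6.18-274.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-274.el5.img
```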

But, it is desirable to have the kernel patches installed as well, so the problem of building a boot image with the driver included had to be solved. There is a utility package for rebuilding the current image, but that presupposes that the system would actually reboot without the driver installed. A bit more research turned up a manual procedure for building a boot image from any kernel, which appeared to work. To complicate matters, this upgrade was for a remote client, so standing at the console to select a fall-back kernel in case the reboot was unsuccessful was not an option. In these cases, it is necessary to have someone standing by at the physical machine to intervene during the reboot if necessary. So, final resolution awaits the next maintenance window for the system in question.
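The manual procedure boils down to running mkinitrd against the target kernel version rather than the running one. A sketch, not something to run as-is (it needs root and the target kernel installed, and the qla2xxx module name is an assumed example; substitute your HBA’s driver):

```shell
# Build a boot image for a kernel we are NOT currently running,
# forcing the fibre-channel driver into the image
KVER=2.6.18-308.el5                     # target kernel version (example)
mkinitrd --with=qla2xxx -f /boot/initrd-${KVER}.img ${KVER}

# Sanity check: confirm the driver module made it into the image
zcat /boot/initrd-${KVER}.img | cpio -t | grep qla2xxx
```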

Printing

Meanwhile, in our own office at Chaos Central, we have been struggling with printing problems for months. Our aging Xerox color laser printer has been printing slowly or simply hanging in mid-print for some time. Canceling the job, then turning the printer off for 30 seconds to clear the queue and back on again worked, sometimes. And, when the printer went into “Power Save” mode, it wouldn’t wake up when we sent it a print job. We had recently replaced the expensive print cartridges, so the budget wouldn’t support a new printer right now. But, the symptoms seemed to indicate a driver problem more than anything else, since the problems seemed to crop up with a recent system upgrade.

Printing from Unix has always been an issue. From early on, PostScript has been the de facto common printer language, so we were faced with buying PostScript-capable printers in the 1990s. With the advent of CUPS, the Common Unix Printing System, it is possible to use almost any printer for which a CUPS PPD file is available, and to use the job control features in various PostScript printers. Again, vendors defer to Windows for their proprietary print drivers and send the Linux user elsewhere for Linux “drivers.” However, not all PPD files are created equal, and they change from time to time.

Before throwing out the old printer, we decided to try bypassing CUPS entirely. Most network printers support multiple printing protocols, one of which is FTP, the venerable File Transfer Protocol, used since the days of dial-up modems but deprecated in the Internet age because it lacks encryption. Normally, the Network Police (i.e., the network security administrators) insist that this feature be turned off. But, with it turned on, it is a simple matter of uploading a PostScript file to the printer, using anonymous FTP. Which we did; and, voila!, it printed.
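For the record, the whole trick looks roughly like this (the printer’s address and the file name are made up for illustration, and it obviously needs a live printer on the other end):

```shell
# Push a PostScript file straight to the printer over anonymous FTP,
# bypassing CUPS entirely (192.168.1.50 is an example address)
ftp -n 192.168.1.50 <<'EOF'
user anonymous guest
binary
put report.ps
quit
EOF
```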

So, a search for a new printer driver ensued. We did find one at the Xerox site, with a PPD file several times larger than the generic one that came with the latest CUPS package. And, the printer now wakes up from a sound sleep and prints! Yes, it is still very slow, but it does print. The slow printing seems to be associated with the different pieces of software that render the print selection to PostScript, but we can at least produce output more or less on demand.
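Hooking a downloaded PPD into CUPS is essentially a one-liner with lpadmin; the queue name, printer address, and PPD path below are assumptions for illustration:

```shell
# Define (or redefine) the print queue using the vendor PPD
lpadmin -p xerox-color -E \
        -v socket://192.168.1.50:9100 \
        -P /usr/share/cups/model/Xerox/xerox-phaser.ppd

lpstat -p xerox-color    # verify the queue is up and enabled
```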

Network

It has been some time now since we’ve had network problems, but getting connected “on the road” was a major problem until recently, as Linux drivers were not available for a lot of wireless cards bundled with laptops. For a while, the main workaround was to run a utility (such as NDISwrapper) that mapped the system calls in Windows drivers to Linux system calls, with variable results. Fortunately, many vendors now provide support for Linux, and it is possible to buy laptops made to run Linux. But, as recently as 2011, using our old (ex-Windows) systems, we had a problem with the connecting layer between the driver and the I/O that reset the network connection every couple of minutes in an attempt to stay connected to the strongest signal. In hotels or congested areas where multiple access points broadcast on the same channels, this effectively precluded any kind of meaningful network experience and certainly made persistent connections (like file transfers or remote administration) impossible. This issue seems to have been corrected in later versions of the wireless software, but there is always the next surprise waiting in the world of Linux device drivers.

Virtues of Virtualization

Well, it finally happened. We keep a secure login to our home network open just in case we need files that aren’t on the laptop, or to use as a relay point when locked behind a customer’s firewall, or, well, because we can. While on travel, our gateway machine at home ceased responding. Uh-oh. When we got home, our fears were confirmed: the power supply had failed. It just so happens that the gateway machine hosted some virtual machines that we need for vital business, so we had to recover it quickly. It’s not a good plan to have valuable data on your gateway machine: that, too, has been corrected–we moved the gateway to another box.

In most of these cases, we might be faced with installing the software (which, unfortunately, was a Microsoft Windows application, so there are licensing and compatibility issues) on another machine, then restoring the data from backup, and so on. Furthermore, we discovered that, while the user files and configurations were backed up, the volume containing the virtual machine images (along with the data) was not. Oops, again. Although we do manually back up the data from time to time, it was a bit overdue. But, the disk was still good, so we popped the hard drive out of the failed unit, opened up another machine, plugged it in, and transferred the virtual machine image and configuration files to the second machine.

Voila! In the time it took to copy a 20GB file (we build virtual machines with the smallest virtual disks practical, as they are usually special-purpose machines anyway), we had recovered the application and the data. However, it wasn’t convenient to run it from the new host, so a few days later we made room on another system and transferred the virtual machine image once again, this time across the network. Of course, we immediately included the virtual disk images in the backup scheme, and changed our procedures so that we shut down the VM when not in use, to get a clean backup image.
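The by-hand migration is nothing fancy. A sketch for a VirtualBox guest (host names, guest name, and paths are made-up examples; a XenServer guest would go through the export/import tools instead):

```shell
# Copy the guest's directory -- disk image plus machine definition --
# to the new host, then register and start it there
rsync -avP "$HOME/VirtualBox VMs/winxp/" newhost:"VirtualBox VMs/winxp/"

# On the new host:
VBoxManage registervm "$HOME/VirtualBox VMs/winxp/winxp.vbox"
VBoxManage startvm winxp
```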

One of the reasons we hadn’t been backing up the virtual disk image is that, when the system is running, the image is inconsistent and might not be recoverable anyway, unless we can snapshot it (i.e., capture a static image that can be backed up and recovered). With most systems, we simply run a backup client (which, for rsnapshot, is simply the SSH daemon that is usually on anyway) on the virtual machine. But, we don’t usually run an SSH service on Windows, so a different backup system needs to be implemented for Windows guests. There are a number of operating-system-agnostic backup software systems available, including several open-source ones, but they aren’t as convenient as what we use. However, losing valuable data is extremely inconvenient, so we need a different approach.

On our portable systems, we use Oracle’s VirtualBox to run our virtual machines. VirtualBox is intended for desktop use: the VM is the property of a single user and can be easily migrated from system to system, or hosted on networked storage and launched from one of several different workstations. The most frequent use in this case is to virtualize Microsoft Windows systems for running those few applications for which we do not have an equivalent Linux application or which will not run under Wine, the Windows compatibility layer. For training, we often use VMware appliances, which are also easy to install and migrate. Within our network, the network services–such as DNS, web servers, and file servers–and development for multiple Linux, BSD, and Solaris distributions are virtualized on Citrix XenServer, running on a server-class machine dedicated to hosting virtual machines.

Recently, we attended a seminar on Ganeti, which is Google’s answer to keeping virtual machines running all the time.  We are thinking of migrating our systems to Ganeti, a cluster management system for virtual machines that keeps mirrored copies of virtual machines on multiple servers in a cluster, so that the VM is always available, even if any single node fails.  And, if hosted on three or more machines, any two nodes can fail without loss of data or incurring downtime.  This will solve the issue of backing up non-Unix VMs and the several hours of downtime needed to restore a backup to another system.

Virtualization is the future of computing, where we depend on having all our data available all the time, or need multiple systems but have desktop space for only one.  There are some performance and technical issues, such as enabling audio and accessing optical and flash drives, but using local virtualization like VirtualBox and VMware appliances on a workstation helps solve that problem, as long as the VMs get regular snapshots or are shut down during system backup times.

After the Storm: Powering up the Virtual Data Center

There is something to be said for colocation and cloud services, where organizations keep their data and maybe even physical servers in remote facilities that share backup generators, redundant air handlers, multiple network paths, and distributed systems with fail-over redundancy.  But, ultimately, unless you are a true road warrior and connect  wirelessly from coffee shops and airline waiting areas to your exclusively cloud-based or colocated resources, you will need to manage your own network in the face of long-term power outages.

Here at Chaos Central, we have the usual UPS units lurking under the desks, which we largely ignore until the power goes out and they start beeping. The home office/small office versions of these units, which more or less promise Uninterruptible Power to your computers, are best for those momentary power glitches that plague any power grid during uncertain weather or simply human error at the control center: the lights blink, the power supplies beep, and computing goes on. During winter ice storms and summer heat waves, when everything goes dark for minutes or hours, one of two things happens:

Ideally, you have a flashlight nearby so you can see the keyboards of the servers and workstations well enough to save your current work and run through shutdown cycles (for those machines that don’t have software wired to the power supply to do this automatically) before the batteries run down. If you have done your system design correctly, you have purchased enough capacity to run the systems for five to fifteen minutes on battery, long enough to shut down the systems gracefully.

On the other hand, if you haven’t provisioned properly, or if you have not paid attention to the age of that black or beige box under your desk, the systems will shut down ungracefully, sometimes in mid keystroke.  Batteries need to be replaced every 3-5 years: the units themselves continue to improve, and the electronics also have been known to fail, so replacing the entire unit is sometimes easier and not that much more expensive.  A good rule of thumb is to get a new UPS when you buy a new computer.  Laptops have a built-in UPS, the battery, so all you need is a surge protector.

In either case–orderly or disorderly shutdown–in an extended outage, the UPS units need to be turned off and everything gets quiet.  When the power is restored, it all needs to be turned back on manually.  (Tip:  if you work from a home office and travel often during “the season,” it is a good idea to have a “network sitter,” someone you trust who can go to your house and turn the critical systems back on after an outage, if you need remote access to your network: we’ve been left “out in the cold” several times over the years, and, yes, have had a network sitter from time to time and it has paid off).

At Chaos Central, some systems come on with the line power, but some need to be manually started as well.  Our main server is a Citrix XenServer, which hosts a variety of systems, some of which are used for network services and some for experiments and development projects, so we leave those to be manually started from the XenServer console.  But, the NFS shares and system image shares need to be connected from the XenCenter GUI, which only runs on Windows.  We keep a Microsoft Windows XP image (converted to VM from an old PC that now runs Linux) in the virtual machine stack for that, but, in order to use it, we have to attach to it from a Linux system that runs XVP.  Finally, after all the network shares, DNS server, and DHCP server are started, we can boot up the rest of the VMs and physical machines.

The next step in the process is to prime the backup system.  Here at Chaos Central, we use rsnapshot to do backups, with SSH agents to permit the backup server to access the clients.  The agent needs to be primed and passphrases entered into it.  The agent environment is kept in a file, accessed by the cron jobs that run the backup process.

ssh-agent > my_agent_env; source my_agent_env; ssh-add

starts the agent and puts the socket id in a file, then sets the environment from the file and adds credentials to it. Previously, each backup client has had the public key installed in its .ssh/authorized_keys file. And, of course, since we use inexpensive USB drives instead of expensive tape drives, we need to mount the drives manually: those things never seem to come up fast enough to be mounted from /etc/fstab. There is a setting in /etc/rsnapshot.conf (no_create_root) to prevent writing to the root drive in case you forget this step…
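Pulling the pieces together, the cron entry and the rsnapshot safeguard look roughly like this (file names and the schedule are our local conventions, not defaults):

```
# /etc/cron.d/rsnapshot-daily -- nightly run, reusing the primed agent
30 2 * * *  root  . /root/my_agent_env >/dev/null; /usr/bin/rsnapshot daily

# /etc/rsnapshot.conf -- fields are TAB-separated; no_create_root keeps
# rsnapshot from writing to the root drive when /mnt/backup isn't mounted
snapshot_root	/mnt/backup/
no_create_root	1
```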

We also permit remote logins to our bastion server so we can access our files while on travel.  After the system has been off-line for a while, we can’t count on getting the same IP address for our network from our provider, so we have a cron job in the system that queries the router periodically for the current IP address, then posts changes to a file on our external web server shell account.  This also requires setting up an SSH agent.
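A sketch of that watcher script (the router address, shell account, and paths are made-up examples; only functions are defined here, and cron calls publish_wan_ip on its schedule):

```shell
# Pull the first dotted-quad out of whatever the router's status page returns
extract_wan_ip() {
    grep -Eo '[0-9]{1,3}(\.[0-9]{1,3}){3}' | head -n 1
}

# Query the router; if the WAN address has changed, record it and post
# it to the shell account on the external web server
publish_wan_ip() {
    current=$(curl -s http://192.168.1.1/status.html | extract_wan_ip)
    [ -z "$current" ] && return 1
    if [ "$current" != "$(cat /var/tmp/last_wan_ip 2>/dev/null)" ]; then
        echo "$current" > /var/tmp/last_wan_ip
        scp /var/tmp/last_wan_ip shellacct@webhost.example.com:public_html/ip.txt
    fi
}
```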

Now, we are all up and running, usually just in time for the next power outage: here in the Pacific Northwest, winter ice storms usually result in several days of rolling blackouts.  Yesterday was up and down, today has been stable, except for a few blinks that don’t take down the network, so the backups are running and all the services are up.  Our son and grandsons have arrived with a pile of dead computers, cell phones, and rechargeable lighting, having endured two days of continuous blackout in their larger city.  The network is abuzz as everyone jacks in  to catch up on work, news, mail, etc.

It’s a lot of work to run a multi-platform network in a home office, but worth it, when you consider that we didn’t have to scrape snow and ice off the car, shovel the driveway, or brave icy streets and snarled traffic to get to “the office.”

The benefits of telecommuting outweigh the chore of keeping up the network.

Teaching New Dogs Old Tricks: Ubuntu for Unix hacks

Here at Chaos Central, we’ve been wildly excited about our new crop of Ubuntu 11.10 Linux computers from ZaReason, the arrival of which was discussed here a couple of weeks ago. But, thrilled as we are with the speed and capacity of these boxes, there is a learning curve–for the box, not the user. Unix, and, by extension, Linux, has been evolving for more than 40 years now, and has accumulated a huge library of Useful Things, along with a history of competing distributions, each with its own “flavor” and set of favorite tools.

In the beginning, Unix was a toolbox for scientists and engineers to build things quickly and cheaply: initially, document processors and utility software, and, later, the Internet itself. The growing popularity of Linux, and particularly the Ubuntu distribution, has driven the need to make it useful for the things “ordinary” people–i.e., those who don’t make a living from tweaking the innards of Big Iron computing–like to do, like surf the net and manage their music, video, and image collections. And, since Linux is still the answer to what to do with your old PC when it gets too bloated with malware and spyware, the popular distributions still need to fit on a CD.

Most industrial-strength distros now come on DVDs, sometimes more than one, containing the entire Linux collection of free software. But, for the masses, armed with CD-only PCs, something has to give, and it is often the venerable, legacy Unix features that most home users will never need. Many of those, though, are what we’ve been lugging around on our hard drives for 20 years or more: venerable editors like emacs and vi, superseded by menu-driven simple editors or integrated graphical development environments, but still more powerful and with more features than anyone will ever use. The ones I’ve learned are extremely useful and well-practiced enough to be second nature, so I keep using them, even if they don’t come with the system anymore. And, more recently, enterprise-level tools for building massive compute clusters, like GridEngine and MPICH2, along with software development libraries and specialized utility libraries for science and engineering. We also need a lot more development tools than come with a standard desktop, since we develop software for the web and for high-performance computing clusters.

Fortunately, the Ubuntu software repositories have a lot of those tools packaged up and loadable from the Software Center application, so we don’t need to go through the ritual of downloading, unpacking, configuring, compiling, and installing nested sets of dependent programs as in the old days. But, our old computers have accumulated a unique set of software over the four or five years of their busy lives, so the new ones have a lot to learn: we first had to load the no-longer-included Synaptic Package Manager to grab some of the software libraries and utilities not available in the Software Center catalog. And, of course, get rid of that silly Unity desktop, which only works well for folks who do one thing at a time with their computers. We have to have lots of toolbars visible and lots of workspaces to which we can jump with a single click, which Gnome gives us.

Not surprisingly, some of the more esoteric and least-used packaged software still has a few surprise unresolved dependency issues. To my delight, GridEngine, a distributed job control system for compute clusters created by Sun Microsystems a dozen or more years ago, was available in the Software Center. Since Oracle bought Sun a couple of years ago, a lot of these tools have disappeared from the free download list at Oracle, folded back into the supported product lines, and the old packages are sometimes hard to find.

GridEngine is one of those transitional systems that, unlike the new applications designed to run under the Gnome or KDE desktop management systems, was born and developed during the days of OpenWindows (contemporary with Microsoft Windows 2) and the Common Desktop Environment (CDE, which predates Windows 95), both high-end X Window System desktop managers in their day. X11 programs used to be a lot harder to write, and were designed for networking more than for having the client and server on the same workstation, so they tended to leverage the Unix philosophy of lots of little programs, each doing one thing well, working together, much more than the more integrated and abstracted desktop applications of today.

GridEngine is more likely to be installed on an Ubuntu machine as a client or, at most, an execution node in an ad hoc cluster, rather than a master host, so it would not usually run the graphical grid manager, qmon. But, Open Source being what it is, the whole package is available. However, the “just works” philosophy of Ubuntu breaks down here, as the dependencies of the archaic and arcane OpenWindows flavor of the graphical component aren’t checked very thoroughly, and there is a bit of a problem. The application depends on the X11 font server, a client-server application designed to facilitate running X11 clients on one machine and displaying them on an X11 server that might not have all of the requisite fonts loaded. Also, because CDE relied heavily on licensed Adobe fonts, the chain of dependency gets broken when it comes to fitting old non-GPL’d software into a Linux distribution.
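For reference, the GridEngine pieces we pulled from the repositories (package names as we recall them from the Debian/Ubuntu archive of the time, so treat them as assumptions; qmon is the piece that drags in the legacy X11 font dependencies):

```shell
# Submit/compute-node pieces
sudo apt-get install gridengine-client gridengine-exec

# The graphical manager -- this is the part with the OpenWindows-era
# font-server dependencies that need extra packages to satisfy
sudo apt-get install gridengine-qmon
```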

When you get a GNU/Linux system distribution, everything in it is licensed under the GNU General Public License. You can install anything you want in addition to that, but you can’t package the extras and redistribute them. This also extends to the packaging system. The Ubuntu Software Center has provisions for adding non-free (i.e., non-GPL) software repositories, but they aren’t going to interact with each other, so complex packages like GridEngine, which depend on non-free components, come with “some assembly required.” In my case, someone had already solved the problem for Ubuntu, so a Google search turned up a list of the missing packages and how to install them. I’ve used GridEngine for years, but on Solaris and Red Hat Linux systems. Solaris, of course, was licensed from Sun (now Oracle) and had full support. The old Sun GridEngine packages for Linux came with the non-free fonts and dependent packages integrated, because you got them from Sun–they weren’t on any of the five or six CDs (now two DVDs) that comprise the full Red Hat Enterprise Linux (or, for many of us who don’t need any hand-holding support from Red Hat, CentOS).

So, at Chaos Central, the new dogs are gradually getting housebroken and have largely quit chewing on the furniture–that is, they have learned enough that they don’t respond like HAL 9000 from “2001: A Space Odyssey” (“I’m afraid I can’t do that…”) when asked to perform tasks the other computers have been doing for years. They’ve even learned, with larger memory and faster processors, to do new things.