After the Storm: Powering up the Virtual Data Center

There is something to be said for colocation and cloud services, where organizations keep their data, and maybe even their physical servers, in remote facilities that share backup generators, redundant air handlers, multiple network paths, and distributed systems with fail-over redundancy.  But ultimately, unless you are a true road warrior who connects wirelessly from coffee shops and airline waiting areas to exclusively cloud-based or colocated resources, you will need to manage your own network in the face of long-term power outages.

Here at Chaos Central, we have the usual UPS units lurking under the desks, which we largely ignore until the power goes out and they start beeping.  The home office/small office versions of these small units more or less promise Uninterruptible Power to your computers, but they are best suited to the momentary power glitches that plague any power grid during uncertain weather, or simple human error at the control center: the lights blink, the power supplies beep, and computing goes on.  During winter ice storms and summer heat waves, when everything goes dark for minutes or hours, one of two things happens:

Ideally, you have a flashlight nearby so you can see the keyboards of the servers and workstations well enough to save your current work and run through the shutdown cycles (for those machines that don’t have software wired to the power supply to do this automatically) before the batteries run down.  If you have done your system design correctly, you have purchased enough capacity to run the systems for five to fifteen minutes on battery, long enough to shut everything down gracefully.

On the other hand, if you haven’t provisioned properly, or have not paid attention to the age of that black or beige box under your desk, the systems will shut down ungracefully, sometimes in mid-keystroke.  Batteries need to be replaced every three to five years; the units themselves continue to improve, and the electronics have also been known to fail, so replacing the entire unit is sometimes easier and not much more expensive.  A good rule of thumb is to buy a new UPS when you buy a new computer.  Laptops have a built-in UPS, the battery, so all they need is a surge protector.

In either case, orderly or disorderly shutdown, an extended outage means the UPS units need to be turned off and everything gets quiet.  When the power is restored, it all has to be turned back on manually.  (Tip: if you work from a home office and travel often during “the season,” it is a good idea to have a “network sitter,” someone you trust who can go to your house and turn the critical systems back on after an outage, if you need remote access to your network.  We’ve been left “out in the cold” several times over the years, and, yes, we have had a network sitter from time to time, and it has paid off.)

At Chaos Central, some systems come back on with line power, but others need to be started manually.  Our main server runs Citrix XenServer, which hosts a variety of systems, some used for network services and some for experiments and development projects, so we leave those to be started manually from the XenServer console.  The NFS shares and system-image shares, however, need to be reconnected from the XenCenter GUI, which only runs on Windows.  We keep a Microsoft Windows XP image (converted to a VM from an old PC that now runs Linux) in the virtual machine stack for that purpose, but to use it we first have to attach to it from a Linux system running XVP.  Finally, after all the network shares, DNS server, and DHCP server are started, we can boot up the rest of the VMs and physical machines.
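For the core-service VMs, the manual starts could also be scripted from the XenServer command line.  Here is a minimal sketch using the xe CLI; the VM name-labels are placeholders (list your own with xe vm-list), and the XE variable is overridable so the script can be dry-run without a host:

```shell
#!/bin/sh
# Sketch: start core-service VMs on a XenServer host after an outage.
# The name-labels below are hypothetical; list yours with: xe vm-list
XE="${XE:-xe}"    # overridable so the script can be tested without a host

start_vm() {
    echo "starting $1"
    "$XE" vm-start name-label="$1" 2>/dev/null || echo "could not start $1" >&2
}

# Network plumbing first, then everything that depends on it.
start_vm dns-vm
start_vm dhcp-vm
start_vm file-server-vm
```

Starting the network-service VMs in dependency order this way avoids the guessing game of which consoles to visit first after an outage.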

The next step in the process is to prime the backup system.  Here at Chaos Central, we use rsnapshot for backups, with an SSH agent that permits the backup server to access the clients.  The agent needs to be primed and the key passphrases entered into it.  The agent environment is kept in a file, which the cron jobs that run the backup process read.

ssh-agent > my_agent_env; source my_agent_env; ssh-add

starts the agent and puts the socket ID in a file, then sets the environment from the file and adds credentials to the agent.  Each backup client has the backup server’s public key installed in its .ssh/authorized_keys file ahead of time.  And, of course, since we use inexpensive USB drives instead of expensive tape drives, we need to mount the drives manually: those things never seem to come up fast enough to be mounted from /etc/fstab.  There is a setting in /etc/rsnapshot.conf to keep the backups from being written to the root drive in case you forget this step.
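For reference, the two pieces of glue might look roughly like the fragments below.  The paths, schedule, and mount point are placeholders, but no_create_root is the rsnapshot.conf option that refuses to run when the snapshot root is missing, i.e., when the USB drive is not mounted:

```
# /etc/rsnapshot.conf excerpt (fields must be tab-separated):
snapshot_root	/mnt/backup/snapshots/
no_create_root	1
```

```
# crontab entry: load the agent environment, then run the nightly backup
30 2 * * *  . $HOME/my_agent_env > /dev/null; /usr/bin/rsnapshot daily
```

With no_create_root set, an unmounted drive means the snapshots/ directory is absent, so rsnapshot aborts instead of quietly filling the root filesystem.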

We also permit remote logins to our bastion server so we can access our files while traveling.  After the system has been off-line for a while, we can’t count on getting the same IP address for our network from our provider, so we have a cron job that periodically queries the router for the current IP address, then posts any change to a file on our external web server shell account.  This also requires setting up an SSH agent.
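That cron job can be sketched as follows; the public echo service and the commented-out scp destination are stand-ins for our actual router query and shell account, not the real setup:

```shell
#!/bin/sh
# Sketch of a periodic external-IP check for cron.  The lookup method
# and the publish step are placeholders.
STATE="${STATE:-$HOME/.last_ip}"

current_ip() {
    # Substitute a query against your own router here; this version
    # asks a public echo service instead.
    curl -s https://icanhazip.com
}

check_ip() {
    new="$(current_ip)"
    old="$(cat "$STATE" 2>/dev/null)"
    if [ -n "$new" ] && [ "$new" != "$old" ]; then
        printf '%s\n' "$new" > "$STATE"
        # Publish the new address (placeholder host and path):
        # scp "$STATE" user@example.com:public_html/home_ip.txt
        echo "IP changed to $new"
    fi
}

# Run from cron, e.g. every 15 minutes:
# */15 * * * * /usr/local/bin/check_ip.sh
```

Keeping the last-seen address in a state file means the job only uploads when the IP actually changes, which keeps the cron noise down.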

Now, we are all up and running, usually just in time for the next power outage: here in the Pacific Northwest, winter ice storms usually result in several days of rolling blackouts.  Yesterday was up and down; today has been stable, except for a few blinks that don’t take down the network, so the backups are running and all the services are up.  Our son and grandsons have arrived with a pile of dead computers, cell phones, and rechargeable lighting, having endured two days of continuous blackout in their larger city.  The network is abuzz as everyone jacks in to catch up on work, news, mail, etc.

It’s a lot of work to run a multi-platform network in a home office, but it’s worth it when you consider that we didn’t have to scrape snow and ice off the car, shovel the driveway, or brave icy streets and snarled traffic to get to “the office.”

The benefits of telecommuting outweigh the chore of keeping up the network.