Traffic Congestion on the Information Highway, redux

A while back, we wrote about difficulties getting a clear wireless channel when traveling, interference between motels and truck stops, time limits on sessions, etc. We also noted similar issues with DSL, and suspected that the providers have either oversold the capacity or exceeded the limits of the infrastructure.

A few months ago, we had a flurry of DSL session disconnects, which resulted in frequent changes to our WAN IP address. Not normally a big deal, but, since we depend on getting access to our office computers when on travel, it was huge. At the time, $ISP suggested our modem was at fault, and we spent some of our precious computing budget on a new one. Funny thing, the connection improved before we got the new one installed. But, frequent resets still happened, so we wrote and installed a script that queried the router once an hour, parsed out the address, and posted it to our outside web server.
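A minimal sketch of that hourly job looks something like the following; the router status URL and the upload destination shown in the comments are placeholders, not our actual setup:

```shell
#!/bin/sh
# Pull the first dotted-quad IPv4 address out of whatever the router's
# status page returns.
extract_ip() {
  grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | head -n 1
}

# Run hourly from cron (URL and hostname below are examples only):
#   curl -s http://192.168.0.1/status.html | extract_ip > wan-ip.txt
#   scp -q wan-ip.txt user@example.com:public_html/wan-ip.txt
```

The parsing is deliberately crude; any line of router output containing a dotted quad will do, which is why it runs against the status page rather than a specific field.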

A couple of weeks ago, the troubles came back. This time, $ISP offered to sell us a static IP address, which solved the problem of getting into the system from outside, but the resets became so frequent and the transfer rates so slow in between that it was impossible to connect to a remote computer, download a file, or even read mail. Of course, weekends, when tech support is not available, are the worst, so we took the time to take some measurements and run some tests on our own network and equipment.

Because the throughput speeds and connection times varied so widely, from normal in the early morning to effectively dead in the evening, we concluded that the problem was not in our network, house wiring, or anywhere except in the greater phone system. So, after the usual fumbling handoff from the standard help desk service to more technical expertise, we finally got an admission from $ISP that they had, indeed, grossly over-committed the capacity and were in the process of bringing Twenty-first Century technology back to our fair city, after subjecting us to weeks of 1991-era data transfer speeds.

During this outage, our data rates, on a nominal 7Mbps DSL connection, dropped from the expected 5.5Mbps to 20Kbps, about the same as the 19,200-baud modem we bought in 1993. Except, in 1993, $ISP didn’t drop the connection every two minutes, so our actual data rate in 2011 has been closer to the 2400-baud rate of 1991, before we upgraded to 9600 baud. In those years, we used UUCP (Unix-to-Unix CoPy) to fetch email from a Unix server in Seattle, long-distance (nationwide flat-rate plans were not available then). All of western Washington was area code 206, but long distance rates applied from town to town. There was no World Wide Web, but you could chat with other computer users and look up information through dial-up bulletin boards and commercial services like CompuServe and America Online (AOL).

The arrival of the Internet in the mid-1990s brought a crisis in the phone system similar to the current one, as service providers ordered hundreds of lines to connect dial-up users to the Network, and computer power users ordered second phone lines for their computers. Crews could not dig trenches and lay cable fast enough to stem the tide. There was no relief until fiber optic trunk lines and the phone system turned from point-to-point analog connections to the digital packet-switched network we have today. And then came DSL. The phone companies pushed out fiber further and further to make high-speed over copper possible in those last few hundreds of meters, but clearly should have figured on having the capacity used up quickly, after two decades of unceasing growth.

Interestingly enough, while shopping around for alternatives to what is essentially an unusable service, I see that the same $ISP who admits to grossly overselling capacity, and who is installing new capacity to correct it, is at the same time selling even faster services to accommodate the growth of on-demand audio and video and ever more detailed on-line games, which will quickly use up the new circuits supposedly put in place to fix the current overload. And so it goes. All this leaves me wondering: once the service level I pay for has been restored, how long will it be before the information highway becomes gridlocked again?

Taming the Office Network

We know a few folks who are still on dial-up Internet access at home, but most of us, at least those of us who are “in the business,” have DSL or cable, some sort of high-speed broadband connection.  The service providers originally intended one connection, one computer.  But, many households now have more than one computer, and, in the case of Chaos Central, where we run two businesses, one of which builds software and web sites and manages other networks, we have “many,” most of which are virtualized, but look, to the network, like individual computers.

The standard DSL or cable modem now comes configured as a network server appliance, with Network Address Translation (NAT), Domain Name Service (DNS), and Dynamic Host Configuration Protocol (DHCP).  But, as an appliance, the little modem does a less-than-adequate job in each area, with limited control available to the user via a web menu.  At Chaos Central, we have been gradually migrating these network functions to “real” (Linux/Unix, of course) servers, for both performance and control.

Back at the Rocky Mountain Nexus of Chaos Central, in Montana, we used a community-wide wireless, established in the pre-DSL days.  The radio connection was configured as a bridge, so we built a FreeBSD-based router to handle the NAT functions, and put DNS on an internal server.  A separate wireless bridge/router handled DHCP for laptops and such, but the rest of the network had static addresses.

When Chaos Central’s West Coast Nexus coalesced, both the FreeBSD router and the DNS server were casualties of the move, so we relied on the [new] DSL modem for network services, assigning static addresses outside the DHCP scope for servers and workstations that needed to be accessed through SSH.  But, as the stable of virtual servers proliferated, the shortcomings of the DSL modem as a network appliance became painfully obvious.

NAT works by assigning a port to each client connection, through which to tunnel requests in and out of the system.  The first DSL modem we had didn’t keep track of which client requested what on which port for some services, so protocols like FTP, which listen on one port and transmit on another, didn’t work unless the ports were explicitly forwarded to the specific client that needed to use FTP.  This wasn’t a big deal at the time, since the Unix side of the business uses mainly SSH and most public download services offer a choice of HTTP or FTP.  So, NAT, while not smart, works, most of the time…

DNS became an issue once there were too many physical and virtual servers to keep track of in /etc/hosts (LMHOSTS for Windows clients).  The “little appliance that almost could” uses dynamic DNS, by which the client offers its name to the server, so machines can find each other by name.  But, the user interface doesn’t allow a lot of configuration options, so it had to go.

When setting up a private LAN DNS zone, we like to use the form  “company.lan” and generate named.conf.local files and zone files accordingly.  In these, we list the static addresses and server names, and also assign names like “dhcp-2” to enough addresses in the DHCP scope to cover the likely number of clients.  Printers, wireless routers, and portable machines are easier to use as DHCP clients, and, as we shall see later, DHCP with reserved addresses can simplify static assignments as well.
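As a sketch only (the serial number, hostnames, and addresses below are examples, not our production files), the named.conf.local entry looks like:

```
// named.conf.local -- example only; adjust zone name and file path
zone "company.lan" {
    type master;
    file "/etc/bind/db.company.lan";
};
```

and the matching zone file like:

```
; /etc/bind/db.company.lan -- example names and addresses
$TTL    86400
@       IN      SOA     ns1.company.lan. admin.company.lan. (
                        2011020101      ; serial (date-based)
                        3600            ; refresh
                        900             ; retry
                        604800          ; expire
                        86400 )         ; minimum TTL
@       IN      NS      ns1.company.lan.
ns1     IN      A       192.168.1.2
files   IN      A       192.168.1.3
dhcp-2  IN      A       192.168.1.102
dhcp-3  IN      A       192.168.1.103
```

The dhcp-N names simply give every address in the dynamic range a stable name, so DHCP clients show up sensibly in logs without Dynamic DNS.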

Which brings us to DHCP.  The modem’s DHCP doesn’t allow for a lot of configuration.  Despite offering to do so through the interface, it just didn’t seem to “want” to substitute our LAN DNS server for the internal and external DNS services, which meant manually editing the /etc/resolv.conf file every time the DHCP lease was renewed on DHCP clients, if we needed to address local machines that didn’t use Dynamic DNS (a number of Linux distros do, but some do not–I personally don’t like DDNS because of the potential for name conflicts when you allow servers to name themselves).
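What we wanted on each client, and what the modem’s DHCP would not reliably hand out, was simply this (the nameserver address is an example standing in for the LAN DNS server):

```
# /etc/resolv.conf -- example values
search company.lan
nameserver 192.168.1.2
```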

So, the next step was to set up our own DHCP server–first turning off the DHCP service in the modem.  Having our own service allows us to specify the local DNS server, domain, search domain, and, better yet, to map the MAC addresses of various machines to static addresses outside the zone, to provide reserved addresses for those machines.  There is a definite advantage to having everything use DHCP–you no longer have to modify the network configuration on each machine if you move or add a service, you just change the service records in the DHCP server and renew the leases on the clients.  We’re using Ubuntu 10.10 Server Edition for DNS and DHCP, using BIND9 for name service software and DHCP3 for address assignment, running as a virtual server.  The setup was quite easy, but, then, we’ve been doing that for 15 years and had the DNS zone templates from the old site archived.  We’re used to hand-editing the files, but Webmin does a great job of guiding the new user through the setup process and managing the services afterward.
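A minimal dhcpd.conf sketch of this arrangement follows; all names, MAC addresses, and ranges are examples:

```
# /etc/dhcp3/dhcpd.conf -- example values throughout
option domain-name "company.lan";
option domain-name-servers 192.168.1.2;

subnet 192.168.1.0 netmask 255.255.255.0 {
  range 192.168.1.100 192.168.1.150;
  option routers 192.168.1.1;
}

# Reserved ("static") address for a server, keyed to its MAC address
# and placed outside the dynamic range:
host files {
  hardware ethernet 00:16:3e:aa:bb:cc;
  fixed-address 192.168.1.3;
}
```

The host block is the key trick: the machine still runs as an ordinary DHCP client, but it always receives the same address, so the DNS zone entry for it never goes stale.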

The next step in the process of taking charge of your own network is to configure the DSL or cable modem as a pass-through device, add a second network card to a spare machine, and build your own router, with a more reliable NAT and a more configurable firewall, plus a local NTP timeserver.  But, that’s a future project.  Right now, we’re evaluating network storage solutions for $CLIENT, and took time out to clean up and fix things to make that easier.


System Administration Rule #1: Thou Shall Not Lose Thy Customer’s Data

Garrison Keillor opens the monologue on his weekly radio variety show with “It’s been a quiet week in Lake Wobegon…”  Well, it has definitely not been a quiet month at Chaos Central, which accounts in part for a long silence in this forum.  It began with an ominous question from $CLIENT one Friday morning…

“How do you recover an Amanda backup set manually?”  OK, innocent enough.  Sometimes it is faster to do it that way, and easier than installing the client software to restore data to a different system than the one from which it was backed up.  My response: the answer is found in the header of the first archive file, and you’ll need to figure out how to reassemble a sliced dataset.  Amanda is a popular Open Source backup tool for Unix.  $CLIENT runs it as a disk-to-disk application, which we do as well here at Chaos Central.  USB disks are cheaper than tape jukebox systems by far.  We, meaning system administrators in general, back up data regularly, so as not to violate the first rule of system administration:  Thou shall not lose thy customer’s data.  Sometimes, customers delete files by accident, sometimes they delete files on purpose and later find out that was a mistake, and sometimes disks fail.  We don’t guarantee you can get all of your data back, but, usually, if it existed for at least a day, it’s on a backup tape or disk and we can retrieve it, all things being equal…

On this particular week, things went wrong, in a perfect storm.  The data in question had been archived, which is to say it was not part of the active dataset, so was not being backed up regularly.  The archive was stored offsite, but accessible through the network.  Following the principle of “trust but verify,” $CLIENT had kept a backup copy of the data before it was archived.  Data is like forensic evidence–in order to be trusted, it must be maintained through a chain of custody that is trusted.  And, in the spirit of Rule #1, there must be at least two complete copies at all times.

As feared, the off-site archive copy had become corrupted, and the off-site storage agency had no backup of it.  Not all of the files were damaged, but the missing ones were critical.  So, out comes the original backup, which had been preserved on-site for just this contingency.  However, being somewhat “aged,” the backup index had not been updated to the current software revision level, so file recovery through the normal program interface was not an option.  At this point, we have copy number 1 damaged, and the backup intact but stubbornly non-recoverable.  Perfect storms require multiple unlikely and possibly unrelated events to coalesce, and some definable human error in judgment in dealing with the combination, to become memorable bad examples and the stuff of books and movies.  Two unrelated events had converged now, and the ice was thinning rapidly, so to speak.  Back at Chaos Central, we were still blissfully unaware of the “whole story,” but would soon be drawn into the rescue operation.

The primary defense of Rule #1 is to ensure there are two verifiably good copies of any data at all times.  The correct response at this point would be to make another copy of the known good but inconveniently unusable backup before resorting to manual extraction measures.  And, that was the intent.  Except for a tiny flaw in the process, whereby a chunk of the backup (which, as we will see, was stored in individual blocks or slices–140,000 of them) was moved to the test area instead of copied.  At this point, the integrity of the sole remaining copy was compromised, but not yet beyond recovery.

What happened next involved yet another human error, caused first by an imperfect understanding of the exact semantics of the manual recovery procedure (partly due to exceptionally vague documentation) and second, by applying it in such a manner as to write into the directory containing the only copy of the critical first blocks of the archive.  The manual recovery procedure called for using the Unix ‘tar’ (tape archiver) command with a ‘-G’ option, which the manual says “handles the old GNU incremental format.”  Whatever that means.  Sounds innocuous, right?  A lot of these open source tools assume that you might be taking data from one system and importing it into another, and use the lowest-common-denominator functionality by default.  The word “incremental” implies “partial,” right?  So we should be safe using it.  No.  What it does, and what happened when our hapless S/A applied it with the target directory being the same directory containing the two archive blocks, was “remove files in the target directory that are not contained in the archive.”  That behavior is documented in the software design notes, but not in the user manual.  In my own cautious way, I don’t usually expand tar files into the same directory that contains the archive, as a matter of principle, so this normally wouldn’t be an issue for me, but the option would have been annoying at best if used for a partial restore in a directory containing other files.
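In hindsight, the habit that would have contained the damage is easy to sketch (this is a hypothetical helper, not anything from the Amanda or tar manuals): always extract into a freshly created scratch directory, so that even an option with surprising side effects has nothing of value to destroy.

```shell
#!/bin/sh
# Extract an archive into a brand-new scratch directory and print the
# directory's path.  Never extract into the directory that holds the
# only copy of the archive -- options like 'tar -G' can delete files
# there that are not in the archive.
safe_extract() {
  archive=$1
  scratch=$(mktemp -d) || return 1
  tar -x -f "$archive" -C "$scratch" || return 1
  echo "$scratch"
}
```

Anything worth keeping can be copied out of the scratch directory afterward; a mistake costs only the scratch copy.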

The effect was that, yes, the first few files in the archive were restored, but now the beginning blocks of the archive were inexplicably missing.  Inexplicably, that is, until the investigation and rescue operation initiated from Chaos Central discovered the horrible truth about what ‘-G’ does.  The tape archive itself consists of hundreds of thousands of files packed into one, which is then compressed.  Losing the first two pages of the archive lost not only the first few files of the data, but also the dictionary to translate the compressed gibberish into the first chapter.  A simple data recovery operation now escalated into a data repair operation, requiring some advanced skills and much research, and not a small amount of luck.

A bit of research on the Web showed that, yes, the dictionary in a gzipped file is reset from time to time, i.e., in compression blocks, with a general description of how one could, with trial and error, find an intact one in what was left of the broken dataset.  But no readily-available implementation of a solution, anywhere.  So, I wrote one, a short Perl script that doesn’t attempt to find the first block, but the first one that starts on a byte boundary (compression is by bits, not whole bytes, so not all blocks do).

#!/usr/bin/perl -w
# Repair a broken gzip file
# Input file is the tail of a corrupt gzip file.
# This script was written to recover a gzipped tar archive from
# a split gzipped file in which one or more segments are missing,
# such as a corrupt backup tape. Apply this to any segment.
# Script creates file "errmsg.txt"
# and uses newgzip.gz as a working filename
# for the recovery process
# NOTE:  Use this script as a guideline only--
#        adjust to fit your particular conditions.

# create a valid GZIP header (binary): magic bytes, deflate method,
# FNAME flag, 32-bit little-endian timestamp, no extra flags, Unix OS,
# then the null-padded filename
$ts=time();$header=pack("ccccVcca8",0x1f,0x8b,8,0x08,$ts,0,3,"newgzip");

# read the test slice into memory
open(RAW,"<$ARGV[0]") or die "Cannot open $ARGV[0]: $!\n";
$testfile = "";
while ( read(RAW,$buf,32768,0) > 0 ) {
 $testfile .= $buf;
 print ".";  # output a dot for each 32K block to show we're working
}
print "Read " . length($testfile) . " bytes\n";
close(RAW);

# search for a possible compression block boundary,
# shifting off the first byte after each failed attempt
while( length($testfile) ) {
 $fbyte = ord(substr($testfile,0,1)) & 7;
 # low three bits 100 = not-final block, dynamic Huffman tables
 if ( $fbyte == 4 ) {
  $testzip = $header . $testfile;
  open(RUN,">newgzip.gz") or die "Cannot write newgzip.gz: $!\n";
  $wb = syswrite(RUN,$testzip,length($testzip));
  close(RUN);
  system("/usr/local/bin/gzip -t newgzip.gz 2>errmsg.txt");
  open(ERR,"<errmsg.txt");
  read(ERR,$rtn,4096);
  close(ERR);
  $rtn =~ y/\n//d;
  print  length($testzip) . ": " . $rtn . "\r";
  # a bare CRC error means gzip decoded all the way to the end,
  # so we have found a usable block boundary
  if ( $rtn =~ /crc error/ ) {
   print "Success\n";
   rename "newgzip.gz", $ARGV[1];
   unlink "errmsg.txt";
   exit();
  }
 }
 $testfile = substr($testfile,1);
}
print "Failed to find a valid compression block in $ARGV[0]\n";
unlink "errmsg.txt";

Well, there it is.  It prints out some dots to show it is reading the file, then prints out where it is in the file, so the operator doesn’t get nervous about what it is doing to the data or how long it will take (a fairly long time, as you may search a long way into the file to find a usable key, and it writes and tests a file shortened by one byte at a time).

So, we got a valid chunk of a compressed file that we could graft onto the front of the remaining 139,996 chunks (we lost two, and had to search through two more before we found a compression block–you will lose some data with this procedure, but we were looking for specific files, and didn’t want to have to shift a terabyte of data bit by bit, so we looked for a block on a byte boundary).

But, now that we could decode the data, it was still gibberish, because it started in the middle of a file.  Fortunately, there was a Perl script available on the Internet (search for find_tar_headers.pl), which worked “out of the box” to find the start of the next file.  In this particular archive, we lost 30MB of compressed data off the front, and another 400MB of uncompressed data.  Fortunately, the files we needed to recover were further down in the archive.

For the remainder of the process, we added a dummy Amanda header to the recovered Gzip archive block to simplify the script we wrote that stripped off the Amanda headers from all of the chunks, concatenated them together (after discovering that the last block on each “tape” has to be discarded, as it truncates when EOT is reached and is restarted on the next tape), unzipped them, used ‘tail’ with an offset to strip off the unusable partial file from the front, and then extracted the files from the ‘repaired’ (but incomplete) tape archive.  The script ran for about 20 hours, working on the repaired 1.4 Terabyte backup dataset.  The files we lost off the front end were, fortunately, intact in the primary copy of the dataset, so we preserved Rule #1, not entirely by skill alone.
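In outline, with made-up chunk names and a placeholder byte offset (the real names, the header handling, and the offset came from the Amanda configuration and the find_tar_headers.pl output), the reassembly looked like:

```shell
#!/bin/sh
# Drop the 32KB Amanda header from the front of one chunk file.
strip_header() {
  dd if="$1" bs=32k skip=1 2>/dev/null
}

# The whole pipeline, roughly (the truncated last chunk of each "tape"
# was discarded beforehand; names and the tail offset are placeholders):
#   for c in chunk.*; do strip_header "$c"; done \
#     | gunzip -c \
#     | tail -c +400000001 \
#     | tar -x -f - -C /recovered
```

Streaming the chunks through the pipeline avoids ever materializing the full multi-terabyte concatenation on disk, which matters when the working space is a USB drive.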

But, this could have been prevented, obviously, by following a few simple steps:  first, as it says on the cover of the Hitchhiker’s Guide to the Galaxy, Don’t Panic.  Yes, the customer wants his data “right now,” but, given the choices of “soon” or “never,” I’m sure he would prefer the former.  Take time to plan each step of the recovery process, and carefully make a new plan if the first does not succeed.  Second, don’t use backup software to create archives.  Backups are meant to keep a copy of “live” data, that supposedly gets checked often enough to be correct.  And, the backup is refreshed from time to time and managed by a directory.  Archives, by nature, preserve at least two static copies for an indefinite time, and require quite different directory and validation/verification processes.  Third, if one of the two copies of your data is outside your immediate chain of custody, it isn’t valid–you need a third copy.  Fourth, if you need to experiment with the data, experiment on a copy, not the original.  Fifth, if you are in a data recovery operation where only one copy exists, use a team approach and double-check every step.  Plan what you are going to do, understand what the expected results of an operation will be, and make sure the operation is repeatable or reversible.  And sixth, research your options carefully.  We learned that damaged gzip files and damaged tar files are at least partially recoverable, but at considerable expense in time and effort.  Above all, be careful.  Rule #2 says “If you violate Rule #1, be sure your resume is up to date or you have another marketable skill.”

Tour 2011, part 3: Traffic Congestion on the Internet Highway

This week has been predicted as the week the IPv4 address space gets used up.  Or, at least, completely allocated.  But, that’s just one problem with the increased usage on the Internet.  With network address translation (common in home networks, wireless networks, and small businesses), millions of computers are hidden in private LAN address blocks, so the address space problem isn’t as dire as you might think.

But, as we travel across the country, we have had increasing issues with wireless interference.  There are so many access points in so many places, the chances of channel collisions are quite high.  In two of the cities we have stayed in so far, the wireless access points in two physically adjacent businesses have ended up on the same channels, resulting in enough interference to prevent Internet access.  Our computer gets an address assigned from the access point with the selected ESSID, but getting a DHCP lease via broadcast is very different from sustaining communications on a congested channel, so the net effect is no data transfer.

The solution we use at Chaos Central, where we, of course, control the wireless access point, is to select a different channel.  However, the staff at coffee shops and motels are neither trained nor authorized to perform this simple action, so they just shrug, mumble something about asking the manager to call the phone company, or guess that it is just normal traffic congestion on their network and tell customers to wait it out.

The phone company has similar issues.  Even if your internet service provider has a Class B network, only about 65,000 clients (2^16 = 65,536 addresses, less the reserved network and broadcast addresses) can be served concurrently.  Like the airlines and hotels overbooking, they can certainly sell more accounts than they have addresses for, with the likelihood that not all the account holders will have their computers on at the same time.  The tricky part is how to deal with the possibility that the address space (or local DHCP scope) gets completely used up.

In high-usage areas across the country, we have noticed that wireless connections get dropped frequently, either after a specific time period, say 30 to 45 minutes, or after a period of inactivity, typically 3 to 5 minutes.  This also seems to be happening on the DSL networks, as the IP address at Chaos Central shifts from time to time, even though the modem is always on.

Now, such arbitrary disconnections often go unnoticed if all you are doing is browsing web pages or reading email, but trying to run a VPN connection, encrypted tunnel, or large file upload or download through dropped connections is a trying experience.  A few weeks ago, this was happening at Chaos Central at an alarming rate: calls to the DSL provider’s tech support produced claims that “it must be your modem,” even though a DHCP client will normally ask for the same address if the connection drops.  However, the reconnect rate improved after the call, even before we replaced the modem (which was an old model anyway).  After a few weeks, the new modem still changes address now and then, indicating to the Unix Curmudgeon that the problem is not so much bad phone lines and cheap modems as traffic congestion on the network.

When a connection drops, whether because of deliberate disconnection due to timeouts or duration policies, if the modem gets a new address it is because the old one was issued to someone else in that brief time required to re-establish the connection.  Which, of course, means that there are more active accounts than available addresses, either in the network or in the DHCP zone for that circuit.

Obviously, in stressed economic times, things are going to get worse before they get better.  But, change is coming soon.  Like the splitting of states and metropolitan areas into multiple area codes to accommodate the explosion of “one person, one phone number” caused by Centrex dialing for businesses and personal cell phones, access to the Internet will undergo a revolution.  First, to implement IPv6, in which the hardware address of your computer’s network card can be embedded in the IP address, expanding the address space without network address translation, and second, to expand the wireless networking protocols to incorporate more channels and intelligent adaptive interference avoidance.

Meanwhile, we’re back to the early days of limited wireless access, seeking out connections where we can.  And, we deal with the mutable addresses at Chaos Central by having programs running that post the current address to a secure file location on our public web servers, so we can always login remotely.  Change is sometimes progress, and sometimes just an adventure.

Tour 2011, part 2: We Don’t Serve Their Kind Here

We’ve completed the first leg of Tour 2011, arriving in the City of Angels mid-day.  Our GPS took us to the Nice Person’s sister’s house, no problem.  The problem started when we needed to get connected to the Internet.

Sis and her hubby travel a lot, so they have cellular Internet access.  OK, we say, we’ll just plug the dongle into our computer and surf away.  But, no joy.  The USB device mounts as a disk drive, no problem.  But, the System Requirements (and provided drivers) are for Windows 7, Vista, XP, and 2000.  Period.  Guess what?  We’re the Unix Curmudgeon.  We run Linux!

Now, why does Verizon think that Linux users don’t need to use the Internet? Or, more precisely, why does Verizon think they can survive without the business of Linux users everywhere? 

OK, there are Linux packages that interface with the cellular modems, but they are not provided by the service vendors, nor do the vendors support use of the device on Linux systems.  And, they are relatively hard to find.  Essentially, the cellular modem is just that — a modem, so setting it up as a PPP device and adding the configuration and chat scripts is all that is necessary to use it.  But, it has been a long time since most long-time Linux users (like the Unix Curmudgeon) had to configure modem scripts by hand, and most new Linux users are blissfully unaware of such arcana.
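For the record, the hand configuration amounts to a pppd peers file and a chat script along these lines; the device name, speed, APN, and dial string below are examples only, and every carrier’s values differ:

```
# /etc/ppp/peers/cellular -- example values
/dev/ttyUSB0 460800
connect "/usr/sbin/chat -v -f /etc/ppp/chat-cellular"
noauth
defaultroute
usepeerdns
```

```
# /etc/ppp/chat-cellular -- example dialog
ABORT   'BUSY'
ABORT   'NO CARRIER'
''      'ATZ'
OK      'AT+CGDCONT=1,"IP","apn.example.com"'
OK      'ATD*99#'
CONNECT ''
```

Once those are in place, ‘pppd call cellular’ brings up the link, exactly as dial-up did two decades ago.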

The aggravation here is not being able to use a device out of the box.  Using the cellular modem requires getting files from somewhere else (on the Internet) before you can get on the Internet.  Humph.  There is always the coffee shop down the street, but not being able to get Internet access almost anywhere is a major impediment to keeping up with work on the road.

