System Administration Rule #1: Thou Shall Not Lose Thy Customer’s Data

Garrison Keillor opens the monologue on his weekly radio variety show with “It’s been a quiet week in Lake Wobegon…”  Well, it has definitely not been a quiet month at Chaos Central, which accounts in part for a long silence in this forum.  It began with an ominous question from $CLIENT one Friday morning…

“How do you recover an Amanda backup set manually?”  OK, innocent enough.  Sometimes it is faster to do it that way, and easier than installing the client software just to restore data to a different system than the one it was backed up from.  My response: the answer is found in the header of the first archive file, and you’ll need to figure out how to reassemble a sliced dataset.  Amanda is a popular Open Source backup tool for Unix.  $CLIENT runs it as a disk-to-disk application, as we do here at Chaos Central; USB disks are far cheaper than tape jukebox systems.  We, meaning system administrators in general, back up data regularly, so as not to violate the first rule of system administration:  Thou shall not lose thy customer’s data.  Sometimes customers delete files by accident, sometimes they delete files on purpose and later find out that was a mistake, and sometimes disks fail.  We don’t guarantee you can get all of your data back, but, usually, if it existed for at least a day, it’s on a backup tape or disk and we can retrieve it, all things being equal…
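For anyone who hasn’t had to do this: each dump file Amanda writes starts with a plain-text header block (32 KiB by default on the versions I’ve dealt with) that names the host, disk, dump level, and compression program, and even spells out a suggested restore pipeline.  A minimal sketch of pulling that header out of the first chunk, assuming the default block size (adjust to your own configuration):

#!/usr/bin/perl -w
# Minimal sketch: print the text header from the front of an Amanda
# dump chunk.  Assumes the default 32 KiB header block -- adjust to
# match your Amanda configuration.
use strict;

my ($chunk) = @ARGV;
die "usage: $0 first-chunk-file\n" unless defined $chunk;

open(my $fh, "<", $chunk) or die "cannot open $chunk: $!\n";
binmode($fh);
read($fh, my $header, 32768) or die "cannot read header: $!\n";
close($fh);

$header =~ s/\0+$//;   # the header block is padded out with NUL bytes
print $header, "\n";   # includes the suggested restore command line

The interesting part is the tail end of that header text, which tells you how to skip past the header itself and what to pipe the rest of the data through.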

On this particular week, things went wrong, in a perfect storm.  The data in question had been archived, which is to say it was not part of the active dataset, so it was not being backed up regularly.  The archive was stored offsite, but accessible through the network.  Following the principle of “trust but verify,” $CLIENT had kept a backup copy of the data before it was archived.  Data is like forensic evidence: to be trusted, it must be maintained through a trusted chain of custody.  And, in the spirit of Rule #1, there must be at least two complete copies at all times.

As feared, the off-site archive copy had become corrupted, and the off-site storage agency had no backup of it.  Not all of the files were damaged, but the missing ones were critical.  So, out comes the original backup, which had been preserved on-site for just this contingency.  However, being somewhat “aged,” the backup index had not been updated to the current software revision level, so file recovery through the normal program interface was not an option.  At this point, we have copy number one damaged, and the backup intact but stubbornly non-recoverable.  Perfect storms require multiple unlikely, possibly unrelated events to coalesce, plus some definable human error in judgment in dealing with the combination, before they become memorable bad examples and the stuff of books and movies.  Two unrelated events had converged now, and the ice was thinning rapidly, so to speak.  Back at Chaos Central, we were still blissfully unaware of the “whole story,” but would soon be drawn into the rescue operation.

The primary defense of Rule #1 is to ensure there are two verifiably good copies of any data at all times.  The correct response at this point would be to make another copy of the known-good but inconveniently unusable backup before resorting to manual extraction measures.  And that was the intent, except for a tiny flaw in the process, whereby a chunk of the backup (which, as we will see, was stored in individual blocks or slices, 140,000 of them) was moved to the test area instead of copied.  At this point, the integrity of the sole remaining copy was compromised, but not yet beyond recovery.

What happened next involved yet another human error, caused first by an imperfect understanding of the exact semantics of the manual recovery procedure (partly due to exceptionally vague documentation) and second, by applying it in such a way as to write into the directory containing the only copy of the critical first blocks of the archive.  The manual recovery procedure called for using the Unix ‘tar’ (tape archiver) command with a ‘-G’ option, which the manual says “handles the old GNU incremental format.”  Whatever that means.  Sounds innocuous, right?  A lot of these open source tools assume you might be taking data from one system and importing it into another, and use lowest-common-denominator functionality by default.  The word “incremental” implies “partial,” right?  So we should be safe using it.  No.  What it also does on extraction, and what happened when our hapless S/A applied it with the target directory set to the same directory containing the two archive blocks, was “remove files in the target directory that are not contained in the archive.”  That behavior is spelled out in the software design notes, but not in the user manual.  In my own cautious way, I don’t expand tar files into the same directory that contains the archive, as a matter of principle, so this normally wouldn’t have been an issue; even so, used for a partial restore in a directory containing other files, it would have been annoying at best.
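For what it’s worth, the habit that would have avoided the whole mess is cheap to automate.  The sketch below is illustrative only (the paths and names are made up, not what we used): it extracts an incremental-format archive into a freshly created, empty scratch directory, where the “remove files not in the archive” behavior has nothing else to chew on.

#!/usr/bin/perl -w
# Illustrative sketch only: extract a (possibly incremental-format) tar
# archive into an empty scratch directory, never into a directory that
# already holds anything you care about.
use strict;
use File::Temp qw(tempdir);

my ($archive) = @ARGV;
die "usage: $0 archive.tar\n" unless defined $archive;

# a brand-new, empty directory: tar -G has nothing here to delete
my $scratch = tempdir("restore-XXXXXX", TMPDIR => 1);

# -C switches into the scratch directory before extracting;
# -G tells GNU tar to treat the archive as old GNU incremental format
system("tar", "-x", "-G", "-f", $archive, "-C", $scratch) == 0
    or die "tar exited with status $?\n";

print "Restored under $scratch; inspect and copy out what you need.\n";

Even when you trust the flags, an empty directory costs nothing, and it keeps a surprise like this one away from the only copy of anything.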

The effect was that, yes, the first few files in the archive were restored, but now the beginning blocks of the archive were inexplicably missing.  Inexplicably, that is, until the investigation and rescue operation initiated from Chaos Central discovered the horrible truth about what ‘-G’ does.  The tape archive itself consists of hundreds of thousands of files packed into one, which is then compressed.  Losing the first two pages of the archive lost not only the first few files of the data, but also the dictionary needed to translate the compressed gibberish into a readable first chapter.  A simple data recovery operation had now escalated into a data repair operation, requiring some advanced skills, much research, and not a small amount of luck.

A bit of research on the Web showed that, yes, the dictionary in a gzipped file is reset from time to time, i.e., at compression block boundaries, along with a general description of how one could, with trial and error, find an intact block in what was left of the broken dataset.  But there was no readily available implementation of a solution, anywhere.  So I wrote one: a short Perl script that doesn’t attempt to find the first remaining block, just the first one that starts on a byte boundary (compression works in bits, not whole bytes, so not all blocks do).

#!/usr/bin/perl -w
# Repair a broken gzip file
# Input file is the tail of a corrupt gzip file.
# This script was written to recover a gzipped tar archive from
# a split gzipped file in which one or more segments are missing,
# such as a corrupt backup tape. Apply this to any segment.
# Script creates file "errmsg.txt"
# and uses newgzip.gz as a working filename
# for the recovery process
# NOTE:  Use this script as a guideline only--
#        adjust to fit your particular conditions.

# create a valid GZIP header (binary): magic bytes, deflate method, FNAME flag,
# timestamp (native 'l', little-endian on x86), XFL=0, OS=3 (Unix), 8-byte name
$ts=time();$header=pack("cccclcca8",0x1f,0x8b,8,0x08,$ts,0,3,"newgzip");

# open test slice and search for possible compression block boundary,
# shifting off first byte.
open(RAW,"<$ARGV[0]");
while ( read(RAW,$buf,32768,0) > 0 ) {
 $testfile .= $buf;
 print ".";  # output a dot for each 32K block to show we're working
}
print "Read " . length($testfile) . " bytes\n";
close(RAW);
# march through the data one byte at a time until a candidate block is found
while( length($testfile) ) {
 # low three bits of a deflate block header: BFINAL (1 bit), then BTYPE (2 bits),
 # packed least-significant-bit first; a value of 4 (binary 100) means
 # "not the final block, dynamic Huffman codes" -- a plausible block start
 $fbyte = ord(substr($testfile,0,1)) & 7;
 if ( $fbyte == 4 ) {
  $testzip = $header . $testfile;
  open(RUN,">newgzip.gz");
  $wb = syswrite(RUN,$testzip,length($testzip));
  close(RUN);
  system("/usr/local/bin/gzip -t newgzip.gz 2>errmsg.txt");
  open(ERR,"<errmsg.txt");
  read(ERR,$rtn,4096);
  close(ERR);
  $rtn =~ y/\n//d;
  print  length($testzip) . ": " . $rtn . "\r";
  if ( $rtn =~ /crc error/ ) {
   print "Success\n";
   rename "newgzip.gz" $ARGV[1];
   unlink "errmsg.txt";
   exit();
  }
 }
 $testfile = substr($testfile,1);
}
print "Failed to find a valid compression block in $ARGV[0]\n";
unlink "errmsg.txt";

Well, there it is.  It prints a dot for each 32K block as it reads the file, then prints how far it has worked into the data, so the operator doesn’t get nervous about what it is doing to the data or how long it will take (a fairly long time, since you may have to search a long way into the file to find a usable block, and each candidate means writing out and test-decompressing a file only one byte shorter than the last).
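For completeness: the script takes the surviving data as its first argument and a name for the repaired output as its second, so an invocation looks something like perl repair-gzip.pl damaged-slice repaired.gz (those file names are placeholders, not the ones we actually used), and the repaired .gz it leaves behind is what gets grafted onto the rest of the chunks in the next step.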

So, we got a valid chunk of a compressed file that we could graft onto the front of the remaining 139,996 chunks (we lost two, and had to search through two more before we found a compression block; you will lose some data with this procedure, but we were looking for specific files and didn’t want to shift a terabyte of data bit by bit, so we settled for a block on a byte boundary).

But, now that we could decode the data, it was still gibberish, because it started in the middle of a file.  Fortunately, there was a Perl script available on the Internet (search for find_tar_headers.pl) that worked “out of the box” to find the start of the next file.  In this particular archive, we lost 30MB of compressed data off the front, and another 400MB of uncompressed data, but the files we needed to recover were further down in the archive.
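If you can’t find that script, the idea behind it is straightforward enough to reproduce.  A ustar header is a 512-byte block with the magic string “ustar” at offset 257 and, at offset 148, an octal checksum equal to the sum of the block’s bytes with the checksum field itself treated as spaces.  Because our recovered data resumes at an arbitrary point in the tar stream, the scan has to go byte by byte.  The following is my own rough sketch of the technique, not the script we actually ran:

#!/usr/bin/perl -w
# Rough sketch: find the first plausible ustar header in a stream that
# starts at an arbitrary offset (e.g., data recovered from a damaged
# gzip).  Prints the 0-based byte offset of the header; add one and
# feed it to 'tail -c +N' (which counts from 1) to trim off the front.
use strict;

my ($file) = @ARGV;
die "usage: $0 recovered-stream\n" unless defined $file;

open(my $fh, "<", $file) or die "cannot open $file: $!\n";
binmode($fh);

my ($buf, $base) = ("", 0);
while (read($fh, my $chunk, 1 << 20)) {
    $buf .= $chunk;
    my $pos = 0;
    # hunt for the magic, then verify the whole 512-byte header
    while (($pos = index($buf, "ustar", $pos)) >= 0) {
        my $start = $pos - 257;        # header begins 257 bytes before the magic
        if ($start >= 0 && length($buf) - $start >= 512) {
            my $block = substr($buf, $start, 512);
            my ($stored) = substr($block, 148, 8) =~ /([0-7]+)/;
            substr($block, 148, 8) = " " x 8;   # checksum field counts as spaces
            my $sum = 0;
            $sum += $_ for unpack("C*", $block);
            if (defined $stored && $sum == oct($stored)) {
                print "plausible tar header at byte offset ", $base + $start, "\n";
                exit 0;
            }
        }
        $pos++;
    }
    # keep a little overlap so a header split across reads isn't missed
    if (length($buf) > 1024) {
        $base += length($buf) - 1024;
        $buf = substr($buf, -1024);
    }
}
print "no tar header found in $file\n";
exit 1;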

For the remainder of the process, we added a dummy Amanda header to the recovered gzip block so it could be handled just like the other chunks.  The script we wrote then stripped the Amanda headers off all of the chunks and concatenated them together (after we discovered that the last block on each “tape” has to be discarded, since it is truncated when EOT is reached and restarted on the next tape), unzipped the result, used ‘tail’ with an offset to strip the unusable partial file off the front, and finally extracted the files from the “repaired” (but incomplete) tape archive.  The script ran for about 20 hours, working on the repaired 1.4-terabyte backup dataset.  The files we lost off the front end were, fortunately, intact in the primary copy of the dataset, so we preserved Rule #1, not entirely by skill alone.
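That production script belongs to the client, but the core of the reassembly step is simple enough to sketch.  The version below is a simplification: it assumes the default 32 KiB Amanda header on every slice and leaves out the grafted front block and the discard-the-truncated-last-slice-per-tape logic described above.

#!/usr/bin/perl -w
# Simplified sketch of the reassembly step: strip the (assumed 32 KiB)
# Amanda header from each slice and stream the rest, in order, through
# gzip into a single recovered tar file.  The real script also handled
# the grafted replacement block and the truncated end-of-tape slices.
use strict;

my @slices = @ARGV;                  # slice files, in the right order
die "usage: $0 slice1 slice2 ...\n" unless @slices;

open(my $gz, "|-", "gzip -dc > recovered.tar")
    or die "cannot start gzip: $!\n";
binmode($gz);

for my $slice (@slices) {
    open(my $in, "<", $slice) or die "cannot open $slice: $!\n";
    binmode($in);
    read($in, my $junk, 32768);      # discard the Amanda header block
    while (read($in, my $buf, 1 << 20)) {
        print {$gz} $buf;
    }
    close($in);
}
close($gz) or warn "gzip complained -- expected when the front of the stream is missing\n";

From there, ‘tail’ with the offset found above trims the unusable partial file, and ordinary tar extraction (into a scratch directory, this time) pulls out the files.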

But, this could have been prevented, obviously, by following a few simple steps.  First, as it says on the cover of the Hitchhiker’s Guide to the Galaxy, Don’t Panic.  Yes, the customer wants his data “right now,” but, given the choice between “soon” and “never,” I’m sure he would prefer the former.  Take time to plan each step of the recovery process, and carefully make a new plan if the first does not succeed.  Second, don’t use backup software to create archives.  Backups are meant to keep a copy of “live” data, which supposedly gets checked often enough to be correct, and the backup is refreshed from time to time and managed by a directory.  Archives, by nature, preserve at least two static copies for an indefinite time, and require quite different directory and validation/verification processes.  Third, if one of the two copies of your data is outside your immediate chain of custody, it isn’t valid; you need a third copy.  Fourth, if you need to experiment with the data, experiment on a copy, not the original.  Fifth, if you are in a data recovery operation where only one copy exists, use a team approach and double-check every step: plan what you are going to do, understand what the expected results of an operation will be, and make sure the operation is repeatable or reversible.  And sixth, research your options carefully.  We learned that damaged gzip files and damaged tar files are at least partially recoverable, but at considerable expense in time and effort.  Above all, be careful.  Rule #2 says “If you violate Rule #1, be sure your resume is up to date or you have another marketable skill.”
