[H-GEN] Tape Drives

James McPherson - TSG Engineer James.McPherson at Sun.COM
Sun Jan 6 19:00:37 EST 2002


[ Humbug *General* list - semi-serious discussions about Humbug and  ]
[ Unix-related topics.  Please observe the list's charter.           ]
[ Worthwhile understanding: http://www.humbug.org.au/netiquette.html ]


On 06 Jan 2002, 05:31:57 PM Robert Brockway wrote:
[snip]
> It is possible to make a restore procedure so straight forward that you
> can do it easily under pressure when dog tired, or that a junior admin can
> do it because you're on holidays in the Amazon Basin.  The procedure has
> to cover these eventualities because they are real eventualities.

Experience with many types of disaster recovery helps to clarify your ideas
on what needs to be done ;| Fr'instance, an inadvertent rm -rf on a directory
above the one you want to work with (shared, of course), where the permissions
are commonly set to 0777 (rather than 1777)... 60 GB of data later the
admin discovers that he has to go back _two_ weeks rather than one because
his last full backup didn't quite work....
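The 0777-versus-1777 point above is the sticky bit: on a world-writable shared
directory with mode 1777, only a file's owner (or root) can remove it, so one
user's stray rm -rf can't take out everyone else's files. A minimal sketch
(the temporary directory is purely illustrative):

```shell
# Sticky-bit demonstration: 1777 rather than 0777 on a shared directory.
# With the sticky bit set, only a file's owner (or root) may delete or
# rename entries, which limits the blast radius of an accidental rm -rf.
d=$(mktemp -d)
chmod 1777 "$d"                      # world-writable, but sticky
perm=$(ls -ld "$d" | cut -c1-10)
echo "$perm"                         # trailing 't' marks the sticky bit
rmdir "$d"
```

The trailing `t` in `drwxrwxrwt` (as on /tmp) is what distinguishes 1777 from
a plain 0777 directory.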

> To be honest I'm not sure why people seem so resistant to the idea that
> backups should be simple.  Too often I see people implement elaborate
> backup routines (Jason none of this applies to you guys) but don't regard
> how they will restore under optimal conditions, let alone poor conditions.

I've come across this also. Sometimes it's easy to get things back up and
running, sometimes it isn't. Of course, you do occasionally get the customer
who actually has a very simple, efficient and effective backup regime. However
it seems to be an instance of Sod's Law that when these guys-n-gals call in
they are _really_ in trouble.

> A restore should never get more complicated than having to boot off the
> install media, pull a few tapes & restore the needed data.  Having to have
> special apps installed just to do a restore (which seems to be the case
> with certain commercial backup packages) or having to rebuild the OS just
> to be able to restore (which is the case with at least some NT backup
> packages) doesn't cut it. That's why I recommend people do DR tests.  Get
> an old box with compatible h/w and try to restore the server to it.  See
> if the logic fails. 

At my first job in Sydney we had our own home-grown scripts running on the
major compute- and file-servers for the university. We made sure that we 
formally tested our procedures in a controlled manner at least every 6 months
(ever wanted an auditor's tick?). Since coming to Sun I've learnt more about
DR than I ever really wanted to know: if you use netbackup and an nbu server
dies - rebuild the OS on the server, reinstall nbu and then read in your tapes
from the library. Of course, the level of effort that you put into designing
your dr system and solution depends on how much you can get done for you by
using a pre-built package (amanda, netbackup, networker, arcserve et al), and
what sort of user environment you have to configure. At UQ Library I wrote
scripts around amanda and used two dds3 tape drives. At Macquarie I tuned the
existing scripts (ufsdump is good, ufsdump is great) for new machines and changes
that the dbas made (oracle hot backups are fantastic for DR), and we used a 
combination of exabyte 8mm exb-8505xl, dlt4000 and dlt7000 drives. We use large
tape libraries internally here, and we don't script the backups ;| 
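The home-grown scripts mentioned above were ufsdump wrappers of a fairly
standard shape. A skeletal sketch of the idea - the device path, filesystem
list and dry-run switch are my illustrative assumptions, not the actual
scripts from UQ or Macquarie:

```shell
#!/bin/sh
# Skeletal nightly full-dump wrapper in the spirit of the scripts described.
# Assumes Solaris ufsdump and a no-rewind tape device; DRYRUN=1 (the
# default here) just prints the commands instead of touching a drive.
TAPE=/dev/rmt/0n
FILESYSTEMS="/ /usr /export/home"
DRYRUN=${DRYRUN:-1}

for fs in $FILESYSTEMS; do
    # Level 0 dump, 'u' records the dump date in /etc/dumpdates so that
    # later incremental levels (1-9) know what to pick up.
    cmd="ufsdump 0uf $TAPE $fs"
    if [ "$DRYRUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd || { echo "dump of $fs failed" >&2; exit 1; }
    fi
done
# mt -f /dev/rmt/0 rewoffl   # rewind and eject on a real run
```

Rotating the level-0/incremental schedule and verifying the tapes afterwards
is where the real (and documentable) work in such scripts lives.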

That's another thing - training for your sysadmins and operators is essential.
Having written scripts around amanda, I then had to spend hours writing doco
(it was only about 10 pages but didn't include any low-level details like Jason's
doco) and another two to three hours training the operators. If we had used
something like Solstice Backup (rebadged Legato Networker) or Veritas NetBackup
then the commercially-operated courses would have been mandatory, as would a test
system for the operators to get comfortable with. 


> To sum up my opinion on this: Backup procedures need to follow the KISS
> principle more than most things.  They need to be simple or their
> complications will come to bite you at the worst possible time.
> A general assessment of the network as a whole may be needed to ensure the
> backup procedure is rational. Eg, do you need separate tape units on each
> server?  Can you backup several systems across the network to a single
> tape?  If you do, how do you restore that data to the (remote) system
> without having to jump through hoops to do so?

Other questions to ask:
If you have several servers to back up, each with company-critical data on them,
is it worthwhile acquiring a tape library and fast (100FDX or 1000FDX) network
connections to centralise the backups? How quickly must you be able to restore
the data to get a minimally working (ie, for end users) system available? 
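On the quoted question of backing up several systems across the network to a
single tape: the classic answer is dump-over-a-pipe to the machine with the
drive attached, and the restore is the same pipe in reverse - no special apps
needed. A sketch, with hypothetical host and device names (rsh here for
period flavour; ssh is the same shape):

```shell
#!/bin/sh
# Sketch of centralised backup to a single remote tape drive.
# TAPEHOST, the device path and the block size are illustrative
# assumptions; built as strings here rather than executed.
TAPEHOST=tapehost            # the machine the drive is attached to
TAPE=/dev/rmt/0n             # no-rewind device on $TAPEHOST
FS=/export/home

# Backup: ufsdump writes to stdout, dd on the tape host writes the tape.
backup_cmd="ufsdump 0f - $FS | rsh $TAPEHOST dd of=$TAPE obs=63k"

# Restore: read the tape remotely, feed ufsrestore on the client -
# bootable off install media, no backup package required.
restore_cmd="rsh $TAPEHOST dd if=$TAPE ibs=63k | ufsrestore rf -"

echo "$backup_cmd"
echo "$restore_cmd"
```

The restore side is the test that matters: if you can't run that second pipe
from a freshly booted install image, the centralised scheme has failed the
KISS criterion.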

One final thing - we once had a hot call from a customer in Sydney - they had
suffered a three-disk failure in their array (software raid-5 btw), and after
we replaced the disks using the standard procedures they were unable to get their
database back online. The customer's response to the inevitable "where are your 
backup tapes" was the immortal

We don't need backup tapes, we've got our data protected by Raid-5.


they got the db back online 7 days later ;|


James 

-- 
TSG Engineer (Kernel/Storage)           828 Pacific Highway
APAC Customer Care Centre               Gordon NSW 
Sun Microsystems Australia              2072

Failfast panic: those controlling voices in my head have 
stopped telling me what to do.....

Read about the VOS Initiative at http://www.vosinitiative.com

