Monday, August 22, 2011

The difficult thing about back-ups


My back-up drive failed recently.

It was a lucky coincidence that I noticed this; despite grand intentions I’m lazy when it comes to backing things up and only think about it when I am panicking about typing ‘rm –rf’ in the wrong directory.

The drive failed when I was moving it across my study; I powered it down, unplugged it, moved it, plugged it back in and then fired it up again. Except it declined to fire up. It made a few pathetic wheezing sounds and gave up. After a few days of online searching I admitted defeat and contacted Lacie who eventually acknowledged it was faulty and issued an RMA. Four weeks later I had a repaired working drive in place (albeit having lost all of the original data). I am now able to return to my previous state of blissful ignorance.

The point of this little story? In a nutshell I think this summarises many enterprise’s attitude to back-ups. I know from personal experience of two organisations who notionally had a standard back-policy with regular full and incremental back-ups, which when the back-ups were needed, could not be retrieved because the back-up hardware had failed.

Then there is the recent case of Amazon. Running infrastructure as complicated and sophisticated as they do is fraught with risk so if I was one of their customer’s I would probably be thinking about having an iron-clad business continuity plan in place, but apparently even here back-up complacency reigns.

Why is this a big deal? Normal production systems are tested thoroughly prior to go-live and then tested on an on-going basis through production use. Any problem will be automatically detected or signalled by a user fairly quickly. However since back-up systems are only invoked by exception, the first time you know there is a problem is when you need them. The answer? Well, regular testing of a full restore from back-up seems like the obvious solution. There are some products on the market which claim to help but I remain somewhat sceptical of their efficacy.

Am I eating my own dog food? Alas I have regressed to my pre-failure days and have adopted the macho “my hardware never fails and I never type rm –rf by mistake” attitude. Some people never learn.