A couple of weeks back, Amazon Web Services went down. If you’re in the web business, you probably heard about it. You may even be in the rare group of people who read Amazon’s press release describing what went wrong. But if you noticed Foursquare or one of a bunch of other high-profile sites going a bit slowly for a few days, you probably saw the problem first hand.
Amazon’s “cloud” computing service is famously reliable: it has data spread across multiple data centres organised into multiple regions. But even with all these safeguards in place, there was still a way for a number of failures to bring down a substantial proportion of the service.
If your organisation is looking to the cloud to save money, be more flexible or use new tools, you’ll be understandably anxious about what happens when things go wrong. It’s useful to take a pragmatic approach in this scenario. IT people love to design systems with multiple backups built in and automatic handlers for failures, and they’ll love to try and work out exactly how reliable and automatic the end solution actually is. At the end of the day, though, there’s only one way to be sure. Your IT people might not be very comfortable with it, but it’s worth a try.
I first read about this approach a while ago in some IT publication or other whose details now escape me. The story goes that a new Chief Information Officer, starting work with a new team in a day-long workshop, was briefed by technician after technician about how all the servers in the business were reliable and backed up and would resolve problems automatically when they occurred. He kept asking one question: how sure are you that this will work when things go wrong? Of course the technicians were very confident that things would work and kept telling him so. When the technicians returned after lunch, they found in the middle of the conference room table a block of wood with a fire axe embedded in it. The new CIO said to the technicians, “Some time later today, I’m going to put that axe through one of the servers in this building. How confident are you that things will keep on working?” We can imagine that there was some further double checking after that.
Of course I’m not advocating that you destroy your equipment or risk life and limb, but it is very easy to simulate failures in computer equipment: you pull the plug out. When we commissioned a major new system recently, we deliberately built in the very highest level of automatic backup and failure management, with the expectation that if any device in the setup failed, the rest of the system would seamlessly pick up the slack. After all the specifications, sign-offs, commissioning and testing, everyone was very confident. When I asked to go and pull some plugs out, however, I was met with some resistance and a certain amount of double checking. In the end it was a very worthwhile exercise, because I now know exactly what happens when there’s a hardware failure in this key system: not all that much. The other devices sort the problem out and then get a human to come and replace the broken bit.
Of course Amazon’s problem was much more complex than this, but when it comes to your IT, how confident are you that you could pull plugs out? If you can get the IT team to the point where they’re happy to do this on a live system, you’ve come much further than most towards a really reliable IT provision.
Martin Campbell @ 14:30