Thursday, September 16, 2010

When Systems Fail

This week I was a direct victim of a systems failure that set me thinking about how even mundane activities that we have been doing for several tens, if not hundreds, of years, like checking into a hotel  in this case, rely on systems that we take for granted and which, when they fail, throw everything into complete chaos.

It's a long and not particularly interesting story but in summary I checked into one of the large chain  hotels, which I use a lot, only to find when I opened my room door that the room was in a state of complete chaos and had clearly not been visited by housekeeping that day. On trying to change to another room I was told the system had been down since 4am (it was now 8pm) that morning and the staff could not tell what state rooms were in. Clearly not a great state of affairs and not great for client relations (there were a lot of grumpy people queueing in reception, some of whom I would guess would not be going back to that hotel). So what would an architect of such a system do to mitigate against such a system failure?
  1. I don't profess to know too much about how hotel management systems work and whether they are provided centrally or locally however I would have thought one of the basic non-functional characteristics of such systems would have been a less than one hour recovery following a system failure (not 16 hours and counting). Learning point: Clarify your availability non-functional requirements (NFRs) and realise them in the relevant parts of the system. Maybe not all components need to be highly available (checking in a customer maybe more important then checking her out for example) but those that are need to be suitable 'placed' on high-availability platforms.
  2. There was a clear and apparent need for a disaster recovery plan that involved more than the staff apologising to customers. Learning point: Have a disaster recovery policy and test it regularly.
  3. A system is about more than just the technology; the people that use the system are a part of it as well. Learning point: The architecture of the system should include how the people that use that system interact with it during both normal and abnormal operating conditions.
  4. Often NFRs are not quantified in terms of their business value (or cost). When a problem occurs is the impact to the business (in terms of lost revenue, irate customers who won't come back etc) really understood? Learning point: Risk associated with not meeting NFRs needs to be quantified so the right amount of engineering can be deployed to address problems that may occur when NFRs are not met.
Formal approaches to handling non-functional requirements in a systems architecture are a little thin on the ground. One approach suggested by the Software Engineering Institute is through the use of architectural tactics. An architectural tactic is a way of satisfying a desired system quality (such as performance) by considering the parameters involved (for example desired execution time), applying one or more standard approaches or patterns (such as scheduling theory or queuing theory) to address potential combinations of parameters and arriving at a reasoned (as opposed to random) architectural decision. Put another way an architectural tactic is a way of putting a bit of “science” behind the sometime arbitrary approach to making architectural decisions around satisfying non-functional requirements.

I think this is a field that is ripe for more work with some practical examples required. Maybe a future hotel management system that adopts such an approach to during its development will allow a smoother check-in process as well.

1 comment:

  1. Maybe they had figured (rightly or wrongly) that losing the repeat business of a proportion of the occasional hotel-full of customers costs less than the costs of implementing a full DR and contingency plan?