I think that those of us working in architecture often overlook a concept that can be incredibly useful in helping our stakeholders reach well-informed decisions. It’s a concept that security architects and others in that space are well versed in, but which I believe can be applied much more universally in architecture: risk.
To explain, I recently provided some consulting for an organisation that had a fairly significant decision to make about its ERP solution. They were unsure whether the software was right for them going forward, and the hardware would also need to be upgraded if the software was to be kept.
As it turned out, the software was meeting their needs well; the challenge was with the hardware platform. When I looked at the deployment, the whole solution was running on a single server, which of course raises questions around availability and Disaster Recovery (DR).
The simple (lazy) thing to do would be to take the as-close-to-zero-risk approach and say, “you need to implement a second server, migrate to the cloud, or adopt some other strategy that gives you high availability and DR”. The problem for them, though, was that this would have created fairly significant extra cost. So how to decide what to do?
I think I’ve already made it fairly clear that the tool I used in this case was a risk-based approach. In particular, I used The Open Group’s Open FAIR risk analysis framework. So how did I go about doing this?
First, the risk I was trying to understand was the risk associated with a disaster recovery event. Even though their ERP is a core system, high availability was not actually a quality attribute (non-functional requirement) in their case: they could continue operating for a period of time without access to the ERP, and they could source replacement hardware easily. DR was a risk that did need to be explored, however, because (for example) a fire could put their single server out of commission for an extended period of time.
Open FAIR provides a very nice taxonomy and framework for thinking about risk. In the case of DR, it let me break the Loss Event Frequency apart from the Loss Magnitude (see the framework for the technical definitions of these terms) and calculate an actual dollar figure representing the annualised cost of a DR event taking place. This could then be compared against the estimated cost of purchasing and operating additional hardware, or of migrating the solution to the cloud.
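The core of that calculation is straightforward and can be sketched in a few lines. This is a minimal illustration of the Loss Event Frequency × Loss Magnitude idea only; every figure below is an invented assumption for demonstration, not a number from the actual engagement, and a real Open FAIR analysis would use ranges and scenarios rather than single point estimates.

```python
# Sketch of an annualised loss calculation in the Open FAIR style.
# All inputs are illustrative assumptions, not real client figures.

loss_event_frequency = 0.05    # assumed: one DR-scale event (e.g. fire) every 20 years
loss_magnitude = 200_000       # assumed: total cost per event (downtime, recovery, lost revenue)

# Annualised loss exposure: how much, on average, the risk "costs" per year.
annualised_loss = loss_event_frequency * loss_magnitude

print(f"Annualised loss exposure: ${annualised_loss:,.0f}")
```

The usefulness of the dollar figure comes entirely from making the two inputs explicit, so stakeholders can challenge the frequency and magnitude estimates separately.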
This is perfect for communicating with executive and board-level stakeholders: even though the annualised figure is based on a set of assumptions, those assumptions are explicit and the logic is clear, so they can compare apples with apples as far as possible. They are also well versed in the concept of risk.
As it happens, in this case, even though the cost of a DR event could be quite significant were it to occur, once its actual likelihood was taken into account and annualised, it became clear that the financial justification for investing in additional hardware or a cloud migration simply wasn’t there.
This of course wouldn’t be the case in many other organisations, but it was for them, and it wouldn’t have been clear without taking such a risk-based approach and using a framework like Open FAIR. From this experience, it occurs to me that a similar approach could be used in plenty of other situations where a decision is important but the preferred option is non-obvious.
I’d be curious to know whether others have followed this type of approach, and what success they’ve had doing so.