
Risky Business - What Could Possibly Go Wrong?

Nigel Adams

Updated: May 13, 2019

Service failure is inevitable, but the impact can be mitigated.


Benjamin Franklin clearly never ran services. If he had, he would have amended his famous quote “...in this world nothing can be said to be certain, except death and taxes” by adding "and service failure".

Over the last few weeks I’ve watched two organisations that suffered large-scale service failures being pilloried in the press and suffering the ignominy of political intervention.

In the first case, TSB - a UK challenger bank with 5m customers - migrated 1.3bn customer records to a new platform after years of planning. The bank declared victory on the weekend of the migration, but subsequent headlines told a different story.



In the second case, Optus - Australia’s second largest telco - heavily promoted a World Cup streaming service. Unfortunately, it didn't quite work out as planned.



It may be that there are the operational equivalents of the "firefighter by day, arsonist by night", who prefer the adrenaline rush of managing incidents over the serenity of stable, in-control processes. However, I believe that the vast majority of ops leaders break out in a cold sweat at the mere mention of this type of failure.


Both organisations are large, have significant resources at their disposal, had time to plan and a clear outcome in mind, put tens of thousands, if not hundreds of thousands, of hours into their projects and had, presumably, what was considered to be good governance in place before "go live". And yet both have suffered extensive reputational damage due to service failure.


What this reinforces to me is that managing, and taking accountability for, operational risk is one of the most fundamental requirements of an ops leader. It cannot be outsourced. A reliance on an assurance, 2nd line risk or audit team to identify the failure points is both futile and fanciful. It is only those who work with a process every day that truly understand what can and cannot go wrong. And when it does go wrong, it’s the ops and tech teams that fix it, not the risk teams. So, better to get it right first time or you’re creating a whole heap of extra pressure and work for your team.


The examples above were clearly complex projects; I empathise with those involved and do not want to trivialise the problems they faced. I cannot exclaim “if only they’d done this”, as I have no more understanding of the underlying problems than what has been published in the media.


However, there are some things that spring to mind around risk culture and risk mindset that I feel are lacking in many organisations.


For me, there are six non-negotiables when it comes to managing operational risk:

  1. Enterprise-wide Process Governance Model: I've always found it surprising how few senior execs can answer how many processes they are accountable for, when processes are the primary building blocks of an organisation. Everyone should be able to "see" the processes: who is accountable for them, which locations they are operated from, who acts on them, which customer journeys and services they support, how they are linked end-to-end, what systems they use, what risks are associated with them, what controls are in place to mitigate those risks, how the high-level processes are linked to detailed work instructions and the relevant metrics that describe how they perform (a minimal sketch of what a single register entry might capture follows this list). This may seem a daunting task, but it can be chunked down and, remember, if you don't do this, how can any senior exec hope to have visibility and control over their organisation?

  2. Failure Modes and Effects Analysis: Operational risk events never happen at the high level. They happen as a consequence of poor process design in the detail. To truly understand operational risk, those closest to the process (not the senior execs) need to document every way every process step can fail. Then, for each failure mode, assess the likelihood of it happening, the probability of detection and the impact if it does happen. Armed with this data, you can not only prioritise which failure modes to tackle first, but also do so knowing that each one is linked to the high-level risks through the model in point 1 (a worked prioritisation sketch follows this list). While there are always the “unknown unknowns”, going through this form of analysis is a significant step forward and, if it isn’t time-consuming and arduous to complete, you’re not doing it right.

  3. Embedded Process Design Principles: Now that you have prioritised which failure modes need work, the first choice is to re-design the processes so they cannot fail. This is another source of surprise for me - why do so many organisations allow people with zero qualifications in process design to change their processes? They are the organisational equivalent of weapons of mass destruction. At the very least, there should be a clear set of actionable process design principles that are adhered to whenever any process is changed.

  4. Real Controls & Mitigants: It’s not always possible to design processes so they cannot fail, so the next step is to determine how you can contain the damage if a process fails - like an operational insurance policy. Whether it's through laziness or lack of knowledge, I have seen far too many controls designed as after-the-fact checks and remediation actions such as “train the operator” - which is fine for the next few weeks, until the operator moves on to another role and you are left with the same problem. True mitigants are not susceptible to the vagaries of who’s turned up for work today.

  5. Scenario Playbooks: It is both practically and commercially impossible to prevent all failure modes. In some cases, the lowest-cost option is to devise effective response scenarios. Creating a playbook that focuses on triaging the problem quickly and effectively, and provides a range of pre-prepared and rehearsed responses, dramatically reduces your time to respond. There is a famous anecdote from the 9/11 war room, when the emergency response teams stated that they hadn’t trained for this type of event. The response was: “yes we have, it’s all of the scenarios at the same time!” Whatever you do, once you have developed the scenarios, make sure you schedule frequent dry runs and dress rehearsals.

  6. Monitoring & Analysis: The final step is to ensure you have both appropriate “listening posts” in place - monitoring your systems, your queues, your partners, quality, the availability of people and so on - and a way to measure your performance in this area. Mean Time to Fail and Mean Time to Recover are two metrics that I’ve found particularly useful (a sketch of calculating them from an incident log follows this list). The former should increase the better you are at root cause analysis and fixing issues at source. The latter should reduce as your ability to diagnose, respond and resolve improves.
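
To make point 1 a little more concrete, here is a minimal sketch of what a single entry in such a process register might capture. The field names and the example values are my own assumptions for illustration, not a prescribed data model.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessRecord:
    """One entry in a hypothetical enterprise-wide process register."""
    name: str                     # e.g. "Customer onboarding"
    accountable_exec: str         # the single named owner
    locations: list[str]          # where the process is operated from
    operators: list[str]          # teams or roles that act on it
    customer_journeys: list[str]  # journeys and services it supports
    linked_processes: list[str]   # end-to-end links to upstream/downstream processes
    systems: list[str]            # applications the process relies on
    risks: list[str]              # operational risks associated with it
    controls: list[str]           # controls in place to mitigate those risks
    work_instructions: list[str]  # links down to the detailed procedures
    metrics: dict[str, float] = field(default_factory=dict)  # how it performs

# Illustrative entry - every value below is made up.
onboarding = ProcessRecord(
    name="Customer onboarding",
    accountable_exec="Head of Retail Operations",
    locations=["Sydney", "Manila"],
    operators=["Onboarding team", "KYC team"],
    customer_journeys=["Open an account"],
    linked_processes=["Lead capture", "Account servicing"],
    systems=["CRM", "Core banking platform"],
    risks=["Identity verification failure"],
    controls=["Automated document checks"],
    work_instructions=["WI-014 Verify identity"],
    metrics={"cycle_time_days": 2.5},
)
```

Even a flat register like this answers the "how many processes am I accountable for?" question and gives the failure-mode analysis in point 2 something concrete to hang off.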
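
For point 2, one common way to turn the three assessments into a priority order - and my assumption here, since the post doesn't prescribe a scoring scheme - is a Risk Priority Number: rate severity, occurrence and detectability on a 1-10 scale and multiply them. The failure modes and scores below are invented for illustration.

```python
# Minimal FMEA prioritisation sketch: score each failure mode on a 1-10 scale
# for severity (impact), occurrence (likelihood) and detection (10 = hard to detect),
# then rank by the Risk Priority Number, RPN = severity * occurrence * detection.

failure_modes = [
    # (process step, failure mode, severity, occurrence, detection)
    ("Payment batch", "Duplicate file processed", 8, 3, 6),
    ("Data migration", "Record mapping drops a field", 9, 4, 7),
    ("Streaming login", "Auth service times out under load", 7, 6, 3),
]

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: higher means tackle it sooner."""
    return severity * occurrence * detection

# Rank the failure modes so the riskiest come first.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[2:]), reverse=True)

for step, mode, sev, occ, det in ranked:
    print(f"RPN {rpn(sev, occ, det):4d}  {step}: {mode}")
```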
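
And for point 6, here is a minimal sketch of how Mean Time to Fail and Mean Time to Recover could be calculated from a simple incident log; the log format and timestamps are invented for illustration.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: (failure time, recovery time), in chronological order.
incidents = [
    (datetime(2019, 3, 1, 9, 0),   datetime(2019, 3, 1, 11, 30)),
    (datetime(2019, 3, 20, 14, 0), datetime(2019, 3, 20, 14, 45)),
    (datetime(2019, 4, 18, 2, 0),  datetime(2019, 4, 18, 6, 0)),
]

# Mean Time to Recover: average outage duration per incident.
mttr_hours = mean(
    (recovered - failed).total_seconds() / 3600 for failed, recovered in incidents
)

# Mean Time to Fail: average running time from recovering after one incident
# to the next failure occurring.
uptimes_hours = [
    (incidents[i + 1][0] - incidents[i][1]).total_seconds() / 3600
    for i in range(len(incidents) - 1)
]
mttf_hours = mean(uptimes_hours)

print(f"MTTR: {mttr_hours:.1f} hours")  # should fall as diagnosis and response improve
print(f"MTTF: {mttf_hours:.1f} hours")  # should rise as root causes are fixed at source
```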

Managing operational risk is not a “tick the box” exercise. In many ways it is a very tangible demonstration of operational capability (or lack thereof). The six points above may not have prevented the two spectacular failures described earlier, or sped up their recovery, but if every operational leader is clear about their accountability and aware of some of the simple tools available to help, the chances of negative outcomes and slow recovery are greatly reduced.
