The Reserve Bank of Australia (RBA) has issued a comprehensive autopsy of one of its rare outages.
The document, issued this week, details how a total electrical shutdown of the central bank’s headquarters and data centre – including all back-up power supplies – hobbled parts of its critical RITS system for around three hours.
Despite best-laid plans to limit any unanticipated incident to just two hours (the worst-case scenario), it appears part of the recovery delay that brought the central bank perilously close to delaying a welfare payments run in August 2018 flowed from a confluence of super-high security … and just plain bad luck.
The 67-page tome is a brutal lesson in how a fire control system test can go horribly wrong when a facility is purposely hardwired to prevent intrusions of all kinds (a Fort Knox scenario, in effect).
Three hours offline might not sound cataclysmic. But if you cut all power to a central bank at around 11am in the middle of a busy trading day, with a pension run slated for the evening, it really doesn’t get much worse.
The outage economy
When the RBA’s big iron tanks for a protracted period, especially its core Reserve Bank Information and Transfer System (RITS), it isn’t just bad; it’s potentially economy-outage bad in an age of electronic inter-dependence.
Central banks, as the name suggests, are the sovereign hub that keeps the other banks between the lines and makes sure their ledgers are square at the end of the trading day.
In the case of Australia, the RBA is also a transactional bank for the government, and it’s understood the August 2018 outage narrowly avoided delaying a Centrelink run. Then there are equities markets, currency markets, property settlements and more tapping in.
“Shortly before 11 am on Thursday 30 August, the Bank experienced a disruption to the power supplying the data centre at one of its sites. The outage was caused by the incorrect execution by an external party of routine fire control systems testing in that data centre, which initiated an unplanned shutdown of all primary and back-up power supplies supporting the data centre. The power loss abruptly cut off all technology systems operating from that data centre, including those supporting RITS,” the RBA’s assessment states.
While a heap of low-value settlement batches from eftpos and Mastercard had run at 9am, pretty well everything else hanging off RITS was hit when the power went off. The immediate priority for restoration was the Fast Settlements Service (FSS) that pumps the New Payments Platform (NPP).
The FSS was stood back up in three hours, but the benchmark the RBA is tied to is two hours, courtesy of the Principles for Financial Market Infrastructures. If you help set the standards, you also live by them, it seems.
But here’s the bit the RBA admits it wasn’t expecting.
“RITS services took longer to recover than this recovery time objective (RTO) because of the scale of the event, the loss of all ancillary support systems and difficulties in technicians gaining privileged access to systems. Loss of access to documentation systems that store support procedures also impeded the effectiveness of staff working to re-establish RITS,” the autopsy says.
Tolerable downtime? Try 26 minutes a year
As we mentioned before, the two-hour downtime is meant to be as bad as it ever gets. The normal running target for the FSS actually sits three digits past the decimal point.
“Since the FSS is required to settle real-time payments via the NPP on a 24/7 basis, the Bank has set the availability target for FSS at 99.995 per cent (compared with 99.95 per cent for RITS), which equates to an average of around 26 minutes of allowable downtime per year,” the RBA noted.
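For the back-of-envelope inclined, here’s a minimal sketch of how that arithmetic works out, illustrative only and not anything the RBA runs, converting an availability target into the average downtime it allows over a year:

```python
# Illustrative arithmetic only: turn an availability target into
# the average downtime it permits over a year.

MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowable_downtime_minutes(availability_pct: float) -> float:
    """Average minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    for system, target in (("FSS", 99.995), ("RITS", 99.95)):
        minutes = allowable_downtime_minutes(target)
        print(f"{system} at {target}% -> about {minutes:.0f} minutes of downtime a year")
```

Run it and the FSS target comes out at roughly 26 minutes a year, while the looser RITS target allows around four and a half hours.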
But when the maelstrom hit, there were necessarily some hard choices to be made.
“Consistent with this, on 30 August the Bank’s executive management gave priority to recovering FSS before commencing the recovery of RITS. Ordinarily, this would not cause a material delay to the recovery of RITS, since FSS is designed to recover automatically when one of the sites becomes unavailable. However, due to a combination of factors, not all systems recovered as expected,” the report continues.
“The large-scale loss of supporting technology services and related delays in gaining immediate access to highly secure systems to diagnose the issue and restore services meant that full recovery of FSS took three hours.”
That said, clearing of NPP payments continued through the unscheduled blackout, with some NPP-connected banks and services activating contingency plans “to make funds available in beneficiaries’ accounts for lower-value payments ahead of FSS settlement resuming” or taking “steps to re-route the clearing of customer payments through the direct entry system.”
The view from under the bus
It then became a question of what to reboot first, and what had to wait.
“At the time of the outage, RITS was operating from the affected site as were the servers that automate aspects of the failover of the RITS database. The prioritisation of recovery of FSS caused a delay in commencing work to restore RITS, while the loss of access to RITS monitoring services meant that Bank staff were initially unable to identify the state of RITS operations,” the RBA said.
“Four hours after the power failure, the RITS queue at the alternate site was brought back online and queued transactions began to be settled at this time.”
As for the RBA’s phones, they got slammed too, so the comms plan went to mobiles and SMS.
In terms of being a near miss, the RBA is candid that things could have been a lot worse.
“The potential impact on participants and the broader financial system was greatly diminished by the recovery of systems and completion of settlement on the day of the outage” … which is a very dry way of saying pensions and benefits not landing is an all-round bad place to be.
And next time?
But could it happen again? Not if the RBA can help it. First stop on the tour of “key themes among these lessons learned and follow-up actions” is keeping a sharp eye on what can trip and when to test.
Like, you know, maybe not mid-batch on pension day.
“The Bank has conducted a review of maintenance arrangements for critical infrastructure across all sites (including the data centres) and adequacy of fire safety system testing procedures and controls.
“This has addressed the root cause of the outage, and reduces the risk of another maintenance incident impacting the availability of RITS and other critical services during core operating hours. The Bank has also expanded the range of its technical contingency testing scenarios in order to better simulate events in which several system components lose power simultaneously.”
Three times for bad luck
If resilience means more real estate, well, so be it.
“The Bank plans to move a server that supports the automated failover of the RITS database to a third site, to remove the risk that this server is also impacted by the same contingency that affects systems at a production site,” the RBA said.
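To see why geography matters for the automation itself, here’s a minimal sketch of the third-site idea, assuming hypothetical site names and a crude TCP health check rather than anything the RBA actually operates: an arbiter that sits outside both data centres can still make the promote-or-escalate call when the primary goes completely dark.

```python
# A sketch of why failover automation shouldn't live in the site it protects.
# Site names and the health check are hypothetical, not the RBA's systems.

import socket


def site_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude health check: can we open a TCP connection to the site?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def decide_failover(primary: str, standby: str, port: int = 5432) -> str:
    """Run from a third site: promote the standby only if the primary is dark
    and the standby is healthy. If this logic ran inside the primary site, a
    total power loss there would take the decision-maker down with it."""
    if site_is_reachable(primary, port):
        return "primary healthy - no action"
    if site_is_reachable(standby, port):
        return "primary unreachable - promote standby"
    return "both sites unreachable - escalate to manual procedure"


if __name__ == "__main__":
    print(decide_failover("site-a.example.internal", "site-b.example.internal"))
```

The point of the design is simply that the thing deciding whether to fail over must not share a blast radius with the thing it is failing over.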
The central bank also reckons it’s “identified the issue that prevented the automatic failover of FSS on 30 August, and has implemented a software update that addresses this issue. Prior to this the Bank had put in place an updated manual procedure to allow IT staff to quickly respond if a similar circumstance arises again.”
We suspect that could mean some paper, a torch and an extra set of keys. A footnote to the above statement reveals some interesting detail.
“A repeat of the combination of factors that obstructed automatic failover on 30 August, which resulted from the complete loss of power at the precise moment that a particular process was running, is considered highly unlikely.”
But, as the RBA learned the hard way, Murphy's Law applies.
In the meantime, the RBA is going ahead with a bank-wide resilience review and a cyber resilience review.
And, unlike many of the retail banks it deals with, the RBA is being candid not only about its faults, but also about how it’s fixing them. It's not such a bad example to follow.