The Reserve Bank of Australia has published a post-incident report for a real-time payments outage in October.
The outage lasted from 7pm on October 12 into the early hours of October 13, and delayed hundreds of thousands of payments, some of which took more than five days to clear.
At the time, the bank said a change to the software that provisions its virtual servers experienced an “operational error”.
“This error triggered a process that disrupted a significant number of servers in a random pattern over a period of approximately 25 minutes,” the RBA wrote in its post-incident report [pdf].
The scale of the outage was exacerbated by “a failure to comply with the RBA’s technology change management policy”, the review said, while “control gaps associated with the virtual server solution design contributed to the rapid propagation of the error.”
Redundancy features of the Reserve Bank Information Transfer System (RITS) and the Fast Settlement Service (FSS) kept some systems running, while “some services became unavailable and the resilience of the system was severely degraded.”
“The scale and haphazard pattern of disruption significantly complicated the incident response”, the report said.
The report noted a lack of real-time monitoring of FSS transaction performance, adding that it “took the RBA too long to determine the extent of settlement aborts occurring.”
“The RBA will investigate, and where necessary implement, improvements to its monitoring that could have detected this and discuss options with NPP Australia as to whether its participant communication options can assist,” the central bank said.
Recovery procedures will also be reviewed, because “observing the principle of ‘first do no harm’ likely prolonged the recovery time”.
More “timely” and “assertive” action to restart a component responsible for FSS settlement notifications would have helped, the report said.
Visibility was again cited as a problem: “If better information on the severity of the impact had been known, a more positive approach to system recovery may have been adopted earlier.”
“This could have included self-suspension of the FSS [payment gateways] to halt NPP payments, which would have reduced the need to manage aborts and orchestrate subsequent retries and manual reconciliation.”