How a 50c fan pushed Teachers Mutual Bank to active-active

By Ry Crozier on Mar 23, 2015 6:43AM

Core banking outage forces data centre overhaul.

It was a normal Thursday afternoon in October 2013 when Teachers Mutual Bank CIO Dave Chapman was called out of a meeting.

Waiting outside were the bank's CEO and deputy with some unwelcome news: core banking services were dead.

The problem was initially traced to a power supply failure in a disk tray in the core banking system storage array.

"There's always two power supplies in a disk tray, as I'm sure most of you are aware," Chapman told delegates of the Australian Data Centre Strategy Summit.

"If one fails the other one is supposed to cut in - and it did."

The failure caused an automated note to be sent to the bank's infrastructure team, alerting them so they could request a replacement power supply be shipped out.

"Normally you've got a while," Chapman said. "Your SLA is four hours to get that replaced, etc."

Except in this case.

"It turned out the fan on that second supply [in the disk tray] was faulty - a 50-cent half-amp fan," Chapman said.

"The fan died straight away, the second power supply heated up, turned itself off and our core banking system went straight down.

"This was only a few minutes after the alert to infrastructure that the first power supply had gone."

To complete a perfect storm, the parts supplier's logistics system was down when the bank's infrastructure team tried to order the replacement part (not realising it was the only failed component).

"When it was communicated through to their logistics department that Teachers Mutual Bank needed a part, a digit was transposed and so the wrong part was delivered some four hours later," Chapman said.

"Overall it took 27 hours for that part to arrive onsite, by which time we'd well and truly cut over to our disaster recovery option."

Though the bank had been able to switch to its disaster recovery site, the cutover took seven hours.

"That seven hours wasn't process - that bit was pretty easy," Chapman said.

"About four hours of it was copying data to and from the primary and secondary data centre, working out what's still working, reconciling all the data and getting it right."

Chapman knew the answer. The bank already had data centres in Western Sydney, about 30km apart - it just needed to operate them in a different fashion to the primary-secondary model.

"We realised that we were really close to an active-active environment if we could only beef that up," he said. "So we went there."

The ingredients for creating an active-active environment included upgrades to network, storage and server configurations.

Deploying dark fibre

The bank replaced a 200Mbps MPLS network connection with Telstra dark fibre (with up to eight 10Gbps links per fibre) that connects the two data centres and the bank's Parramatta home loan office.

The upgrade provided not only the bandwidth necessary to run the data centres in an active-active mode, but also a solution to bottlenecks in the existing environment.
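To put the two link speeds side by side, here is a rough back-of-the-envelope sketch. Only the 200Mbps and eight-by-10Gbps figures come from the article; the 500GB replication volume is a hypothetical number chosen for illustration, and the calculation assumes all eight links are lit and ignores protocol overhead and contention.

```python
# Illustrative arithmetic only. The link speeds are from the article;
# SAMPLE_GB is a hypothetical data volume, not a figure from the bank.
OLD_LINK_MBPS = 200           # previous MPLS connection
NEW_LINK_MBPS = 8 * 10_000    # up to eight 10Gbps links per fibre


def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Ideal hours to move data_gb (decimal gigabytes) over a link.

    Ignores protocol overhead, latency and competing traffic.
    """
    megabits = data_gb * 8 * 1000  # gigabytes -> megabits
    return megabits / link_mbps / 3600


speedup = NEW_LINK_MBPS / OLD_LINK_MBPS

SAMPLE_GB = 500  # hypothetical replication volume
old_hours = transfer_hours(SAMPLE_GB, OLD_LINK_MBPS)
new_hours = transfer_hours(SAMPLE_GB, NEW_LINK_MBPS)

print(f"Raw capacity ratio: {speedup:.0f}x")
print(f"{SAMPLE_GB}GB over old link: {old_hours:.2f} hours")
print(f"{SAMPLE_GB}GB over new link: {new_hours * 3600:.0f} seconds")
```

On these assumptions the fully lit fibre has several hundred times the raw capacity of the old link, which is why replication traffic no longer has to compete with core banking on the same pipe.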

"With the previous network, user experience was impacted all the time because we had replication happening over the same pipe as normal core banking," Chapman said.

"You can't really control a lot of that replication - it just cuts in when it's needed. So everyone was working away in the middle of the day and suddenly one of the replication tools cut in and started pumping data across that pipe."

The new dark fibre loop has already averted issues that may have caused outages before.

"I had to send a note to the COO saying 'just wanted to let you know a switch just died in [one of our data centres] and had we not done the stuff that we as the executive approved a while ago, the Parramatta home loans office would have been offline for the day'," Chapman said.

"But it just found a redundant path around [the fault], went the other direction and no one knew it was broken."

The bank is planning a significant storage refresh - moving from EMC arrays and badged Brocade switches to HP arrays and HP-badged Brocade switches - to take advantage of features such as de-duplication, compression and zero stripping to reclaim "critical storage space".

Nothing will change on the hardware side, but the company will shift to having a single vCenter server across both sites, "giving us a VMware Metro Cluster".

The biggest change, however, was disaster recovery and replication.

"At the end of the day it's really altered the way we think about DR and business continuity planning," Chapman said.

"DR's no longer an issue for us because we're automatically [protected] whichever site we're on.

"We don't have to worry about all the things we used to worry about."
