The case of N-1 generators


7 October 2015
Ed Ansett of i3 Solutions Group

How did a pneumatic seal the size of a 20-cent coin bring down a data centre carrying a third of the world’s Internet traffic?

This was one of the case studies presented by Ed Ansett, chairman of i3 Solutions Group, during a standing-room-only presentation at the recent Data Centre Week conference in Singapore.

The case of N-1 generators took place on a hot day when the data centre was running at full load - 7.2MW - with four 2.5MW generators installed, so it was N+1 configured.

Reconstructing the sequence of events, Ansett described how one generator failed to start – a 1 per cent probability that was realised in this case. The data centre was now N configured, and running on three generators. About 30 minutes later, one of the three running generators failed. The data centre was now N-1 configured, with 5MW capacity supporting a 7.2MW load.

The two remaining generators were overloaded within 60 seconds, and power to the cooling systems was lost. The data centre ran on UPS for another 30 minutes, and then there was a total shutdown.
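The capacity arithmetic behind this sequence can be sketched in a few lines of Python. This is only an illustration built from the figures quoted above (a 7.2MW load and 2.5MW generators); the function names are hypothetical and it ignores real-world factors such as derating on a hot day or load shedding.

```python
import math


def redundancy_label(units_available: int, unit_mw: float, load_mw: float) -> str:
    """Label a generator plant as N+k, N or N-k for a given load.

    N is the minimum number of units needed to carry the load.
    Hypothetical helper for illustration only.
    """
    units_needed = math.ceil(load_mw / unit_mw)
    spare = units_available - units_needed
    if spare > 0:
        return f"N+{spare}"
    if spare < 0:
        return f"N-{-spare}"
    return "N"


def shortfall_mw(units_available: int, unit_mw: float, load_mw: float) -> float:
    """Return how far capacity falls short of the load (0.0 if it does not)."""
    capacity = units_available * unit_mw
    return round(max(0.0, load_mw - capacity), 3)


if __name__ == "__main__":
    # Figures from the case study: 7.2MW load, 2.5MW per generator.
    for units in (4, 3, 2):
        label = redundancy_label(units, 2.5, 7.2)
        gap = shortfall_mw(units, 2.5, 7.2)
        print(f"{units} generators: {label}, shortfall {gap}MW")
```

With four or three generators the plant still covers the load; with two it is 2.2MW short, which is why the remaining units overloaded.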

The fallout was significant. The outage caused massive global disruption to the Internet. It took 6 hours for utilities to come back on, and another 8 hours before the data centre was fully restored. The data centre failure led to litigation and financial penalties, and resulted in reputational damage.

“So what have we learned? Why did the two generators fail?”

As Ansett pointed out, data centre failures are often the result of two or sometimes three events that happen at around the same time and are usually not foreseen.

In this case, two generators failed, and they failed for the same reason. As he explained, the generators used a pneumatic starter, and while the generators themselves were well maintained, the pneumatic starting system was not. So what happened was that a failed high pressure seal led to pressure loss in the pneumatic system.

How can data centres prevent something like this from happening again?

Ansett pointed out that the system itself had resilience. If the operations team had been trained to use the pneumatic system bypass, they would have been able to divert air from another source to start the generators up. However, they were not trained to do so.

Management is often complicit in these failures, he said. They react to cost pressures by reducing the operating budget to a point where people are not sufficiently trained.

Drifting DRUPS and transformer core saturation

Another example cited by Ansett involved a data centre for financial services trading. The IT load at the data centre was served by a dedicated 2N power system using a dynamic UPS.

Prior to the failure, a significant 100ms utility event occurred, and this caused the diesel rotary uninterruptible power supply (DRUPS) to start. During the process, one of the DRUPS lost frequency control of the output voltage, and the DRUPS output frequencies started to drift apart.

“They malfunctioned but they did not fail,” Ansett pointed out. But this created an out-of-phase condition which was presented to the static transfer switch (STS), which transferred the load to an alternative source.
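To see why drifting frequencies produce an out-of-phase condition, consider two sources that start in step: the phase angle between them grows by 360 degrees for every second of elapsed time multiplied by the frequency difference in hertz. The Python sketch below illustrates this. The 50Hz and 49.9Hz values are purely hypothetical, chosen for the example, and are not figures from the incident.

```python
def phase_difference_deg(freq_a_hz: float, freq_b_hz: float, elapsed_s: float) -> float:
    """Phase angle of source A relative to source B, in degrees.

    Assumes both sources were exactly in phase at t = 0 and that each
    frequency stays constant afterwards (a simplifying assumption).
    The result is folded into the range [-180, 180).
    """
    raw = 360.0 * (freq_a_hz - freq_b_hz) * elapsed_s
    return (raw + 180.0) % 360.0 - 180.0


if __name__ == "__main__":
    # Hypothetical values: one source holds 50.0Hz, the other sags to 49.9Hz.
    for t in (0.0, 0.5, 1.0, 2.5):
        angle = phase_difference_deg(50.0, 49.9, t)
        print(f"t = {t:>4}s  phase difference = {angle:6.1f} degrees")
```

Even a 0.1Hz difference in this example opens up a gap of about 36 degrees after one second, so two sources that are only slightly mismatched can be well out of phase by the time a transfer takes place.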

As a result, the transformer core became saturated. The voltage output sagged for 200-300ms, the IT load voltage dropped to an unsustainable level, and the power failure led to a complete data centre failure and a halt in trading.

So what went wrong? Ansett pointed out that the causal factor – the transient utility event – was not really a factor. “We expect utility events to occur; we expect DRUPS to lose frequency control.”

The root cause, he said, lay in the STS settings. “We expect the out-of-phase condition, so these should have been designed and tested for, but the STS was not enabled for that kind of event. It could have been. But not set up correctly.”

To prevent such a failure, the STS delay transfer function could have been enabled so that the transfer would still take place, but not in a way that saturates the transformers, said Ansett.

So whose fault was it? Part of it has to do with the design, said Ansett. “Designers must always be cognizant of using the STS delay transfer function.”

There was also a failure on the part of those commissioning the system to test for the out-of-phase condition, said Ansett. “Commissioning didn’t check the STS delay setting or test the STS out-of-phase input condition.”

The importance of learning from failures

Other scenarios cited by Ansett in his presentation included the use of residual current devices in the data centre without clarity on the values at which the devices actually trip; and the noise generated by high pressure gas fire suppression systems in the data centre, which has been linked to HDD failure.

In Ansett’s view, it is very important to share information about these and other data centre failures – something which the industry is not doing. “They get people to sign non-disclosure agreements, they don’t share information, and so we are not learning from failures.”

And these failures are recurring. “I keep seeing the same problems occurring from data centre to data centre to data centre because we are not sharing information very well.”

He contrasted the data centre industry with sectors such as aviation, which are the “complete opposite”, possibly because human lives are involved. Unlike those sectors, the data centre industry is not regulated.

The data centre industry is a subset of mission critical sectors such as petrochemicals, aviation and nuclear systems, but it is relatively young, said Ansett. There is currently no authority overseeing it and telling it what it should be doing. The closest it gets are the safety guidelines set out by government utilities regulators or the mandates laid down by financial authorities, but the industry per se is not regulated.

Today, the confidentiality that surrounds data centre failures continues to create “massive problems” for the industry, said Ansett. “They will say, ‘Tell us what’s wrong; tell us how to fix it; but don’t talk about it.’ But how do we learn if failures are not shared?”