The true culprits of data centre failure


20 October 2014
Wong Ka Vin of 1-Net

Bad practices are the biggest culprit behind data centre failures. Operational lapses, processes that are not documented, inconsistency between the day shift and the night shift and failure to hand over information correctly from one point to another – these account for more than 70 per cent of data centre failures, said Wong Ka Vin, managing director of data centre operator 1-Net Singapore.

This is why, beyond looking at how data centres are designed and how they are constructed, customers need to also pay attention to how the data centre is being managed and operated to meet its Tier objectives.

Wong was referring to the Tier Classification System developed by the advisory organisation Uptime Institute as a means to effectively evaluate data centre infrastructure in terms of a business’ requirements for system availability. According to Uptime, the Tier Classification System provides the data centre industry with a consistent method to compare facilities based on expected site infrastructure performance. It also enables companies to align their data centre infrastructure investment with business goals specific to growth and technology strategies.

Speaking at the recent DatacenterDynamics conference Singapore, Wong said customers were getting distracted because they were trying to understand what kind of data centre would be appropriate for their business. “Customers are coming to us with specifications cut and paste from different standards bodies without understanding the implications of what they are asking for.”

Highlighting the folly of this approach, he said, “It is important to understand the standards so you can buy the right product. If not, you could end up buying something expensive that you don’t need, or you could go to a cheap data centre that cannot deliver what you need. On either side of the coin, it’s dangerous.”

In its “Tier Standard: Topology”, Uptime Institute defines the requirements and benefits of four distinct Tier classifications for data centre site infrastructure, each of which aligns with a specific function in the business world and sets the appropriate criteria for power, cooling, maintenance, and capability to withstand a fault. The Tiers are also progressive, with each incorporating the requirements of all the lower Tiers.

Tier 1 provides basic capacity, a dedicated site infrastructure to support IT beyond an office setting, explained Wong. It comes with dedicated space for IT systems and an uninterruptible power supply to filter out power surges and momentary outages. “With Tier 1, once you have a fail scenario, it fails.”

With Tier 2, there is partial redundancy in various components, such as redundant critical power and cooling to provide for maintenance opportunities. This partial redundancy provides an increased margin of safety against IT process disruption that would result from site infrastructure equipment failure, said Wong.

The next two Tiers cater to organisations with rigorous uptime requirements and a need to focus on long-term viability.

Tier 3 means that the data centre is concurrently maintainable, which is more relevant in today’s world, said Wong. It requires no shutdowns for equipment replacement and maintenance, and is important because data centres cannot have scheduled shutdowns. “You need to be able to isolate a component while maintaining it, without bringing down the service and without interruption to customer services. That is the key requirement for businesses - no interruption to my IT load while you fix your stuff.”

Tier 4 builds on Tier 3 by adding the concept of fault tolerance to the site infrastructure topology. Fault tolerance means that if or when individual equipment fails or a distribution path is interrupted, the effects of the events are stopped short of the IT operations. According to Uptime, Tier 4 is justified most often for organisations with an international market presence delivering 24 x forever services in a highly competitive or regulated client-facing market space, such as electronic market transactions or financial settlement processes.
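
The progressive nature of the four Tiers, and Wong's point about neither over-buying nor under-buying, can be illustrated with a small sketch. It is only a simplified reading of the descriptions above, not Uptime Institute's standard: the capability labels, the `Tier` enum and both function names are hypothetical and chosen purely for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four Tiers, ordered from lowest to highest (hypothetical names)."""
    TIER_1 = 1
    TIER_2 = 2
    TIER_3 = 3
    TIER_4 = 4


# What each Tier adds on top of the one below it, paraphrased from the
# article. These labels are illustrative, not official Uptime terminology.
ADDED_BY_TIER = {
    Tier.TIER_1: {"dedicated_site_infrastructure"},
    Tier.TIER_2: {"redundant_components"},
    Tier.TIER_3: {"concurrently_maintainable"},
    Tier.TIER_4: {"fault_tolerant"},
}


def capabilities(tier):
    """Return everything a Tier offers, including all lower Tiers."""
    result = set()
    for t in Tier:
        if t <= tier:
            result |= ADDED_BY_TIER[t]
    return result


def lowest_sufficient_tier(required):
    """Return the lowest Tier covering every required capability, or None."""
    for t in Tier:  # iterates in ascending order
        if set(required) <= capabilities(t):
            return t
    return None


if __name__ == "__main__":
    # A business that cannot accept maintenance shutdowns but does not
    # need fault tolerance lands on Tier 3 in this simplified model.
    need = {"redundant_components", "concurrently_maintainable"}
    print(lowest_sufficient_tier(need).name)
```

In this model, asking for a higher Tier than the requirement set calls for corresponds to the "expensive thing you don't need" that Wong warns about, while picking a lower one leaves a stated requirement unmet.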

From an investment perspective, the design requirements for a Tier 4 data centre translate into a significant increase in cost, said Wong. So the question that he has for customers is, “Given the reality of what needs to be configured, will you be willing to pay the price for me to provide the service?”

In his view, for most scenarios, Tier 3 would suffice. “Keeping in mind the evolution of IT technology and the evolution of people moving into the cloud, Tier 3 really suffices.”

What is more important, he said, is for customers to know if their data centre provider will see through the entire certification process for their specified tier, from design to construction to operations.

The Tier Certification journey starts with the design of the data centre, where operators submit design documents to Uptime engineers for evaluation in order to achieve the Tier Certification of Design Documents.

However, as Wong pointed out, “a Tier 3 design-certified data centre doesn’t mean it is a Tier 3 data centre”. The next stage involves the more strenuous process of working towards the Tier Certification of Constructed Facility, which has to be attained within two years of the design approval. At every stage of the data centre build-up, Uptime sends its engineers to carry out site inspections to ensure that operators do not “short circuit” their design.

The latest addition to the certification process, and in Wong’s view the most critical stage, is the Certification of Operational Sustainability, which addresses the issue of “bad practices”.

The review of the data centre’s operations, which is usually carried out two or three years after it goes live, will help determine “if we are Tier 1 team running a Tier 3 data centre or a Tier 3 team running a Tier 1 data centre.”

The Tier Certification of Operational Sustainability will enable customers to understand if the data centre is being managed or operated in accordance with the Tier objectives, said Wong.

From 1-Net’s point of view, it is also the ultimate goal. “It will enable us to develop a psyche within our team to ‘plan, do, check and act’, by anchoring this thinking on something that is relevant,” said Wong.

“As the industry evolves, customers are getting further and further away from the physicality of what we are doing. As customers move to the cloud, they are spending less time in the facility. It is therefore very important that the data centre operator or service provider has a solid means to reach out to the customer to communicate how we are delivering our services, so that the end customer knows and not just guesses that the service level agreement that I provided for them has backing of right design, and has the backing of the operational team that understands how to operate the design.”