Tackling a Christmas light-up

18 May 2014
Mayda Lim, Head of Implementation & Support, Technology Operations, Thomson Reuters

What do you do when there is a Christmas light-up? Mayda Lim, Head of Implementation & Support, Technology Operations, at Thomson Reuters, shared some insights into this and other aspects of managing a major incident during a recent IT Service Management Community of Practice sharing session at the Institute of Systems Science.

Lim’s department supports the Thomson Reuters Finance and Risk organisation, which provides real-time trading data to customers. Through a private cloud infrastructure, it connects exchanges, news feeds and contributors, and redistributes information to customers at a rate of 2.6 million updates a second. The operation is as real-time as it gets. “Every millisecond or nanosecond, there are traders making money. This is where wealth is generated,” said Lim.

Resilience is a key aspect of the infrastructure, both at the shared infrastructure level and at the services level. Hot-hot data centres ensure that if one facility were to go down, another would kick in immediately, or if one rack were to fail, another would take up the load.

Thomson Reuters also took the additional step of putting in a major incident control process to manage outages, even though it already had resilience in place. “It is not about getting insurance, but managing unknown risks,” said Lim.

The foundations for this were first laid in 2002 when Thomson Reuters embarked on its ITIL (IT Infrastructure Library) journey. ITIL is a set of practices for IT service management that focuses on aligning IT services with the needs of business. It provided the organisation with the toolsets and processes required to scale when it came to service management.

In 2004, when Thomson Reuters moved from a distributed IT environment to a more centralised one, it decided to further fine-tune its incident management processes and adopt best practices in major incident control.

Within Thomson Reuters, a major incident is defined as one which involves a partial or complete service failure leading to extreme impact on the business. The causes of these incidents vary. For example, a major incident could be triggered by a single change to a server that leads to an outage, or by the breakdown of certain legacy assets.

When these incidents lead to brand damage, or when the company loses resilience and the path to recovery is not clear, the incident control process is triggered and a war room is set up to coordinate recovery efforts.
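The activation rule above can be read as a simple boolean condition. The sketch below is one possible reading, not Thomson Reuters' actual criteria: the field names are hypothetical, and it assumes that "loses resilience" and "path to recovery is not clear" must both hold, while brand damage alone is sufficient.

```python
from dataclasses import dataclass


@dataclass
class IncidentStatus:
    # All three fields are hypothetical names chosen for illustration.
    brand_damage: bool          # the incident is damaging the brand
    resilience_lost: bool       # no standby facility or rack left to take the load
    recovery_path_clear: bool   # the team knows how service will be restored


def should_activate_icc(status: IncidentStatus) -> bool:
    """Return True when the incident control process should be triggered.

    Assumed reading: brand damage alone triggers it; otherwise resilience
    must be lost AND the path to recovery must be unclear.
    """
    return status.brand_damage or (
        status.resilience_lost and not status.recovery_path_clear
    )
```

Under this reading, an outage where resilience is lost but the recovery path is already clear would stay within normal incident management rather than opening a war room.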

The deliverables

There are three keywords that encapsulate what the Incident Control Centre (ICC) has to deliver, said Lim. The first of these is “communicate”. “The stakeholders need to know if there is a major incident. If anything is wrong, we need to tell the customer immediately.”

For example, internal staff will be notified if some trading data is not available, or external customers and stakeholders will receive service alerts if a particular data set is suspect, so that they will stop trading.

The second keyword for the war room is “escalate”. This is important because the company is supported by multiple IT vendors looking into different aspects of its operations, for example, networking, data hosting, and others. “We need to make sure that we escalate to the right group and the right vendor to help us do restoration.”

The third keyword is “prioritise”. When a major incident occurs, there will be a “Christmas light-up”, said Lim. “When an alarm storm is triggered and the screen is all red, what will have the highest priority?”

The parties in the war room will have to make the decision and prioritise which segment is to be opened up first. For example, if the incident occurs before trading opens in Asia, it could be that Japan gets priority because the market opens an hour before Singapore, and there would be another 60 minutes to recover the Singapore operations.
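The Japan and Singapore example amounts to ordering affected markets by how soon they open. The following is a minimal sketch of that idea only. The `Market` type, the `recovery_order` function and the opening times are illustrative assumptions, and a real war room would weigh many other factors.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Market:
    name: str
    opens_at: datetime  # next market open, expressed in one shared timezone


def recovery_order(markets, now):
    """Sort affected markets so the one opening soonest is recovered first.

    Markets that are already open get a time-to-open of zero and so
    sort ahead of markets that have not opened yet.
    """
    return sorted(markets, key=lambda m: max(m.opens_at - now, timedelta(0)))


# Illustrative times only: Japan opens one hour before Singapore.
now = datetime(2014, 5, 19, 7, 0)
affected = [
    Market("Singapore", datetime(2014, 5, 19, 9, 0)),
    Market("Japan", datetime(2014, 5, 19, 8, 0)),
]
for market in recovery_order(affected, now):
    print(market.name, "opens in", market.opens_at - now)
```

With these assumed times, Japan sorts first, leaving a further 60 minutes to recover the Singapore operations.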

The ICC process has to be focused on the customer, said Lim. She cited the example of Hurricane Sandy, which brought down many data centres in New York in 2012. Some services like email took a hit, but whilst these were critical internally, they were not critical for the customer. The priority would thus be to recover for the customer first.

Lifecycle and roles

The ICC can be activated at any time of the day, 24x7x365, and everyone who has a functioning role will have to be available to support the process.

The Incident Management Group (IMG), which includes roles such as the incident manager, recovery manager, operations service manager and communications manager, meets within 30 minutes of the ICC being initiated. It is responsible for looking into decisions such as: What are the resources we can deploy? How much money do we want to draw to recover the service? Are we willing to pay a premium to have an alternative path to bring the system up?

For example, if there is an outage at a data centre, the IMG will have to consider the investments needed to recover the service as quickly as possible. But if the particular market is about to close, the decision could be not to invest in additional resources but to do the recovery overnight. With the IMG in this role, the Incident Recovery Team (IRT) can focus on technical recovery instead of shouldering all the responsibility for the investments and costs involved, because these require business justification.

The IRT comprises technical people doing the recovery work. They will need to explore the technical options for recovery and present them to the IMG, so that a decision can be made as to which course of action to take.

Another key process in the war room is the Management Team Meeting (MTM), which is convened to make decisions based upon information provided by the IMG. These meetings are capped at 20 minutes to ensure that they are very focused on decision-making to facilitate the recovery process. The MTM also provides support and guidance as appropriate, and if necessary, it will escalate the incident to the C-level Emergency Management Committee.

Through the ICC, therefore, there is a consistent approach for carrying out the recovery and the same procedure is then replicated elsewhere if other countries are affected by the major incident.

The war room remains open 24x7 for the duration of the incident, which could run for days. “We will stand it down when there is a clear path of recovery and everyone has agreed on the option that we have decided, or when the service is restored,” said Lim. “A conscious and recorded decision will be made to stand down all ICCs and the service alert is updated to reflect the fact that the ICC has been closed.”