Service providers do everything they know how to avoid downtime. Generally the best practice is not to touch a live network. If it ain’t broke, don’t fix it. But change is inevitable, and eventually every network or system will need improvements. The trick is to handle these changes with little to no disruption of running services. That’s the purpose of change management.
Also called change control, change management is a key discipline that every IT professional should seek to master. The term can be applied to business processes, but it has special significance for the IT environment. Change management, in the context of information technology, is a systematic approach to the flow of change in IT infrastructure. Since unplanned outages can be extremely costly, change management is a means of identifying required actions for a given change and coordinating them with others in the organization.
Change management is an integral part of the Information Technology Infrastructure Library (ITIL). Developed to streamline and standardize IT practices, ITIL focuses on what is known as IT Service Management (ITSM). Within this framework, change management is considered a discipline of ITSM.
The creators of ITIL have done extensive work on the development of change management best practices. A product of the U.K. company Axelos, the ITIL methodology is so well received that it has been mapped into ISO 20000 Part 11, which comes from the International Organization for Standardization. The change management process is covered in the ITIL module called ITIL Service Transition.
ITIL is not the only source of defined practices for change control. Cisco has published detailed best practices for change management for many years. The preparations for year 2000 (Y2K) provide an example of industry-wide change control. IT companies across the industry have developed their own strategies for managing moves and changes.
So what is the objective of change management? According to ISO 20000, part 1, 9.2, it is:
To ensure all changes are assessed, approved, implemented, and reviewed in a controlled manner.
That’s the broad overview. Change management is a formalized process. It is part of an effort to define best practices in IT management. The goal is to master the changes.
The ancient philosopher Heraclitus believed in change. He said you could never step into the same river twice. The river may have the same name, but the water that flowed past your feet yesterday is already miles downstream. The point here is that change is an unavoidable part of life – and that goes for your IT environment as well.
Machines break down. They get old. Software becomes obsolete. Business requirements change. Resource availability (money, equipment, personnel) fluctuates. Whether you like it or not, if you are in charge of information technology you will have to deal with the dynamics of continual change. The question is how well you can control it.
The first step to mastering any system is to understand it. In IT infrastructure management, that means tracking all assets with a robust database. Sometimes that requires an audit using software to discover and document what’s actually on the network. The same goes for software licenses and versions: do you have a regular system in place for tracking them? Finally, identifying issues through surveillance, key performance indicator (KPI) tracking, and other analytics will give you an idea of what is happening with your network systems.
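An audit of this kind can be pictured as a comparison between what the asset database records and what a discovery scan actually finds. The following is a minimal Python sketch under that assumption. The function name, the hostnames, and the choice to represent each source as a plain collection of hostnames are illustrative only and do not come from any particular tool.

```python
def audit_inventory(recorded, discovered):
    """Compare the asset database with the results of a discovery scan.

    Both arguments are iterables of hostnames (an assumed, simplified
    representation; real inventories carry far more detail).
    """
    recorded, discovered = set(recorded), set(discovered)
    return {
        # Seen on the network but absent from the database.
        "undocumented": sorted(discovered - recorded),
        # Listed in the database but not seen on the network.
        "missing": sorted(recorded - discovered),
    }


# Hypothetical hostnames, for illustration only.
report = audit_inventory(
    recorded=["core-sw-01", "edge-rtr-01", "db-01"],
    discovered=["core-sw-01", "edge-rtr-01", "test-vm-07"],
)
```

Either list being non-empty is a prompt to investigate before planning changes against an inventory that may be out of date.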
Once you discover that a change is necessary, you will need to find ways to manage and implement the change. That may mean developing step-by-step procedures for routine changes and formalized documents for special planned changes. Those implementing the changes find it helpful to have graphic depictions of the planned change, such as flow charts or diagrams.
Any change control process will have a lot of moving parts. A defined procedure serves to break down that process into its constituent elements to make it more manageable. It is a kind of A-B-C/1-2-3 approach to getting things done. In the change management business, skipping a step can cost the company many thousands of dollars and get the engineer fired. It’s much better to have a plan in place – and follow it.
It’s all part of risk management. What’s the worst thing that could happen? When you’re dealing with live equipment, the answer to that question could be a nightmare. So a good change management plan calls for sure and steady steps – and a way out if things start to go wrong.
The solution? Use a Method of Procedure (MOP). This is a fully developed document that details the planned change from start to finish. One of the reasons for using a MOP is to make sure you’ve thought everything through. We’ve all known IT cowboys who just jump in and do a job without considering the implications. The consequences for lack of planning can be overwhelming. A MOP takes care of all that.
What is a MOP all about? Let’s get into the nuts and bolts.
You’ve got to cover all the bases. Any change management method must be thorough. Let’s start by listing the most common components of the Method of Procedure before we discuss them.
Most of these require no explanation. But as with the procedure itself, skipping parts of the documentation can lead to problems down the road. The change request (CR) may also be called the request for change (RFC). Using a unique identifier, CRs can be tracked within a ticketing or trouble management system. Otherwise, a dedicated change control database can be used.
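To make the idea of a tracked change request concrete, here is a minimal Python sketch of a CR record with a unique identifier. Everything in it is an assumption made for illustration: the `CR-00001` identifier format, the field names, and the in-memory `ChangeLog` class that stands in for a ticketing system or change control database. The status names loosely follow the ISO 20000 objective quoted earlier (assessed, approved, implemented, reviewed).

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class Status(Enum):
    SUBMITTED = "submitted"
    ASSESSED = "assessed"
    APPROVED = "approved"
    IMPLEMENTED = "implemented"
    REVIEWED = "reviewed"


# Hypothetical ID source; a real ticketing system assigns its own identifiers.
_next_number = count(1)


@dataclass
class ChangeRequest:
    summary: str
    requester: str
    status: Status = Status.SUBMITTED
    cr_id: str = field(default_factory=lambda: f"CR-{next(_next_number):05d}")


class ChangeLog:
    """In-memory stand-in for a ticketing system or change control database."""

    def __init__(self):
        self._records = {}

    def submit(self, cr):
        self._records[cr.cr_id] = cr
        return cr.cr_id

    def lookup(self, cr_id):
        return self._records[cr_id]


log = ChangeLog()
first_id = log.submit(ChangeRequest("Replace switch module", "field engineering"))
second_id = log.submit(ChangeRequest("Upgrade firmware", "network operations"))
```

Because every record carries its own identifier, approvals, notifications, and the MOP itself can all refer back to the same CR.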
It may help here to highlight the sections related to risk assessment, cost and benefits, and rollback. If you haven’t fully accounted for the potential negative impacts of the change, then you should postpone it. Just as the patient going into surgery needs to know the risks before he signs off, you should be aware of possible outcomes of the change. Which users will be affected? What parts of the IT infrastructure are involved? Does the change require a separate maintenance window after hours? Can you roll things back to their original state if things go wrong? Do the potential benefits outweigh the costs and risks?
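The questions above can be treated as a go/no-go checklist. The sketch below is one possible way to encode them in Python. The class, its field names, and the wording of the messages are hypothetical. The rule it applies, postponing the change while any question is unanswered, is a simplification of the advice in the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RiskAssessment:
    # None means "not yet answered"; an empty list is a valid answer.
    affected_users: Optional[List[str]] = None
    affected_systems: Optional[List[str]] = None
    needs_maintenance_window: Optional[bool] = None
    rollback_plan: Optional[str] = None
    benefits_outweigh_risks: Optional[bool] = None


def open_questions(assessment):
    """List what still blocks the change; an empty list means it may go ahead."""
    issues = []
    if assessment.affected_users is None:
        issues.append("affected users not identified")
    if assessment.affected_systems is None:
        issues.append("affected infrastructure not identified")
    if assessment.needs_maintenance_window is None:
        issues.append("maintenance window not decided")
    if not assessment.rollback_plan:
        issues.append("no rollback plan")
    if assessment.benefits_outweigh_risks is not True:
        issues.append("benefits not shown to outweigh costs and risks")
    return issues


# A half-finished assessment: the change should be postponed.
draft = RiskAssessment(
    affected_users=["billing team"],
    rollback_plan="restore configuration backup",
)
remaining = open_questions(draft)
```

Here `remaining` still lists three items, so the change is not ready to schedule.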
What about approvals and notifications? Do you have signatures (or confirmation emails) from managers both from the technical and business sides of the house? Some companies run a change control board, which meets regularly to review and approve changes. Those who might lose service during the change should also be informed well in advance.
Do you have a way to test the MOP? Many companies are able to run test cases in the lab before enacting the change in the live environment. Another option is to limit the change to beta customers who agree to be the test subjects. Firming up the procedure before any large-scale implementation is a good idea.
Now we come to one of the best reasons for adopting a robust change management program: eliminating downtime. It may seem counterintuitive to take down a system in order to keep it running, but let’s look at an illustration.
Your car has been acting up for a few days. You hear a strange noise, and you don’t know what it is. You call the garage and get on the schedule for diagnosis and possible repairs. While the car is in the shop, it’s down. You can’t drive it. If you need transportation, you will need to make other plans. If you’re lucky, you might be able to get a loaner car from the shop. But once the mechanic has made the necessary changes, you are good to go. Your change management strategy may have saved you the “downtime” of a future major breakdown on the highway at the worst possible time.
There’s an obvious difference between a planned outage and an unplanned outage. With planned maintenance on your IT infrastructure, you can continually optimize your network and systems architecture for your clients and users. Proactive replacements or upgrades can prevent major headaches down the road. That’s much better than a major breakdown.
Not only can you prevent disruptive events with good change control, you can minimize downtime and maximize service during planned maintenance. Using redundant systems (like the loaner car), you might be able to pull off a change in such a seamless way that none of the users notice it at all. For example, by using something like Total Uptime’s Cloud Load Balancer or Network Failover service, you can seamlessly direct users from one server or one site to another while you perform the work, ensuring continuity of service for end users while maintaining full control. Following your MOP step-by-step offers you the best chances to keep those IT services up and running as expected.
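The general pattern behind this kind of seamless maintenance is to drain one server from the pool, work on it, and then restore it. The Python sketch below illustrates that pattern only. The `ServerPool` class, its method names, and the round-robin routing are assumptions made for illustration and are not the API or behavior of any particular load balancing product.

```python
class ServerPool:
    """Toy round-robin pool that can take a server out of rotation."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._draining = set()
        self._cursor = 0

    def _active(self):
        return [s for s in self._servers if s not in self._draining]

    def drain(self, server):
        """Stop sending new traffic to `server` so it can be worked on."""
        if server not in self._servers:
            raise ValueError(f"unknown server: {server}")
        if [s for s in self._active() if s != server] == []:
            raise RuntimeError("cannot drain the last active server")
        self._draining.add(server)

    def restore(self, server):
        """Put `server` back into rotation after the change."""
        self._draining.discard(server)

    def route(self):
        """Pick the next active server for an incoming request."""
        active = self._active()
        server = active[self._cursor % len(active)]
        self._cursor += 1
        return server


pool = ServerPool(["site-a", "site-b"])   # hypothetical server names
pool.drain("site-a")                      # maintenance begins on site-a
during = [pool.route() for _ in range(4)]
pool.restore("site-a")                    # maintenance complete
after = [pool.route() for _ in range(4)]
```

While `site-a` is drained, every request in `during` goes to `site-b`. The guard in `drain` refuses to remove the last active server, which would turn planned maintenance into an outage.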
You could probably write this section yourself based on your experience in technology. It’s usually the one who thinks he knows it all who falls the hardest. Overconfidence (pride) can be a costly sin in the IT business.
One engineer was working a standard operating procedure (SOP) for a major telecommunications company. Over the phone, he advised the field engineer to replace a module in a telecom switch. What he failed to do was back up the data before the change. The error cost the company several hundred thousand dollars and resulted in significant, needless, and extended downtime.
What’s remarkable about that failed change is that there was a clear step-by-step procedure for the process, and it clearly included a line containing the instruction to back up the data. If he had been careful, slow, and methodical in the process, he would have been fine. So it’s not enough to develop a robust change control system with good documentation. The personnel implementing the change need to follow directions.
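One way to make a step harder to skip is to encode the procedure so that steps run strictly in order, stop at the first failure, and undo whatever was already done. The following Python sketch shows that idea under simple assumptions. The `run_procedure` function, the tuple layout of each step, and the module replacement scenario are all hypothetical and do not reconstruct the incident described above.

```python
def run_procedure(steps):
    """Run MOP steps strictly in order; on failure, undo completed steps.

    `steps` is a list of (name, action, undo) tuples. `action` and `undo`
    are zero-argument callables; `undo` may be None for steps that change
    nothing on the live system (a backup, for example).
    Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, undo in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            # The way out: reverse completed steps, most recent first.
            for done_name, done_undo in reversed(completed):
                if done_undo is not None:
                    done_undo()
                    log.append(f"rolled back {done_name}")
            return False, log
        completed.append((name, undo))
        log.append(f"done {name}")
    return True, log


# Hypothetical scenario: the backup comes first, so a rollback is possible.
state = {"backup": None, "module": "old"}


def back_up():
    state["backup"] = state["module"]


def swap_module():
    state["module"] = "new"


def restore_module():
    state["module"] = state["backup"]


def verify_service():
    raise RuntimeError("no link on new module")


ok, log = run_procedure([
    ("back up data", back_up, None),
    ("replace module", swap_module, restore_module),
    ("verify service", verify_service, None),
])
```

In this run the verification step fails, so the replacement is rolled back and `state["module"]` ends up as `"old"` again. The sketch enforces order in software, but it is no substitute for an engineer who actually reads and follows the MOP.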
A smart IT professional knows when he can play around. That works fine in the lab. You can do it to some extent on a system that’s not carrying live traffic. But when an IT asset is already in production, it’s best to leave it alone – until you have a satisfactory plan in place.
Some of the biggest outages among major service providers have been caused by human error. It’s not enough to believe that you know what you’re doing. Any technician or engineer needs to think twice before tapping on the keyboard on a live system. Change management helps reduce those errors and has the potential to minimize downtime significantly. And it helps well-meaning IT professionals do their jobs with greater quality control. No one who manages information technology is expected to be perfect. After all, we’re only human.