Root Cause Analysis to Maintain Uptime

“Houston, we have a problem…”

For NASA, the disaster aboard Apollo 13 required a systematic approach. And that’s what they got thanks to the Kepner-Tregoe methodology. The decision matrix developed by Charles H. Kepner and Benjamin B. Tregoe was employed at NASA to help bring the astronauts home. It is a step-by-step approach. And it is considered to be the forerunner of modern-day root cause analysis.

What is Root Cause Analysis?

Root Cause Analysis, abbreviated RCA, is a problem-solving approach. The aim is to identify the underlying causes of troublesome events that may lead to availability or other uptime issues. Just as a plant may have a whole system of roots below ground where we can’t see them, the nagging problems that we face in business or IT may be considered systemic. There may not be one individual reason for the problem. There could be a whole host of causes for a recurring issue. With RCA, you are going a bit deeper into troubleshooting than you might normally do.

It would be a misnomer to call RCA a single method. There are many possible methods for investigating the roots of problems. Two of the best-known methods are the Five Whys and the fishbone diagram. (More on these later.) Modern-day RCA tools include analytics and sophisticated statistical software.

The consultants at ThinkReliability have simplified root cause analysis to three basic steps:

Define: What’s the problem?
Analyze: Why did it happen?
Solve: What will be done?

Root cause analysis is, above all, a thinking approach. It involves a logical investigation – just as you might expect on a crime detective TV show. It’s about gathering information, speculating on possible causes, and determining whether your hunches are correct.

Kepner-Tregoe Problem Analysis

The systematic approach of the Kepner-Tregoe (K-T) methodology is so simple as to seem intuitively obvious. But working without a problem-solving plan can be chaotic. As described by the blog Quality Matters, the five steps of the K-T process are:

Define the Problem
Describe the Problem
Establish possible causes
Test the most probable cause
Verify the true root cause

We could elaborate on each of these steps, but the blog writer has already done the work for us. And frankly, the steps are self-explanatory. “This method,” says the writer, “advocates a rational and systematic approach to analyzing a problem without jumping to conclusions or making assumptions based on past experience.”

The Kepner-Tregoe consultancy remains in operation decades after their Apollo experiences. On their website, the company says, “Software and templates don’t solve problems. People solve problems.” They see thinking as a process. “Problem solving should not be a spontaneous grasping for straws to try to plug a leak.”

The K-T method has been influential in the development of root cause analysis. It is more a way of thinking than a specific procedure.

Six Sigma and DMAIC

Six Sigma is a set of methods introduced at Motorola and popularized at General Electric under the direction of CEO Jack Welch. One of the tools in the Six Sigma toolset is a methodology called DMAIC, which stands for:

Define
Measure
Analyze
Improve
Control

There is often overlap among the different approaches. The DMAIC points track closely with the K-T method. DMAIC is often used in conjunction with the best practices defined by the Information Technology Infrastructure Library (ITIL).

The focus of DMAIC is on continuous improvement. It is not enough to measure and analyze a problem. One must also improve the situation and control it. Quality control is a key discipline involved in all these methods.

Key Concepts

RCA investigations are systematic. That means that they are methodical and done according to a fixed plan. You might say that it is a scientific approach to problem-solving. In fact, the steps of root cause analysis are reminiscent of the scientific method that we learned in school.

Dr. Sylvia Wassertheil-Smoller discussed inductive reasoning and the scientific method with the online publication LiveScience: “We make many observations, discern a pattern, make a generalization, and infer an explanation or a theory.” Root cause analysis is like that.

Root cause analysis is about causation and correlation. Scholars like to explain the difference between causation and correlation. A correlation between two events does not prove that there is a causal relationship. But IT professionals are used to seeing patterns and trends that point to root causes.

Here’s an example of correlation. Years ago at the Frame Relay help desk of a major telecommunications company, techs noticed that there were a significant number of alarms on Friday afternoons at about 5:00pm. It didn’t take much logic to determine the cause. The Frame Relay outages coincided with quitting time before the weekend. Customers were turning off their computers and their modems and going home.
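A pattern like this one can be surfaced by counting alarms per weekday and hour. The following is a minimal sketch in Python, not a description of any help desk tool: the timestamps are made-up sample data, and the function name busiest_slots is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Index matches datetime.weekday(), where Monday is 0.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def busiest_slots(timestamps, top=3):
    """Count alarms per (weekday, hour) and return the most common slots."""
    counts = Counter((DAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return counts.most_common(top)

# Made-up sample data: three Friday alarms just after 5:00pm
# and one unrelated Tuesday-morning alarm.
alarms = [
    datetime(2021, 3, 5, 17, 2),
    datetime(2021, 3, 12, 17, 5),
    datetime(2021, 3, 19, 17, 1),
    datetime(2021, 3, 9, 10, 30),
]

print(busiest_slots(alarms, top=1))
```

A cluster in a single slot is only a correlation. It tells you where to look, and it still takes a plausible explanation, such as quitting time, to call it a cause.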

This idea of correlation ties in with an approach that is closely linked to RCA: trend analysis. It starts with identifying relevant parameters and setting thresholds. Key performance indicators play a role in network management. They can also be instrumental in determining root causes.

These days lots of analysis is done using big data and analytics. Mature data mining techniques make better root cause analysis possible. Pinpointing root causes with sophisticated technology is easier than ever.

Tools and Methods

The first method that most people talk about when explaining root cause analysis is something called the Five Whys. There is no magic here. Simply ask questions until you get to a deeper understanding of the root cause of the problem. This method is not exactly scientific, but it can be a quick and dirty way to figure things out.

A common example used is of a leak on the data center floor. It goes like this:

There is water on the data center floor.
Why is there water on the data center floor?
Because it is dripping from one of the ceiling tiles.
Why is it dripping from one of the ceiling tiles?
Because there is a water line above the ceiling.
Why is there a water line above the ceiling in the data center?
Because the contractor installed it there.
Why did the contractor install it there?
Because that was the lowest cost option.
Why did we choose the lowest cost option?

As you can see, it’s not a perfect method. The answers could go a lot of different ways. Questioning the wisdom of hiring that data center contractor is one of them. But another person answering those questions could have gone in a different direction. At least the Five Whys can start you thinking.

Another method used in RCA is called the fishbone diagram. The aim is to identify multiple possible causes instead of just one. Imagine that the problem is in the form of the head of a fish. Behind the head is the diagram of a fish skeleton. Each of the bones slanting backward represents a possible cause. The method was developed by Japanese quality control expert Dr. Kaoru Ishikawa. The fishbone diagram can be developed in a number of ways.

Of course, these two methods are rather off-the-cuff. They are exercises that could be done in a matter of minutes. This does not necessarily get to the bottom of recurring problems that just won’t go away. For that you may need software solutions.

Such software might include the Five Whys, the fishbone diagram, and a whole set of other tools and methods for analyzing issues. A quick internet search will help you find RCA software. Telecom and IT companies have their own ways of isolating issues. That may include spreadsheets, databases, KPI reports, or applications developed in-house. It helps to cast a wide net.

Trends in IT Problems

Discerning a pattern, as suggested by Dr. Wassertheil-Smoller, is one of the concepts of scientific observation. In information technology, it is standard practice to regularly collect information about alarms, events, and other notifications regarding the IT infrastructure. These may be held in a database for an extended period. That’s a very good idea.

It is common for a network engineer, when evaluating an intermittent problem, to refer to the historical record. In the old days these records may have been held in a simple file on a Linux or UNIX server. Using the grep command, the engineer could search for specific log entries that might shed light on a given problem.
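The same kind of search can be sketched in a few lines of Python. This is only an illustration of the idea: the log format and device names below are invented, and grep_lines is a hypothetical helper, not a standard function.

```python
import re

def grep_lines(lines, pattern):
    """Return the lines that match a regular expression, like a basic grep."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical log lines; the format is invented for illustration.
log = [
    "2021-03-05 17:02:11 router1 LINK-DOWN port=3",
    "2021-03-05 17:03:40 router2 CPU-HIGH util=91",
    "2021-03-06 09:15:02 router1 LINK-UP port=3",
]

# Pull out every link state change reported by router1.
for line in grep_lines(log, r"router1 LINK-"):
    print(line)
```

In practice the lines would be read from a log file rather than a list, but the filtering step stays the same.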

These days there are plenty of tools to deal with all that. Root cause analysis benefits from sophisticated software applications that help identify patterns and trends in historical data.

One of the ways to do that is to set thresholds. This is the same way that SNMP traps are generated. When a particular parameter exceeds the limits of the threshold, a notification is created. Using thresholds, it’s possible to collect a whole host of data to be used for statistical analysis at a later date. This way you are no longer feeling your way through with thought processes like the fishbones. You are crunching real numbers to get real results.
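As a rough sketch of the threshold idea, the Python below records an event each time a reading rises above a limit. The readings and the 80 percent limit are assumed values, and threshold_events is a hypothetical name; this does not reproduce how any particular SNMP agent generates traps.

```python
def threshold_events(samples, limit):
    """Return (index, value) for each sample that crosses above the limit.

    An event is recorded only on the upward crossing, not for every
    sample that stays above the limit.
    """
    events = []
    above = False
    for i, value in enumerate(samples):
        if value > limit and not above:
            events.append((i, value))
        above = value > limit
    return events

# Made-up CPU utilization readings (percent) and an assumed 80% threshold.
cpu = [42, 55, 83, 91, 60, 85]
print(threshold_events(cpu, limit=80))
```

Recording only the crossing keeps the event list short. Whether to log every sample above the limit instead is a design choice that depends on how the data will be analyzed later.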

How RCA Affects Uptime

The nagging problems you experience now could turn into bigger issues later. If you are getting a lot of alarms from a particular network element, it’s best not to ignore them. The time that you take to analyze and correct these issues ultimately determines the uptime of your IT infrastructure and could save your company and your job in the long run.

Did you ever notice that something wasn’t quite right about a system process or a network connection? It seems to work, but it may be slow or give strange signs that there is something wrong. Now you could continue that way indefinitely without any incident. But it may be that when there is more of a load put on the connection or process, significant unplanned downtime will result. Conducting a thorough preemptive investigation using the best practices of root cause analysis could save you a lot of headaches down the road.

What about a major outage? You are relieved when everything is working again, but do you really know what caused it? If you don’t investigate the causes and take appropriate actions, the same kind of outage could happen again.

Conclusion

Root cause analysis is an excellent tool for keeping your IT infrastructure healthy. You may need some in-depth troubleshooting to correct an ongoing issue. Or you may be tasked to do a postmortem on a problem that is already resolved. RCA is also a very good approach for dealing with intermittent issues. Whatever the situation, root cause analysis can be your best friend if you are trying to keep an IT service up and running well.
