Uptime is a key performance indicator (KPI). Some would say it is the key performance indicator, the sine qua non, of productive computing. If you can’t keep your system operational, you have nothing. None of the many functionalities – the bells and whistles – matter one whit if your customers can’t access your site or service. The expectation in the industry is for near 100% uptime.
So how do you get there? Every company's IT environment is different, but the principles of maintaining uptime are common to all, and none of them come down to luck.
The standard for network uptime is 99.999% availability, commonly known as "five nines." Of course, the only way to verify that figure is to measure it, and what you get depends entirely on what you are measuring.
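It is worth doing the five-nines arithmetic once, because the budget it implies is startlingly small. A quick sketch in Python (the figures follow directly from the percentages; nothing here is assumed):

    # Downtime budget implied by an availability target.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

    for nines, target in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
        downtime = MINUTES_PER_YEAR * (1 - target)
        print(f"{nines} nines ({target:.3%}): {downtime:.2f} minutes of downtime per year")

Five nines allows roughly 5.26 minutes of downtime per year, about 26 seconds per month.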
Identifying how long a server has been operational is fairly easy. On UNIX and Linux computers, for instance, the command is simply uptime, and the results are straightforward: the time since boot, expressed in days, hours, and minutes. It shows how long the machine has been powered on with the operating system running. What it doesn't tell you is how long a particular service has been running, or whether the server has been reachable online.
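The same figure can be read programmatically when you want to feed it into a monitoring system. A minimal sketch, assuming a Linux host, where the kernel exposes the counter in /proc/uptime:

    # Read the kernel's uptime counter on a Linux host.
    # /proc/uptime holds two numbers: seconds since boot and cumulative idle time.
    def uptime_seconds() -> float:
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])

    secs = int(uptime_seconds())
    days, rem = divmod(secs, 86400)
    hours, minutes = divmod(rem // 60, 60)
    print(f"up {days} days, {hours}:{minutes:02d}")

Like the command itself, this only proves the operating system is running; it says nothing about the services on top of it.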
But what if the server is up and running on the internet while its essential services have crashed? What if only some of the users can access the server? Some people use the term availability for this broader metric. It makes no difference if the administrator can ping the server from his workstation when no one else can reach it; as far as the customer is concerned, the server has 0% availability.
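Measuring availability therefore means probing the service itself, not just the host. A minimal sketch that checks whether a TCP service actually accepts connections (the hostname and port are placeholders, not taken from any particular environment):

    import socket

    def service_available(host: str, port: int, timeout: float = 3.0) -> bool:
        # A host that answers ping but refuses this connection still counts as down.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # The machine may be "up" while port 443 refuses every customer.
    print(service_available("www.example.com", 443))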
Anyone trying to measure uptime should first clarify terms and define the measurement appropriately. Uptime on a server is not necessarily the same as uptime for an application on the server.
In the traditional data center, the focus was on ensuring that servers, routers, and switches continued running, that they were available on the network to the customer, and that the performance of these components was satisfactory. Network operations centers (NOCs) used SNMP-based tools to monitor managed objects.
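As a concrete illustration of that kind of polling, here is a minimal sketch using the third-party pysnmp library (the 4.x hlapi interface is assumed, and the device address and community string are placeholders) to read sysUpTime, the managed object that reports how long a device's agent has been running:

    # Poll sysUpTime (OID 1.3.6.1.2.1.1.3.0) from a device over SNMPv2c.
    # Assumes the third-party pysnmp package, 4.x hlapi interface.
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    error_indication, error_status, error_index, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData('public'),                  # community string: placeholder
        UdpTransportTarget(('192.0.2.10', 161)),  # device address: placeholder
        ContextData(),
        ObjectType(ObjectIdentity('1.3.6.1.2.1.1.3.0'))))

    if error_indication:
        print(error_indication)           # e.g. a timeout; the device may be down
    else:
        for name, value in var_binds:
            print(name, '=', value)       # TimeTicks, in hundredths of a second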
In the early days, much of the work was reactive: an icon for a switch turns from green to red, and a NOC technician opens a ticket. In time, proactive automated tickets took over much of that work. Eventually, self-healing networks became a reality, and many of the fixes became automatic.
In today’s world of cloud computing, virtualization, analytics, and artificial intelligence, our monitoring systems have become much smarter. In fact, there is a movement toward autonomous networks with automatic resource allocation. It’s all getting better.
Of course, there is not going to be 99.999% (or 100%) uptime if the resources are not available. Automatic failovers need devices to fail over to. And replacement parts should be readily available, if not already onsite. When a card in a switch fails, even if traffic has been moved to another switch, a technician will still need to be onsite to physically replace it.
But now that so much of our infrastructure is going virtual, the footprint of actual hardware is continually shrinking. Even so, any environment for virtual machines, software-defined networking (SDN), or network functions virtualization (NFV) should have the resources and capacity for seamless failovers or redirects. Much of the current work in these fields is aimed at exactly these issues.
The IEEE has an entire society devoted to the concept of reliability, the IEEE Reliability Society, whose webpage states its purpose: “We want to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support it throughout its total life cycle.”
Reliability is a quality that is essential in our friends as well as our computer systems. Without it, maintaining uptime becomes much more difficult. Better to have a machine or application that keeps going and does what it's supposed to do than to deal with frequent repairs. There are stories on the internet about a Novell NetWare 3 server whose uptime reached 16 years before it was finally shut down.
Thankfully, many of our computing resources are now virtual. That was not the case with ENIAC, the 1940s computer. Its repairmen eventually got to where they could locate and replace one of its roughly 18,000 vacuum tubes in just 15 minutes. Dealing with today's IT environment can be much easier.
That depends, however, on how qualified the people dealing with the problems are. An experienced engineer may resolve in five minutes an issue that would take a newbie several hours. It helps when there are clear processes in place, good documentation, and a robust knowledge base.
The IBM model for assessing the adequacy of a system is called RAS: reliability, availability, and serviceability. The uptime of any system depends on a variety of factors, but chief among them is a commitment to quality. A combination of good design, manufacturing, operation, and maintenance gives a system a better chance. Quality is the key ingredient.
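Those qualities can also be tied together numerically. The classic steady-state formula, standard in reliability engineering rather than anything IBM-specific, computes availability from mean time between failures (MTBF) and mean time to repair (MTTR); the figures below are illustrative, not measurements:

    # Steady-state availability: A = MTBF / (MTBF + MTTR).
    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # A machine that fails once a year (8,760 hours) and takes 4 hours to repair:
    print(f"{availability(8760, 4):.5%}")   # ~99.954%: three nines, not five

    # The experienced engineer's 5-minute fix versus the newbie's several hours
    # shows up directly in MTTR, and therefore in availability.

That is why serviceability sits alongside reliability in the RAS model: cutting repair time buys nines just as surely as preventing failures does.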