Redundancy: When Too Much is Just Right

Redundancy is indispensable in the world of information technology. Of course, redundancy is not welcome in every aspect of life. If your company doesn’t need you anymore and makes you “redundant”, you’ll have to look for another job. Poorly written text may be credited to the Department of Redundancy Department. The concept of redundancy is that there is too much, an excess, more than is needed, a superfluity, a superabundance. But in information technology or aviation engineering, that can be a very good thing.

Failure Is an Option

A Boeing 747 can fly safely on a single engine. So why did company engineers put four of them on the airplane? Just in case. Before June 24, 1982, aviation experts believed that there was a ridiculously low chance that all four engines on a Boeing 747 could fail simultaneously. But that’s exactly what happened when, at 37,000 feet over the Indian Ocean, a British Airways flight lost all four engines and continued for 80 nautical miles in unpowered flight before the engines restarted. The culprit was volcanic ash.


The story is recounted in a fascinating technical paper about redundancy in engineering written by John Downer and published in May 2009. The paper is called “When Failure is an Option: Redundancy, reliability and regulation in complex technical systems”, and it is well worth reading. Downer says that “redundancy is the single most important engineering tool for designing, implementing, and – importantly – proving reliability in all complex, safety-critical technologies”. He tells how the computing pioneer John von Neumann discussed the idea in his writings in 1956. And he quotes Yale sociologist Charles Perrow, who wrote: “Two engines are better than one; four better than two.”

Any IT professional should be happy when a backup component takes over after the primary one fails. That means that the redundant system is doing its job. As Downer explains, “An element is redundant if it contains backups to do its work if it fails; a system is redundant if it contains redundant elements.” But as in the case of the 1982 Boeing 747 flight, even that may not be enough.

Single Points of Failure

Redundant systems may be all the rage in the world of IT, but a design that contains a single point of failure is anathema. A single point of failure (SPOF) means that an entire system can go down if that one element of the system fails. The concept is commonly bandied about in IT circles, but it’s just as applicable to any engineered system.

If you are in an airplane that has only one engine and that engine dies, you’re in real trouble. The same applies to an IT system without redundancy. Techopedia explains it this way: “Highly reliable systems are designed without SPOFs. This means that failure of a component, system or site does not halt system or operational functions.”


It just makes sense to have a spare on hand when you need it. You carry around a spare tire in your car. Stores with point-of-sale cash registers keep extra rolls of printing paper close by as bench stock. And you can probably think of many other areas of life where backups and redundancy make it possible to continue even after a single element fails. The same is true for critical elements in IT. Systems that don’t provide redundancy and leave users vulnerable to single points of failure are just examples of poor system design.
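One rough way to spot single points of failure is to list each component alongside the job it does and flag any job that only one component performs. The sketch below assumes a hypothetical inventory; the component names and roles are made up for illustration and do not come from any real system.

```python
from collections import Counter


def find_spofs(components):
    """Return the roles that are served by exactly one component.

    `components` is a list of (name, role) pairs. A role with a single
    component behind it is a candidate single point of failure.
    """
    counts = Counter(role for _name, role in components)
    return sorted(role for role, n in counts.items() if n == 1)


# Hypothetical inventory, for illustration only.
inventory = [
    ("router-a", "internet uplink"),
    ("psu-1", "power supply"),
    ("psu-2", "power supply"),
    ("db-1", "database"),
]

print(find_spofs(inventory))  # ['database', 'internet uplink']
```

A real audit would also have to follow dependencies (two servers plugged into the same power strip still share a weak spot), but even a simple count like this tends to surface the obvious gaps.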

Different Scopes of Failure

In order to develop a sound redundancy plan, it’s a good idea to anticipate how things could fail. Of course, no matter how well we prepare, Murphy’s Law could take over and things could fall apart. But just as in our recent blog post about proactive maintenance, being proactive about potential failures is going to pay off in the long run.

For this section, we look to a source from ComputerWeekly called “How to plan and manage datacentre redundancy”. The spelling gives away that the author is a Brit, and a bit of digging tells us that Clive Longbottom is a respected industry analyst and the founder of Quocirca. In this article Longbottom covers the whole gamut of possible IT failures, from component failure to the destruction of the world. (In that case, you would blindly send all your data into space with the hopes of recovering it in the future from a distant planet — hardly likely.)

The more realistic problem is that a single element fails, such as a hard drive, a server, a power supply, or a connector. In this case, you could easily use N+1 protection: install one more unit than the N you need to carry the load. In the simplest case, that means one primary and one backup. You could have server mirroring, a backup connection, or an uninterruptible power supply.
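The N+1 idea comes down to simple arithmetic: if N units are needed to carry the load and N+1 are installed, the system can lose one unit and keep going. Here is a minimal sketch, assuming identical and interchangeable units; the function name and the numbers are illustrative only.

```python
def survives_failures(installed, required, failed):
    """Return True if enough working units remain to carry the load.

    Assumes every unit is identical and can stand in for any other.
    """
    return installed - failed >= required


required = 3              # N: units needed to carry the load
installed = required + 1  # N+1: one spare

print(survives_failures(installed, required, failed=1))  # True
print(survives_failures(installed, required, failed=2))  # False
```

With `required = 1`, this reduces to the familiar case of one primary and one backup.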

The next level of failure Longbottom calls assembly failure. This is common in multiservice devices such as a chassis-based network switch. This type of switch might have a second controller or transmission card installed in its modular shelf. When one card fails, all processing or traffic simply fails over to the backup card. This modular approach ensures that the IT service remains uninterrupted. Of course, many of these IT functions have been taken over by virtualization and cloud computing nowadays. But the same principles remain: keep a backup system ready and available at all times.
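In software, the same failover pattern can be sketched as "try the primary, and if it fails, hand the work to the backup." The example below is only an illustration of that idea, not how any particular switch works; `primary_card` and `backup_card` are hypothetical stand-ins for real units.

```python
def call_with_failover(units, request):
    """Try each unit in order and return the first successful result."""
    errors = []
    for unit in units:
        try:
            return unit(request)
        except Exception as exc:  # a failed unit; move on to the next one
            errors.append(exc)
    raise RuntimeError(f"all {len(units)} units failed: {errors}")


def primary_card(request):
    # Hypothetical primary that has failed.
    raise ConnectionError("primary card is down")


def backup_card(request):
    # Hypothetical backup that is healthy.
    return f"forwarded {request} via backup"


print(call_with_failover([primary_card, backup_card], "packet-1"))
# forwarded packet-1 via backup
```

Note that if every unit in the list fails, the caller still gets an error. Failover only helps while at least one backup is working.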

Now Longbottom expands the scope to the entire room. What if power is lost to an entire data center area? There was a time in the 1990s when the hosting industry was expanding exponentially, until the market became oversaturated and companies like Exodus Communications went bankrupt. Before that happened, huge data centers were being built across the world, each with incredible power redundancy. Picture two huge diesel backup generators, each the size of a railway locomotive. When the power went out to part or all of the data center, one of these would automatically take over. If the first backup generator failed, the second one came online.

For the sake of brevity, we’ll summarize all the levels of Longbottom’s redundancy scheme. He discusses:

  • single element failure
  • assembly failure
  • room failure
  • building failure (e.g., fire or flood)
  • site failure (e.g., local power failure)
  • city failure (e.g., major storm)
  • regional failure (e.g., earthquake or tsunami)
  • country failure (e.g., civil war or epidemic outbreak)

 

It pays to be prepared. Redundancy is an integral part of any disaster plan. The subject is too broad to discuss here in depth.

Factors to Consider

There are some myths about redundancy that should be taken into account. Some people think that redundancy will solve all availability problems. But as we saw in the case of the British Airways flight, even the best plans don’t cover all eventualities. Another misconception is that redundancy is just too expensive. But maybe you should think of it as an insurance policy. You may spend more than you want for health, auto, or home insurance — but what about the devastating costs when something goes wrong? The alternatives to redundancy can be unimaginable.
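A little arithmetic shows why redundancy helps so much, and also why it can't solve everything. The sketch below assumes that units fail independently of each other, and it models a shared hazard (like the volcanic ash) as a single probability that knocks out every unit at once. Both assumptions are simplifications, and the numbers are invented for illustration.

```python
def parallel_availability(unit_availability, units):
    """Availability of a group that works if at least one unit works.

    Assumes the units fail independently of each other.
    """
    return 1 - (1 - unit_availability) ** units


def with_common_cause(unit_availability, units, common_cause_probability):
    """Same as above, but a shared event can take out all units together."""
    return (1 - common_cause_probability) * parallel_availability(
        unit_availability, units
    )


# Illustrative numbers only: each unit is up 99% of the time.
for n in (1, 2, 4):
    print(n, parallel_availability(0.99, n), with_common_cause(0.99, n, 0.001))
```

Under these assumptions, adding units pushes the first figure ever closer to 1, but the second can never rise above 0.999, no matter how many backups you install. The shared hazard sets the ceiling.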


Another thing to consider is your level of acceptable risk. Can you really afford for your internet or IT system to go down for an entire afternoon? Maybe you can. If you’re a freelance consultant with a flexible schedule, maybe you will look at the system failure as a godsend, shut down the office, and go out for a round of golf. But if you have critical systems that determine your financial viability and business reputation, you would be a fool to neglect the implementation of redundancy.

Managing Risk

What’s the worst that could happen if your system went down? Have you calculated it? Can you put a dollar figure on the cost of a potential outage? That’s what redundancy is about. If you can’t live without it, then you must prepare. If that means forking over money for a backup system, securing offsite storage or processing capabilities, or installing a secondary connection — well, you’ll just have to bite the bullet and do it.
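Putting a dollar figure on an outage can start with a back-of-the-envelope estimate. The sketch below is one simple way to frame it; the function names and every number in the example are hypothetical, and a real estimate would need your own figures for outage frequency, duration, and hourly cost.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Rough yearly cost of downtime: frequency x duration x hourly cost."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(cost_without, cost_with, annual_redundancy_cost):
    """True if the downtime cost avoided exceeds what the redundancy costs."""
    return cost_without - cost_with > annual_redundancy_cost


# Hypothetical figures, for illustration only.
without_backup = expected_annual_outage_cost(2, 4, 5000)    # 40000
with_backup = expected_annual_outage_cost(0.25, 4, 5000)    # 5000.0

print(redundancy_pays_off(without_backup, with_backup, annual_redundancy_cost=12000))
# True
```

This leaves out harder-to-measure losses such as damage to your reputation, so treat the result as a floor rather than a full accounting.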

Implementing redundancy is just one of a wide array of risk management tools for businesses that depend on IT services. Better to have too much than not enough. Remember, things break down. And don’t forget about proactive maintenance, change control, security management... The list goes on.

A Word About Backup

Sometimes it’s not the physical device or the software package that needs a redundant resource. Think about the data itself. What happens if you lose your data? A network engineer in England found out a few years ago. While working with the hands-on tech who was at the remote site changing a module in a telecommunications switch, he forgot one simple step in the replacement process: back up.

This topic is another large area of study for IT professionals. What data needs to be backed up? How often? What are the best practices for data backup? When should your data be backed up, and how can it be restored? What’s the difference between partial, incremental, and full backups? What hardware and software are involved? We’ll leave you with those questions.

Conclusion

There are plenty of times in life when too much is just too much. You might like the meal, but going back for seconds is excessive. You love your dog, but you don’t want a second one. You can get by with one car. Sometimes one is enough. But technical environments depend on superfluous components and systems. In IT, sometimes too much is just right.
