EPO: Emergency Power Off or Extremely Probable Outage?
In 1959 a fire at the Pentagon caused seven million dollars' worth of damage and took out three mainframe computers. In today's currency that would be about $58.1 million. The National Fire Protection Association (NFPA) was then tasked with creating rules and regulations to manage fire risks in computer environments, and it came up with the NFPA 75 standard.
The standard requires data centers that follow it to be able to disconnect power to the electronic equipment. It also requires a separate means of disconnecting power to the dedicated Heating, Ventilation, and Air Conditioning (HVAC) systems. These systems must be grouped, identified and readily accessible. Modern Uninterruptible Power Supply (UPS) systems have since become part of the standard as well, and they too must be capable of being shut down.
However, nothing in the standard says there should be a single button that accomplishes all these tasks. There is also a fundamental disconnect among the three parties responsible: the NFPA, which created the standard; the local building and fire code makers; and the inspectors who enforce the local regulations.
Data centers are only required to follow the local building and fire codes. NFPA 75 tells owners and designers what is expected, and it reassures clients when they are told that a data center is NFPA 75 compliant. Nothing prevents a data center from being better than NFPA 75, any more than it prevents one from being worse, or even non-compliant. Local authorities are precisely that, authorities, and they get to set the standards for their communities.
The NFPA also publishes NFPA 70, the National Electrical Code, which is legally binding wherever it has been adopted. However, it does not specify an EPO button. If you have no electrical cables under the raised floor, your IT equipment is secured to the floor, and you are not claiming to follow NFPA 75, you are not legally required to have an EPO button. Problem solved, right?
Not really. Fire chiefs and building inspectors tend to treat NFPA 75 as if it were the actual law. You may be completely within your rights not to have an EPO button, but without their approval you cannot open your data center. Changing their minds can be difficult, so be prepared to negotiate rather than argue.
Society Has Changed
We have become incredibly dependent on our data centers. Information, and access to it, has insinuated itself into every aspect of our daily lives. That convenience has also made us vulnerable in new ways.
Consider the 2007 incident in California in which an annoyed technician pressed the single EPO button in the data center that controlled the state's electrical grid. This fit of pique occurred at approximately 9:00 PM on a Sunday, and it was fixed reasonably quickly. Had it happened during heavy usage, however, the cascade effect could conceivably have taken down the entire West Coast power grid of the United States.
What Should Change
The standard says that "systems must be grouped, identified and readily accessible." It does not say that there must be a Big Red Button that can be leaned against, banged with a mop handle, mistaken for a button with some other function, or operated maliciously with little effort.
Such buttons should be under obvious video surveillance. They should be housed inside well-marked enclosures that identify their function, and they should sound a loud alarm when someone starts to open one. If someone does open an enclosure, activation should require a multi-step process that gives an angry employee enough time to think about the consequences.
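NFPA 75 does not prescribe how such a multi-step process should work. As one illustration, the following minimal Python sketch models a hypothetical EPO controller as a small state machine. Opening the enclosure raises an alarm, and pressing the button does nothing without a separate confirmation. Power is disconnected only after a cancel window expires. The names `EpoController` and `State`, the states themselves, and the default 30-second delay are assumptions made for this example. They are not requirements from any standard or product.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class State(Enum):
    IDLE = auto()        # enclosure closed, no alarm
    COVER_OPEN = auto()  # enclosure opened, local alarm sounding
    ARMED = auto()       # activation confirmed, countdown running, still cancellable
    TRIPPED = auto()     # power disconnected


@dataclass
class EpoController:
    """Hypothetical multi-step EPO: open cover -> press and confirm -> wait -> trip."""

    delay_s: float = 30.0  # assumed cancel window; not taken from NFPA 75
    state: State = State.IDLE
    armed_at: Optional[float] = None
    log: list[str] = field(default_factory=list)

    def open_cover(self, now: float) -> None:
        # Step 1: opening the enclosure only sounds the alarm.
        if self.state is State.IDLE:
            self.state = State.COVER_OPEN
            self.log.append(f"{now:.0f}s alarm: EPO enclosure opened")

    def close_cover(self, now: float) -> None:
        # Closing the enclosure before arming clears the alarm and resets.
        if self.state is State.COVER_OPEN:
            self.state = State.IDLE
            self.log.append(f"{now:.0f}s enclosure closed, alarm cleared")

    def press_and_confirm(self, now: float, confirmed: bool) -> None:
        # Step 2: a press without a separate confirmation is ignored.
        if self.state is State.COVER_OPEN and confirmed:
            self.state = State.ARMED
            self.armed_at = now
            self.log.append(f"{now:.0f}s EPO armed, trip in {self.delay_s:.0f}s unless cancelled")

    def cancel(self, now: float) -> None:
        # During the countdown anyone can abort; the alarm keeps sounding.
        if self.state is State.ARMED:
            self.state = State.COVER_OPEN
            self.armed_at = None
            self.log.append(f"{now:.0f}s EPO cancelled")

    def tick(self, now: float) -> bool:
        """Step 3: advance time. Returns True once power has been disconnected."""
        if self.state is State.ARMED and self.armed_at is not None:
            if now - self.armed_at >= self.delay_s:
                self.state = State.TRIPPED
                self.log.append(f"{now:.0f}s EPO tripped, power disconnected")
        return self.state is State.TRIPPED
```

The current time is passed in explicitly as `now` rather than read from the system clock. This keeps the sequence easy to test and lets the log serve as an audit trail alongside the video footage.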
Depending on the jurisdiction and the severity of the incident, a person who does this can be charged with terrorism or sabotage. The California utility employee mentioned above faced up to five years in prison and a $250,000 fine.
If that sounds severe, remember that it is theoretically possible to turn off every traffic light in New York City at the same time. That could cause hundreds of deaths and paralyze the city completely. Every large city depends on its electrical grid. Knocking the grid out could kill hundreds of people on life support at home, trap thousands in elevators, cut off air conditioning during a heat wave and endanger seniors and babies, or disable heating for millions in the middle of winter.
If you meet the requirements to do away with an existing EPO button, it may be worth considering. The Uptime Institute would easily rate the five-year availability of Tier IV data centers (the highest level of redundancy) at 100% were it not for the two biggest causes of downtime: fires and accidental EPO activation.
Modifications to the standard currently under consideration include zoned EPO buttons. Instead of shutting down an entire facility, these would shut down specific areas only. Servers elsewhere could then pick up the slack, and far fewer lives, less property, and fewer resources would be put at risk.
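To make the zoning idea concrete, here is a minimal Python sketch. It assumes a hypothetical facility divided into named zones, each feeding its own set of circuits. The `Zone` and `ZonedEpo` names and the circuit IDs are illustrative only and do not come from any proposed standard text. The point is that tripping one zone reports exactly which circuits lost power and leaves the other zones live to absorb the workload.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    circuits: set[str]     # hypothetical IDs of the circuits fed in this zone
    energized: bool = True


class ZonedEpo:
    """Hypothetical zoned EPO: each zone's button de-energizes only that zone."""

    def __init__(self, zones: list[Zone]) -> None:
        self.zones = {z.name: z for z in zones}

    def trip(self, zone_name: str) -> set[str]:
        """Disconnect one zone and return the circuit IDs that lost power."""
        zone = self.zones[zone_name]
        zone.energized = False
        return set(zone.circuits)

    def live_zones(self) -> list[str]:
        # Zones that are still powered can take over workloads from the tripped one.
        return sorted(name for name, z in self.zones.items() if z.energized)
```

With zones `A`, `B`, and `C`, calling `trip("B")` disconnects only B's circuits, and `live_zones()` then returns `["A", "C"]`.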
One thing we know for certain: whether accidental or malicious, an EPO activation will be very expensive in downtime, recovery time, overtime, and damage to your reputation. There are alternatives; let's use them.