We end the year with a major CenturyLink outage that began early Thursday morning for many customers and spread across the country, according to reports on Reddit, GeekWire and Newsweek, affecting Internet, 911 services and other network-dependent services such as transport waves and VoIP. The outage also affected other providers who lease long-haul connectivity from CenturyLink, such as TATA Communications and GTT, to name a couple.
CenturyLink was strangely quiet on the matter: their Twitter feed did not show the first acknowledgement of the issue until roughly 10 hours in. At 18:28 UTC they posted the first entry below to some customers who were able to communicate with their NOC:
On December 27, 2018 at 02:40 GMT: CenturyLink identified a service impact in New Orleans, LA. The NOC is engaged and investigating in order to isolate the cause. Field Operations were engaged and dispatched for additional investigations. Tier IV Equipment Vendor Support was later engaged. During cooperative troubleshooting a device in San Antonio, TX was isolated from the network as it was seeming to broadcast traffic consuming capacity, which seemed to alleviate some impact. Investigations remained ongoing. Following the isolation of the San Antonio, TX device troubleshooting efforts focused on additional sites that teams were remotely unable to troubleshoot. Field Operations were dispatched to sites in Kansas City, MO, Atlanta, GA, New Orleans, LA and Chicago, IL. Tier IV Equipment Vendor Support continued to investigate the equipment logs to further assist with isolation. Once visibility was restored to the site in Kansas City, MO a filter was applied to the equipment to further alleviate the impact observed. All of the necessary troubleshooting teams in cooperation with Tier IV Equipment Vendor Support are working to restore remote visibility to the remaining sites at this time. We understand how important these services are to our clients and the issue has been escalated to the highest levels within CenturyLink Service Assurance Leadership.
Update at 19:16 UTC: Efforts to regain visibility to sites in Atlanta, GA and Chicago, IL remain ongoing. Once visibility has been restored the filter will be applied to limit communication traffic between sites which was causing CPU spikes that in turn prevented the devices from functioning properly.
Update at 20:21 UTC: Tier IV Equipment Vendor Technical Support continues to work with CenturyLink Field Operations and Engineering to restore visibility and apply the filter to devices in Atlanta, GA and Chicago, IL. While those efforts are ongoing additional logs have been pulled from the devices in Kansas City, MO and New Orleans, LA following the restoral of visibility and the necessary filter application to obtain additional pertinent information now that the device is remotely accessible.
Update at 04:47 UTC Friday via Twitter: CenturyLink engineers have identified a network element that was impacting customer services and are addressing the issue in order to fully restore services. We estimate services will be fully restored within 4 hours. We apologize for any inconvenience this caused our customers.
Update at 13:15 UTC Friday via Twitter: We discovered some additional technical problems as our service restoration efforts were underway. We continue to make good progress with our recovery efforts and we are working tirelessly until restoration is complete. We apologize for the disruption.
Update at 21:04 UTC – Total Uptime's upstream providers that lease capacity from CenturyLink confirmed they saw the final circuits restored. The CenturyLink RFO stated:
Root Cause: A Century Link network management card in Denver, CO was propagating invalid frame packets across devices. Fix Action: To restore services the card in Denver was removed from the equipment, secondary communication channel tunnels between specific devices were removed across the network, and a polling filter was applied to adjust the way the packets were received in the equipment. As repair actions were underway, it became apparent that additional restoration steps were required for certain nodes, which included either line card resets or Field Operations dispatches for local equipment login. Once completed, all services restored.
RFO Summary: On December 27, 2018 at 08:40 GMT, provider Century Link identified an initial service impact in New Orleans, LA. The NOC was engaged to investigate the cause, and Field Operations were dispatched for assistance onsite. Tier IV Equipment Vendor Support was engaged as it was determined that the issue was larger than a single site. During cooperative troubleshooting between the Equipment Vendor and Century Link, a decision was made to isolate a device in San Antonio, TX from the network as it seemed to be broadcasting traffic and consuming capacity. This action did alleviate impact; however, investigations remained ongoing. Focus shifted to additional sites where network teams were unable to remotely troubleshoot equipment. Field Operations were dispatched to sites in Kansas City, MO, Atlanta, GA, New Orleans, LA and Chicago, IL for onsite support. As visibility to equipment was regained, Tier IV Equipment Vendor Support evaluated the logs to further assist with isolation. Additionally, a polling filter was applied to the equipment in Kansas City, MO and New Orleans, LA to prevent any additional effects. All necessary troubleshooting teams, in cooperation with Tier IV Equipment Vendor Support, were working to restore remote visibility to the remaining sites. The issue had Century Link Executive level awareness for the duration. A plan was formed to remove secondary communication channels between select network devices until visibility could be restored, which was undertaken by the Tier IV Equipment Vendor Technical Support team in conjunction with provider Field Operations and NOC engineers. While that effort continued, investigations into the logs, including packet captures, was occurring in tandem, which ultimately identified a suspected card issue in Denver, CO. Field Operations were dispatched to remove the card. 
Once removed, it did not appear there had been significant improvement; however, the logs were further scrutinized by the Vendor’s Advanced Support team and Network Operations to identify that the source packet did originate from this card. Provider Tier III Technical Support shifted focus to the application of strategic polling filters along with the continued efforts to remove the secondary communication channels between select nodes. Services began incrementally restoring. An estimated restoral time of 09:00 GMT was provided; however, as repair efforts steadily progressed, additional steps were identified for certain nodes that impeded the restoration process. This included either line card resets or Field Operations dispatches for local equipment login. Various repair teams worked in tandem on these actions to ensure that services were restored in the most expeditious method available. By 2:30 GMT on December 29, it was confirmed that the impacted IP, Voice, and Ethernet Access services were once again operational. Point-to-point Transport Waves as well as Ethernet Private Lines were still experiencing issues as multiple Optical Carrier Groups (OCG) were still out of service. The Transport NOC continued to work with the Tier IV Equipment Vendor Support and provider Field Operations to replace additional line cards to resolve the OCG issues. Several cards had to be ordered from the nearest sparing depot. Once the remaining cards were replaced it was confirmed that all services except a very small set of circuits had restored, and the Transport NOC will continue to troubleshoot the remaining impacted services under a separate Network Event. Services were confirmed restored on December 30, 2018 at 14:43 GMT
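The "polling filter" the RFO describes is a way of rejecting malformed management-plane frames before they consume device CPU. As a conceptual illustration only (the validity checks, sizes and names here are our own assumptions, not CenturyLink's or the vendor's actual logic), a filter of this kind boils down to a validity predicate applied to each received frame:

```python
# Conceptual sketch of a frame-validity filter, loosely analogous to the
# "polling filter" in the RFO. The length thresholds below are the standard
# Ethernet minimum/maximum frame sizes; the check itself is illustrative.

def is_valid_frame(frame: bytes, max_len: int = 1518) -> bool:
    """Reject frames that are runts (too short) or giants (too long)."""
    return 64 <= len(frame) <= max_len

def apply_filter(frames):
    """Split incoming frames into accepted frames and a drop count."""
    accepted, dropped = [], 0
    for f in frames:
        if is_valid_frame(f):
            accepted.append(f)
        else:
            dropped += 1
    return accepted, dropped

# One normal frame, one giant, one runt:
frames = [b"\x00" * 100, b"\x00" * 9000, b"\x00" * 10]
ok, dropped = apply_filter(frames)
```

The point of such a filter is that invalid frames are discarded cheaply at ingress instead of being processed, which is why applying it alleviated the CPU spikes described in the earlier updates.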
Various posters on Twitter and Reddit who were able to get through to CenturyLink reported that reps were unable to access internal systems, including ticketing, some telecom systems and Skype. At Total Uptime, we noticed the largest impacts on the west coast, in Santa Clara and Seattle, starting at 12:53 AM Pacific time. Automation immediately rerouted affected traffic via alternate providers to mitigate any impact for our customers.
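In essence, automated rerouting of this kind is a health check plus a failover decision. A minimal sketch, assuming hypothetical provider names and a simplified "first healthy provider wins" policy (this is not our actual routing logic):

```python
# Hypothetical sketch of health-check-driven failover across upstream
# providers. Provider names, traffic groups and the selection policy
# are illustrative assumptions, not a real implementation.

from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    healthy: bool = True

def reroute(traffic_groups, providers):
    """Assign each traffic group to the first healthy upstream provider."""
    healthy = [p for p in providers if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy upstream providers available")
    return {group: healthy[0].name for group in traffic_groups}

providers = [Provider("centurylink"), Provider("gtt"), Provider("tata")]
providers[0].healthy = False  # outage detected on the primary provider
routes = reroute(["west-coast", "east-coast"], providers)
```

A production system would weigh latency, capacity and cost rather than simply taking the first healthy provider, but the core idea is the same: continuously probe each upstream path and shift traffic away from any path that fails its checks.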
This is another confirmation of what we wrote previously in 4 Cloud Gotchas to Avoid: moving to the cloud does not, by itself, increase availability, especially when you notice CenturyLink's interesting status page (original screenshots below) confirming that the outage was affecting 15 of their 17 External Cloud Network product locations. Wow.
At Total Uptime we lease network capacity from most of the world's largest providers, giving us the unique ability to pick and choose routes for our customer traffic that deliver the best availability and performance at any moment. We think remaining data-center and provider agnostic is definitely the way to go, especially as people move more and more to the cloud.
Today it is common for every cloud provider to offer load balancing, firewall, DNS, VPN and more, but putting all of your eggs in one basket – especially the network – is rarely the best choice. At Total Uptime, we will always remain independent of every data center, backbone carrier/ISP and network because availability is at our core. Without uptime, nothing else really matters.
If we can help your organization avoid today's exciting CenturyLink-related events, just contact us.