With GoDaddy’s unfortunate DNS outage on September 10th, we received an enormous number of inquiries about our DNS services. A frequently asked question was whether or not Total Uptime could provide secondary or backup DNS services for disaster recovery. The quick answer is “yes”, we can definitely provide this commonly implemented DNS backup solution, but we thought it would be helpful to post a quick article to answer the question in a little more detail: What is Secondary DNS, and what are the benefits and drawbacks of using it?
Secondary DNS, sometimes referred to as Slave DNS or Backup DNS, is a simple setting that allows other DNS servers (the secondaries) to transfer copies of the entire zone file from the Master DNS server (sometimes called the Primary). Typically you must allow the IP address of each secondary server at the primary DNS server before transfers are permitted, and you can also implement DNS TSIG to authenticate that the zone data received really came from the primary DNS server. For the purposes of this article, however, we won’t cover zone transfer security. Secondary DNS has been part of DNS since the original RFCs were published in the 1980s, is a basic feature, and is still widely used today.
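To make the mechanism concrete, here is a minimal Python sketch of how a secondary asks a primary for a full zone transfer: an AXFR query (QTYPE 252) sent over TCP with a two-byte length prefix. The helper names (`build_axfr_query`, `first_axfr_response`, `rcode`), the query ID and the example zone are hypothetical illustrations, not part of any Total Uptime API. A real secondary would go on to read every message of the transfer and, ideally, sign the exchange with TSIG.

```python
import socket
import struct

AXFR = 252      # QTYPE for a full zone transfer
CLASS_IN = 1    # Internet class
REFUSED = 5     # RCODE a primary typically returns when your IP isn't allowed


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) < 64:
            raise ValueError(f"invalid label: {label!r}")
        out.append(len(raw))
        out += raw
    out.append(0)
    return bytes(out)


def build_axfr_query(zone: str, query_id: int = 0x1234) -> bytes:
    """Build an AXFR query framed for TCP (two-byte length prefix)."""
    # Header: ID, flags = 0 (standard query), QDCOUNT = 1, AN/NS/AR counts = 0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    question = encode_name(zone) + struct.pack("!HH", AXFR, CLASS_IN)
    message = header + question
    return struct.pack("!H", len(message)) + message


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a TCP socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return buf


def first_axfr_response(primary_ip: str, zone: str, timeout: float = 5.0) -> bytes:
    """Send an AXFR query to the primary and return its first response message."""
    with socket.create_connection((primary_ip, 53), timeout=timeout) as sock:
        sock.sendall(build_axfr_query(zone))
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def rcode(message: bytes) -> int:
    """Extract the response code from the low 4 bits of the header flags."""
    return message[3] & 0x0F
```

For example, if `rcode(first_axfr_response("192.0.2.1", "example.com"))` returns `REFUSED` (5), the primary has most likely not allowed your secondary’s IP address yet. The address 192.0.2.1 is a documentation placeholder; substitute your own primary.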
The obvious benefit of secondary DNS is that you can increase DNS redundancy by deploying more DNS servers. If you run your own cluster of DNS servers, you could use this method to replicate DNS amongst all of your servers (which was the original intention), but in the managed DNS provider space where we at Total Uptime operate, secondary DNS is not what we use to replicate DNS amongst our nodes.
On the Total Uptime network we have 32 name server clusters around the globe, and we keep them synchronized with our own proprietary method, which is both fast and reliable. Most other service providers probably do the same thing, for good reasons we will share in a moment. Because providers typically don’t need secondary replication internally, the feature is ideally suited to another job: a DNS deployment that spans multiple service providers.
Most domain registrars require at least two name servers and support up to seven, sometimes more. At Total Uptime, we provide a long list of name servers under multiple TLDs, all on our global Anycast network, so you can satisfy the minimum or fill the maximum. Using more than two, however, does not offer any benefits… at least here at Total Uptime. Our architecture is designed so that even a single name server gives you the benefit, redundancy and resiliency of all 32 global nodes. That means you could theoretically use Total Uptime and six other DNS providers to create a massively redundant authoritative DNS implementation, assuming the providers were roughly comparable. One of them would be your primary provider, and the others would all use secondary DNS to replicate your DNS zone(s). In this hypothetical scenario, you would then enter one name server from each of the seven providers (or more, if supported) for your domain at the registrar to complete the deployment.
The obvious benefit here is that all seven DNS service providers would have to be completely offline for your DNS to be down. Or, if you consider DDoS attacks on DNS infrastructure a more likely cause of an outage, all seven providers would have to be attacked to the point of complete saturation for your DNS to go offline.
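A quick way to see why this looks so attractive is to multiply out the odds. The sketch below assumes each provider fails independently and uses a hypothetical availability figure of 99.9%. In practice outages can be correlated (a large DDoS, a shared upstream network), so treat the result as an optimistic bound rather than a prediction.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def all_down_probability(availabilities: list[float]) -> float:
    """Probability that every provider is down at once, assuming independent failures."""
    return prod(1.0 - a for a in availabilities)


def expected_total_outage_minutes(availabilities: list[float]) -> float:
    """Expected minutes per year during which no provider answers."""
    return all_down_probability(availabilities) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical: every provider is up 99.9% of the time.
    for n in (1, 2, 7):
        providers = [0.999] * n
        print(f"{n} provider(s): P(all down) = {all_down_probability(providers):.1e}, "
              f"~{expected_total_outage_minutes(providers):.4f} min/year")
```

Under these assumptions one provider leaves about 525.6 minutes (roughly 8.8 hours) of expected downtime per year, two independent providers cut that to about half a minute, and seven make it negligible. What this arithmetic leaves out is the subject of the drawbacks below.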
“Wow!” you might say. So if service provider fees are no object and resiliency is the primary goal, that would sound like the perfect solution, right? So why don’t more organizations use secondary DNS to the extreme? The answer: With every good thing comes at least one drawback.
There are a few drawbacks which should be discussed before considering Secondary DNS.
You now have more DNS servers to worry about! True, you’ve increased redundancy significantly if you really use seven different providers, but there are now more DNS servers to keep an eye on. As we touched on briefly near the end of our Anycast article, resolvers choose among a domain’s name servers somewhat randomly, so if one of them is down, users may experience DNS time-out delays. If one of your seven DNS providers has an issue, roughly 1/7th of your DNS lookups could see an extra delay of 5 seconds or so when pulling up your website. Naturally, this is undesirable. You can mitigate it by removing the troubled provider at the registrar, but now you’re in the business of monitoring all of your DNS providers to ensure they are available, so you can quickly log into your registrar and make changes. Is the extra redundancy worth the potential resolution delays and the extra monitoring?
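The 1/7th figure follows from a simple model. The sketch below assumes a resolver picks uniformly at random among your delegated name servers on a fresh lookup, waits a fixed timeout (a hypothetical 5 seconds) if it hits a dead one, then retries a healthy one. Real resolvers are smarter (many track server response times and steer away from a server after it times out), so read this as a rough first-lookup estimate. The function names are illustrative.

```python
import random


def timeout_exposure(total_servers: int, down_servers: int,
                     timeout_s: float) -> tuple[float, float]:
    """Return (share of fresh lookups that hit a dead server first, mean added delay in seconds)."""
    share = down_servers / total_servers
    return share, share * timeout_s


def simulate(total_servers: int, down_servers: int, timeout_s: float,
             lookups: int = 100_000, seed: int = 1) -> tuple[float, float]:
    """Monte Carlo check of the same model; servers 0..down_servers-1 are the dead ones."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(lookups) if rng.randrange(total_servers) < down_servers)
    share = hits / lookups
    return share, share * timeout_s


if __name__ == "__main__":
    # Seven providers, one name server each, one provider unreachable.
    print(timeout_exposure(7, 1, 5.0))
    print(simulate(7, 1, 5.0))
```

With one of seven providers down, about 14% of fresh lookups stall for the full timeout. That averages out to roughly 0.7 seconds per lookup, but it feels like 5 seconds to the unlucky visitors.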
Refresh Interval can delay update propagation. When a change is made to your DNS zone at the primary server, the primary should be configured to send a NOTIFY message (a DNS opcode defined in RFC 1996) to the secondary DNS servers. NOTIFY tells the secondaries that the zone has changed and that they should check the primary’s SOA serial number; if it is higher than their own, they request a zone transfer right away. But this doesn’t always work properly. Sometimes the primary has NOTIFY disabled, sometimes it is an older server that predates NOTIFY, and sometimes the messages simply aren’t received by the secondary servers for a variety of reasons, such as an aggressive firewall. This creates a problem: until the secondary servers transfer the zone from the primary, they serve outdated information, which can send traffic to outdated IP addresses.
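The check a secondary performs after a NOTIFY (or at each refresh) is not a plain integer comparison: SOA serials are 32-bit numbers compared with the wrap-around serial number arithmetic of RFC 1982. Here is a minimal Python sketch of that comparison; the function names and example serials are illustrative.

```python
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)
MODULO = 2 ** SERIAL_BITS


def serial_newer(s1: int, s2: int) -> bool:
    """True if serial s1 is 'greater than' s2 under RFC 1982 arithmetic."""
    return (s1 < s2 and s2 - s1 > HALF) or (s1 > s2 and s1 - s2 < HALF)


def needs_transfer(primary_serial: int, secondary_serial: int) -> bool:
    """A secondary requests a transfer only if the primary's serial is newer than its own."""
    return serial_newer(primary_serial, secondary_serial)


def bump(serial: int, step: int = 1) -> int:
    """Increment a serial, wrapping at 2**32 (RFC 1982 allows steps up to 2**31 - 1)."""
    if not 1 <= step < HALF:
        raise ValueError("step must be between 1 and 2**31 - 1")
    return (serial + step) % MODULO
```

Note the consequence: if someone edits the zone on the primary but forgets to raise the serial, `needs_transfer` returns False and the secondaries keep serving the old data, NOTIFY or not.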
There is a fail-safe for missed NOTIFY messages called the Refresh Interval, part of the original Secondary DNS architecture. It is a value in the zone’s SOA record that tells all secondary DNS servers to check the primary’s SOA serial every x seconds and transfer an updated copy of the zone if it has changed. A value we often see recommended for this setting is 24 hours (RIPE recommends it). This means that if a NOTIFY message is missed, the zone will be updated within 24 hours (or sooner, depending on where the secondary is in its cycle). Naturally, this is not ideal if you’ve just changed the IP address of your website! You can work around this issue too by making the Refresh Interval significantly shorter, such as 1 minute. But that creates a lot of unnecessary DNS traffic and, more critically, extra CPU load on the master, just to solve a problem that should not have happened in the first place.
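The trade-off is easy to quantify. In the sketch below (hypothetical numbers: six secondaries and 1,000 zones), each secondary sends the primary one SOA check per zone per Refresh Interval, and a missed NOTIFY can leave a secondary stale for up to one full interval. It ignores the Retry timer and the transfers themselves, so the real load is somewhat higher.

```python
SECONDS_PER_DAY = 86_400


def refresh_tradeoff(refresh_s: int, secondaries: int, zones: int) -> tuple[int, float]:
    """Return (worst-case staleness after a missed NOTIFY in seconds,
    SOA checks per day arriving at the primary)."""
    if refresh_s <= 0:
        raise ValueError("refresh interval must be positive")
    checks_per_day = SECONDS_PER_DAY / refresh_s * secondaries * zones
    return refresh_s, checks_per_day


if __name__ == "__main__":
    for refresh in (86_400, 3_600, 60):  # 24 hours, 1 hour, 1 minute
        stale, checks = refresh_tradeoff(refresh, secondaries=6, zones=1_000)
        print(f"refresh={refresh:>6}s  worst-case staleness={stale / 60:>6.0f} min  "
              f"SOA checks/day={checks:,.0f}")
```

Going from 24 hours to 1 minute shrinks the worst-case window from 1,440 minutes to 1, but multiplies the polling load 1,440-fold, from 6,000 to 8,640,000 SOA checks a day in this example.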
Special features don’t tolerate the delay in updating secondary servers. Unique solutions like DNS Failover, which we offer here at Total Uptime, do not perform as they should when DNS changes don’t propagate across the Internet quickly. Even a 5-minute delay in updating an IP address on all of your DNS servers after a web server outage could cost you valuable time and money. Worse, features like GEO DNS, which rely on unique technology at each DNS node, don’t work at all with secondary DNS. Zone transfers were designed decades ago to copy a zone’s records, and they simply can’t carry the logic behind custom DNS enhancements like ours.
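Why GEO DNS can’t survive a zone transfer becomes clearer with a toy model. In the hypothetical sketch below, the primary provider picks an answer based on the client’s region, but a zone transfer can only hand a secondary static records, here assumed to be a single default A record. The addresses come from documentation ranges, and the region names and export behavior are illustrative assumptions; real providers differ in what, if anything, they export.

```python
# Hypothetical GEO DNS policy on the primary: the answer depends on the client's region.
GEO_POLICY = {"eu": "192.0.2.10", "us": "198.51.100.10"}
DEFAULT_A = "198.51.100.10"
NAME = "www.example.com."


def primary_answer(region: str) -> str:
    """The primary evaluates its policy for each query."""
    return GEO_POLICY.get(region, DEFAULT_A)


def zone_transfer() -> dict[str, str]:
    """A transfer copies plain records only; the per-region policy has no record form.
    Assumption: the provider exports the default answer as an ordinary A record."""
    return {NAME: DEFAULT_A}


def secondary_answer(zone: dict[str, str], region: str) -> str:
    """The secondary can only replay the static record, whatever the client's region."""
    return zone[NAME]


if __name__ == "__main__":
    copy = zone_transfer()
    for region in ("eu", "us"):
        print(region, primary_answer(region), secondary_answer(copy, region))
```

In this model, European visitors whose resolvers happen to query the secondary get the US address, so the geographic routing silently stops working for part of your traffic.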
At Total Uptime, we have a 60-second propagation SLA across our entire global network. This means that if you (or the DNS Failover service) make a change in our GUI, it will be updated on all of our servers around the world in a minute or less. Of course, we (like any other provider) can’t extend that guarantee beyond our own network, so secondary DNS servers run by other providers are excluded. This is probably one serious reason for choosing one excellent provider instead of seven mediocre ones!
Our answer to the entire Secondary DNS question is to choose one solid provider whose core business is running and protecting a serious global DNS network, like Total Uptime. All we provide are Cloud availability solutions, so you can rest assured that you’re in good hands. If you are still uncomfortable trusting a single provider and don’t need special features like DNS Failover or GEO DNS, choose two whose core business is DNS. That way, you have fewer providers to manage and worry about, and the likelihood that both providers experience an outage at the same time is extremely low.