Failover Design: DNS Failover vs. Cloud Failover

When implementing a failover solution, the most common questions we receive are:

  • How can we architect a failover solution for our application?
  • How quickly can we fail over from the primary site to the secondary site?
  • How quickly can it fail back when the primary comes back online?
  • Is there a way to prevent automatic failback when we have database synchronization to do?

To answer these questions, we need to better define the two failover solutions that Total Uptime offers so you can determine which may best fit your application or desired handling of failures.

1. The DNS Failover Solution

As the name suggests, DNS Failover operates at the DNS level, that is, before a client connects to any of your servers. DNS converts your domain name (e.g. www.example.com) into the IP address of your server(s). By monitoring applications and altering DNS dynamically so clients are pointed to different IP addresses, you can control traffic fairly easily and inexpensively. However, DNS Failover does have two notable limitations:

  • DNS Failover does not fix an outage for a client that is already connected to an application, because the client's browser may not query DNS again for quite some time.
  • Because of TTL caching, an IP address change can take anywhere from 1 to 30 minutes or more to become visible around the world. Many ISPs’ recursive DNS servers cache records longer than the TTL specifies in order to reduce traffic.
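To get a feel for the second limitation, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function name, the check interval, the failure threshold, and the extra time a resolver holds a record past its TTL are all hypothetical values, not settings from our platform.

```python
def worst_case_switchover_seconds(check_interval, failures_to_trigger,
                                  ttl, resolver_overhold=0):
    """Rough upper bound on how long a client may keep resolving the old IP.

    All parameters are illustrative assumptions, in seconds:
      check_interval      - time between health checks
      failures_to_trigger - consecutive failed checks before DNS is changed
      ttl                 - TTL on the DNS record
      resolver_overhold   - extra time a resolver caches beyond the TTL
    """
    detection = check_interval * failures_to_trigger
    caching = ttl + resolver_overhold
    return detection + caching


# A 5-minute TTL honored exactly by the resolver:
print(worst_case_switchover_seconds(30, 3, 300))                          # 390
# The same record held by a resolver for 25 minutes longer than the TTL:
print(worst_case_switchover_seconds(30, 3, 300, resolver_overhold=1500))  # 1890
```

The point of the arithmetic is simply that lowering the TTL helps only as far as recursive resolvers honor it.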

DNS Failover has been around for quite some time and is reliable. Our cloud portal allows you to specify any number of IP addresses as primary, secondary, tertiary, etc., whereby the DNS system will only provide the IP for the working server that has the highest priority (sequential mode). Alternatively, you can have it provide the IP addresses of all of the working servers (round robin mode) and spread traffic amongst them. In round robin mode, it simply removes the server(s) that stopped working, leaving the rest alone.
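The difference between the two modes can be sketched in a few lines of Python. This is a simplified model for illustration, not our actual implementation: the Server class, the dns_answer function, and the example IP addresses (taken from the documentation address ranges) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    ip: str
    priority: int   # 1 = primary, 2 = secondary, 3 = tertiary, ...
    healthy: bool   # result of the most recent monitoring check


def dns_answer(servers, mode):
    """Return the list of IPs a DNS response would contain in each mode."""
    working = sorted((s for s in servers if s.healthy),
                     key=lambda s: s.priority)
    if not working:
        return []
    if mode == "sequential":
        # Only the highest-priority working server is handed out.
        return [working[0].ip]
    if mode == "round_robin":
        # Every working server is handed out; failed ones are left out.
        return [s.ip for s in working]
    raise ValueError(f"unknown mode: {mode}")


servers = [
    Server("192.0.2.10", 1, healthy=False),    # primary is down
    Server("198.51.100.10", 2, healthy=True),
    Server("203.0.113.10", 3, healthy=True),
]
print(dns_answer(servers, "sequential"))   # ['198.51.100.10']
print(dns_answer(servers, "round_robin"))  # ['198.51.100.10', '203.0.113.10']
```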

Round robin mode generally only works for farms of static content where back-end synchronization does not need to take place or where sophisticated bi-directional back-end synchronization has been implemented. As a result, sequential mode is the most popular for e-commerce applications or where back-end databases exist and some type of synchronization step is required prior to failing back to the primary.

In sequential mode, the cloud network monitors all of the available servers based on the monitoring criteria you specify via the management GUI. When it detects that the primary is down, it automatically fails over to the secondary IP (or, if the secondary is also down, to the tertiary, and so on). In a typical static content scenario, when the primary comes back, it updates DNS again to send traffic to the primary.

However, in an application where back-end databases must be synchronized, you can easily disable auto-failback in order to prevent the primary server from receiving traffic again. The monitoring system will still alert you that it is back online, but you have to manually force it to start receiving traffic again. This allows you to perform whatever tasks are required, such as ensuring that the primary has the latest copy of the database.
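The effect of disabling auto-failback can be modeled as a small piece of state. The Python sketch below is a hypothetical illustration (the SequentialFailover class and its method names are invented for this example). It also assumes that, with auto-failback disabled, the system only ever moves down the priority list on its own, which is one reasonable reading of the behavior described above.

```python
class SequentialFailover:
    """Toy model of sequential mode with optional auto-failback."""

    def __init__(self, ips, auto_failback=True):
        self.ips = list(ips)          # ordered: primary, secondary, ...
        self.auto_failback = auto_failback
        self.active = 0               # index of the server receiving traffic

    def update(self, health):
        """health maps ip -> bool. Returns the IP that should get traffic."""
        healthy = [i for i, ip in enumerate(self.ips) if health.get(ip, False)]
        if self.auto_failback:
            candidates = healthy
        else:
            # Assumption: never move back up the priority list automatically.
            candidates = [i for i in healthy if i >= self.active]
        if not candidates:
            return None               # nothing usable; keep current state
        self.active = candidates[0]
        return self.ips[self.active]

    def manual_failback(self, ip):
        """Operator action once back-end synchronization is complete."""
        self.active = self.ips.index(ip)


primary, secondary = "192.0.2.10", "198.51.100.10"
fo = SequentialFailover([primary, secondary], auto_failback=False)
print(fo.update({primary: False, secondary: True}))  # 198.51.100.10
print(fo.update({primary: True, secondary: True}))   # 198.51.100.10 (held)
fo.manual_failback(primary)
print(fo.update({primary: True, secondary: True}))   # 192.0.2.10
```

With auto_failback=True, the second call would return the primary as soon as it reports healthy, which is the behavior you want for static content but not for a database that still needs to be resynchronized.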

One critical drawback of using DNS Failover where back-end synchronization is critical arises when the primary is down for only a very short period of time. The outage may be long enough to trigger a failover, yet short enough that existing client connections (or new connections from clients whose cached DNS record has not yet expired) still reach the primary site once it recovers. In this case, both the primary and secondary sites could receive traffic and build up divergent databases that are very difficult to synchronize. Or, where the primary always performs a one-way write to the secondary, the primary starts overwriting new transactions on the secondary because the one-way architecture gives it no way of detecting new entries in the secondary database.

2. The Cloud Load Balancing / Cloud Failover solution

Cloud Failover is designed to operate at the network layer, after DNS but before clients connect to the application. It works in a very similar way to DNS Failover in that it monitors active server IP addresses based on the monitoring criteria specified in the cloud interface, but instead of dynamically updating DNS, it acts more like a traditional hardware load balancer in directing traffic.

With Cloud Failover, Total Uptime provides you with an IP address on our cloud network to publish in DNS. The IP address is highly available because it is announced globally on our IP Anycast network. This IP is then used in DNS as the ‘A’, ‘AAAA’, or other record. Because the IP address does not change during a failover, DNS TTL is not a factor, and, even more importantly, the TTL can be increased significantly to increase caching and reduce query count.

When the Cloud IP receives traffic, it uses the load balancing technology to route the packets to the actual server IP address behind the scenes, based on the configuration set. Because it is aggressively monitoring the traffic, in the event of an outage it immediately directs traffic to the secondary, tertiary or other server as configured in the cloud management interface. Because the load balancer is in control of the traffic and no DNS updates are needed to accomplish the switchover, no traffic continues through to the downed server. As a result, even if the downed server comes back online in short order, if automatic failback is disabled, it remains out of rotation, receiving no traffic, until it is manually reconfigured as primary. This architecture ensures the highest level of database integrity. All that needs to take place to bring the primary server back online is back-end synchronization and a manual switch in the cloud management interface.

So when is DNS Failover better than Cloud Failover or Cloud Failover better than DNS Failover?

Well, it depends on your application and how sensitive you are to traffic being sent to the wrong server. Here are a few helpful bullet points for each:

DNS Failover

  • Slightly lower cost than Cloud Failover
  • Ideally suited for static content where server synchronization isn’t required
  • Requires a DNS change to redirect traffic, and for that change to propagate across the Internet
  • Allows for manual or automatic failback
  • Allows for sequential or round-robin mode
  • Allows for multiple levels to handle cascading failures

Cloud Failover

  • Slightly higher cost than DNS Failover
  • Does not require DNS changes to be made or to propagate across the Internet
  • Ideally suited for content where server synchronization is important
  • Allows for manual or automatic failback
  • Allows for sequential or load balanced mode
  • Provides primary and secondary server ‘pools’ (instead of just a single IP)
  • Allows multiple pool members to be weighted for precise traffic distribution
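The pool and weighting ideas in the Cloud Failover list can be sketched as follows. This is a simplified Python model for illustration only: the pick_backend function, the pool layout, the weights, and the example IP addresses are hypothetical, and it assumes every weight is a positive number.

```python
import random


def pick_backend(pools, health, rng=random):
    """Choose a server for one incoming connection.

    pools  - list of pools in priority order (primary pool first); each
             pool is a dict mapping server IP -> positive weight
    health - dict mapping server IP -> bool from the monitoring checks
    rng    - random source (the random module or a random.Random instance)
    """
    for pool in pools:
        live = {ip: w for ip, w in pool.items() if health.get(ip, False)}
        if live:
            ips = list(live)
            weights = [live[ip] for ip in ips]
            # Weighted choice among the healthy members of this pool.
            return rng.choices(ips, weights=weights, k=1)[0]
    return None  # no healthy server in any pool


primary_pool = {"192.0.2.10": 3, "192.0.2.11": 1}   # roughly a 3:1 split
secondary_pool = {"198.51.100.10": 1}
pools = [primary_pool, secondary_pool]

# Entire primary pool down: traffic goes to the secondary pool.
health = {"192.0.2.10": False, "192.0.2.11": False, "198.51.100.10": True}
print(pick_backend(pools, health))  # 198.51.100.10
```

Because the choice is made per connection on the load balancer rather than in DNS, a server that is marked unhealthy stops receiving new traffic as soon as the monitoring result changes.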


There you have it. The choice is yours. Hopefully the descriptions have been helpful in determining which solution might be a better fit. If you have any further questions about either solution or which might be best for your specific application, please contact us! We’re happy to talk about each one in further detail. You can also learn more here about DNS Failover and Cloud Failover.
