An application or service is really only good to a consumer if it is functional and available for use. As much as we may wish they wouldn’t, IT systems break, go offline, and thus become unavailable. With this in mind, we know that if our business and customers cannot tolerate certain amounts of downtime, we have to plan for that and implement systems that can withstand certain levels of adversity. As with anything in IT, knowing the business-level requirements is highly important so that you can design systems that meet or exceed expectations. The collective “we” have been designing and implementing redundant and highly available systems in on-premises data centers for years. The problem of things breaking does not go away in the cloud. There, too, we still need to understand our requirements and design, build, and deploy systems with redundancy, high availability, and disaster recovery in mind. In the rest of this post, I will do my best to give my interpretation of the definitions of these concepts.
If nothing ever broke down or had major issues, then we would not really need to worry about redundancy. Redundancy means that you are taking potential failures into account and removing single points of failure in the environment. The concept is that if one device or component goes down, there is another such device or component ready to carry the load, preventing large, lengthy downtimes and service interruptions. Given my background, I will give a networking example. In traditional Layer 2 campus networks, within a building we may have access layer switches where endpoints attach to the network, and those switches connect to a distribution layer that aggregates the access layer. If business requirements indicate a low tolerance for downtime, and budget allows, we will most likely have two switches in that distribution layer. Each access layer switch will have at least one link to each distribution switch. If there is a link failure between the access and distribution layers, or if one of the distribution switches goes down, the other is still available to pass traffic. Now, just because there is redundancy does not mean there is no downtime. In the traditional Layer 2 network scenario, if a distribution switch goes down, some clients may still experience a service interruption while the Spanning Tree Protocol does its thing and the network re-converges. However, that downtime should be relatively brief and much better than if there were only one distribution switch and clients/devices were down until the switch could be repaired or replaced. If we want to further lessen the impact of device/system failure, we may need to take redundancy a step further and investigate high availability options.
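To make the dual-distribution example a little more concrete, here is a minimal Python sketch. It assumes a made-up topology (one access switch, `access-1`, with an uplink to each of `dist-1` and `dist-2`) and only checks whether any uplink survives a given set of failures. The names and helper functions are hypothetical, and the sketch does not model Spanning Tree re-convergence or the brief interruption that comes with it.

```python
# Hypothetical topology: one access switch with an uplink to each of two
# distribution switches. All names here are made up for illustration.
UPLINKS = {
    "access-1": ["dist-1", "dist-2"],
}


def usable_uplinks(access, failed_switches=(), failed_links=()):
    """Return the distribution switches the access switch can still reach."""
    return [
        dist
        for dist in UPLINKS[access]
        if dist not in failed_switches and (access, dist) not in failed_links
    ]


def is_reachable(access, failed_switches=(), failed_links=()):
    """True if at least one uplink path survives the given failures."""
    return bool(usable_uplinks(access, failed_switches, failed_links))


if __name__ == "__main__":
    # One distribution switch down: the other still carries traffic.
    print(is_reachable("access-1", failed_switches={"dist-1"}))
    # One link down: the remaining uplink is used.
    print(usable_uplinks("access-1", failed_links={("access-1", "dist-2")}))
    # Both distribution switches down: redundancy is exhausted.
    print(is_reachable("access-1", failed_switches={"dist-1", "dist-2"}))
```

With a single distribution switch in `UPLINKS`, the first check would already fail, which is the single point of failure that redundancy is meant to remove.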
The goal with high availability is to reduce the impact of a failure as much as possible. In my experience, devices configured in HA share configuration and operational state information so that if one device goes offline, the other takes over immediately. I have come across two main types of HA systems: active/standby and active/active. With active/standby systems, both devices are synced, but only one (the active) is handling traffic. The active and standby are constantly communicating so that the standby knows if the active goes down. Once that happens, the standby takes over immediately, limiting the amount of impact and downtime. In the active/active scenarios that I have seen (and think I understand), both devices actively pass traffic for the data plane, but only one device handles the control plane functions. If the device that is active for the control plane fails, the standby takes over that role. Because the standby device was actively participating in the data plane before the failure, impact should be limited. (Note: there are actions you can take with network devices to further speed up the control plane switchover to reduce that control plane impact as well.)
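The active/standby behavior described above can be sketched from the standby's point of view. This is a rough illustration, not how any particular vendor implements HA: the `StandbyMonitor` class, its method names, and the 3-second `dead_interval` are all assumptions, and the sketch leaves out the configuration and state syncing that real HA pairs do.

```python
class StandbyMonitor:
    """Standby-side view of an active/standby pair (illustrative only)."""

    def __init__(self, dead_interval=3.0):
        # Seconds without a heartbeat before the standby takes over
        # (an assumed value, not taken from any real product).
        self.dead_interval = dead_interval
        self.last_heartbeat = 0.0
        self.role = "standby"

    def on_heartbeat(self, now):
        """Record a heartbeat received from the active device at time `now`."""
        self.last_heartbeat = now

    def tick(self, now):
        """Check the timer and promote to active if the peer has gone silent."""
        silent_for = now - self.last_heartbeat
        if self.role == "standby" and silent_for > self.dead_interval:
            self.role = "active"
        return self.role


if __name__ == "__main__":
    monitor = StandbyMonitor(dead_interval=3.0)
    monitor.on_heartbeat(now=1.0)   # active device is alive
    print(monitor.tick(now=2.0))    # still within the dead interval
    print(monitor.tick(now=4.5))    # no heartbeat for 3.5 s, standby takes over
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock. A shorter `dead_interval` means a faster takeover but a greater chance of a false failover if a heartbeat is merely delayed.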
While organizations can do their due diligence in designing redundant, highly available systems, they still need to plan for failures, outages, and disasters. That is where disaster recovery enters the picture. At a high level, disaster recovery involves having specific plans and procedures for restoring services and operations after an outage or disaster. DR plans really need to be tailored to the needs and requirements of the business. Two tools to help us in the disaster recovery planning process are recovery time objective (RTO) and recovery point objective (RPO). These will be covered in a subsequent post.
Being ready for anything is a tall order and probably not feasible. What we can do is be as prepared as possible for adversity and things to go wrong (as they will). An important step is working with business leaders to understand their operations and what the availability requirements are for the different systems in the organization. Once those requirements are known and understood, we can better serve our organizations through the concepts of redundancy, high availability, and disaster recovery.