The Need for Fault Tolerance and High Availability


“The ability for a system to always be functioning and accessible, with minimal downtime”

For any business, it is vital that customers can access its systems at all times, ideally with no downtime. (Downtime is any time the system is not available.)

Every second that a customer-facing system is down may mean missed sales, and can damage the reputation of the business through lowered customer satisfaction. Even slow responses can have this impact: if a request takes too long, the customer will be dissatisfied with the service and may go elsewhere. The system must always be available and operating smoothly to facilitate business success.

High Availability is not just important for customer-facing systems. If an internal system is frequently down, productivity is reduced and time is taken from important tasks until the system is functioning again.

It is clear why High Availability is vital for systems, but how is it achieved? The answer is Fault Tolerance.


“The ability to remain fully operational, even if some components of the system fail”

© CC BY-SA 2.0: Image by devfactory

The key point to understand is that Fault Tolerance allows for High Availability in systems. The concepts are linked, but still different. Systems designed with Fault Tolerance in mind can have areas fail without impacting the functionality of the system, thus achieving High Availability.

The core principle when designing a fault tolerant system is ‘redundancy’: always having alternative servers that can continue functioning and handle traffic if another server goes down.
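To put a rough number on what redundancy buys, here is a minimal Python sketch. It assumes that servers fail independently of one another and that the system counts as up whenever at least one server is up. Both are simplifying assumptions, and the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` servers is up.

    Assumes each server is up with probability `single` and that
    failures are independent (a simplification).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The system is down only if every replica is down at the same time.
    return 1.0 - (1.0 - single) ** replicas


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(n, round(a, 6), round(downtime_hours_per_year(a), 3))
```

Under these assumptions, a single server that is up 99% of the time is down for roughly 87.6 hours a year, while two such servers are both down for under an hour a year.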

Hypothetically, a business could have thousands of additional servers set up and dealing with requests, which would prevent any one server becoming overloaded. However, there are also other considerations. For example, how does the system know when a server is down and that it should direct traffic elsewhere, saving customers from accessing an unhealthy server and getting a failure response?
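One common answer to that question is to have whatever distributes the traffic keep track of which servers are healthy and skip the ones that are not. The Python sketch below shows the idea with a simple round-robin router. The Server and HealthAwareRouter names are hypothetical, and the healthy flag is assumed to be updated elsewhere by some health check.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Server:
    name: str
    healthy: bool = True  # assumed to be updated by a separate health check


class HealthAwareRouter:
    """Round-robin over servers, skipping any marked unhealthy."""

    def __init__(self, servers: Iterable[Server]) -> None:
        self._servers = list(servers)
        self._next = 0  # index at which the next search starts

    def route(self) -> Server:
        total = len(self._servers)
        # Try each server at most once, starting from the current position.
        for offset in range(total):
            candidate = self._servers[(self._next + offset) % total]
            if candidate.healthy:
                self._next = (self._next + offset + 1) % total
                return candidate
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    pool = [Server("a"), Server("b", healthy=False), Server("c")]
    router = HealthAwareRouter(pool)
    print([router.route().name for _ in range(4)])
```

With server "b" marked unhealthy, requests alternate between "a" and "c", so no customer is sent to the failed server.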

In reality, the cost of this setup would be astronomical due to the purchase and maintenance of such a high number of servers. It is therefore not feasible; cost is the main barrier to achieving Fault Tolerance.

Everything required for a Fault Tolerant system is possible and simplified in the cloud.

Servers can be accessed On Demand and you are only charged for the resources utilised. There is no longer a need for an extensive backup infrastructure in which servers sit underutilised, held as spare capacity in case they have to handle the load of a failed server. Using Auto Scaling in the cloud, capacity can be increased as needed, and you only pay for the capacity used.
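As an illustration of how a scaling decision can be made, the Python sketch below sizes a fleet so that the average load per server moves back towards a target, within fixed minimum and maximum limits. It is a simplified model written for this article, not any cloud provider's actual API, and the parameter names are assumptions.

```python
import math


def desired_capacity(
    current: int,
    load_per_server: float,
    target_per_server: float,
    minimum: int,
    maximum: int,
) -> int:
    """Server count that brings average load back towards the target.

    `load_per_server` and `target_per_server` share a unit, for example
    percent CPU or requests per second (hypothetical metrics).
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_per_server <= 0:
        raise ValueError("target_per_server must be positive")
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    # Total load divided by the load each server should carry, rounded up.
    wanted = math.ceil(current * load_per_server / target_per_server)
    # Never drop below the baseline or grow past the spending limit.
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    print(desired_capacity(4, 90.0, 60.0, minimum=2, maximum=10))
```

The minimum keeps a baseline of servers running even when traffic is quiet, and the maximum caps how far costs can grow during a spike.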

Health checks can easily be created to monitor a server, raise an alert when it fails, and redirect traffic to a healthy server. Because traffic is redirected automatically as soon as a failure is detected, a failed server causes little or no downtime. Furthermore, a new server is provisioned automatically to replace the failed resource. This Autohealing keeps the system at the required baseline for successful functionality.
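The Autohealing idea can be sketched as a small reconcile step: drop every server that fails its health check, then launch replacements until the fleet is back at its baseline. The Python below is a minimal illustration only. The heal_fleet and make_launcher helpers are hypothetical, and in a real environment the health check and the launch step would call the cloud platform.

```python
from itertools import count
from typing import Callable, List


def heal_fleet(
    fleet: List[str],
    is_healthy: Callable[[str], bool],
    launch: Callable[[], str],
    baseline: int,
) -> List[str]:
    """Return a fleet with failed servers removed and replaced."""
    # Keep only the servers that pass the health check.
    survivors = [server for server in fleet if is_healthy(server)]
    # Launch replacements until the baseline is met again.
    while len(survivors) < baseline:
        survivors.append(launch())
    return survivors


def make_launcher(prefix: str, start: int) -> Callable[[], str]:
    """Stand-in for provisioning: hands out sequential server names."""
    ids = count(start)
    return lambda: f"{prefix}-{next(ids)}"


if __name__ == "__main__":
    fleet = ["web-1", "web-2", "web-3"]
    healed = heal_fleet(
        fleet,
        is_healthy=lambda server: server != "web-2",  # pretend web-2 failed
        launch=make_launcher("web", 4),
        baseline=3,
    )
    print(healed)
```

Running a step like this on a schedule means a failure is corrected without anyone having to intervene.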

A catastrophic event such as a natural disaster could destroy an entire data center full of servers, instantly wiping out every resource and causing irreparable failure of the system if every server were housed there. In the past, avoiding this meant incurring a huge cost to spread servers across two isolated data centers, which also added extra complexity to the management of the system. Within the cloud, resources can be spread across a multitude of data centers with ease and even expanded across regions, to maximise Fault Tolerance and High Availability.
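To see why spreading servers out limits the damage, the Python sketch below places servers evenly across a list of data centers and counts how many survive if one location is lost. The zone names and function names are made up for illustration.

```python
from typing import List, Sequence


def spread_across_zones(server_count: int, zones: Sequence[str]) -> List[str]:
    """Assign each server to a zone in turn so the spread stays even."""
    if not zones:
        raise ValueError("at least one zone is required")
    return [zones[i % len(zones)] for i in range(server_count)]


def surviving_servers(placement: Sequence[str], failed_zone: str) -> int:
    """Count the servers left if one whole zone is lost."""
    return sum(1 for zone in placement if zone != failed_zone)


if __name__ == "__main__":
    # Hypothetical zone names, not tied to any provider.
    zones = ["zone-a", "zone-b", "zone-c"]
    placement = spread_across_zones(6, zones)
    print(placement)
    print(surviving_servers(placement, "zone-a"))
```

With six servers in one data center, losing that data center loses everything. Spread across three, the same event leaves four servers running.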

For advice on making your Cloud environments fault tolerant and highly available, contact one of our Solution Architects. Email [email protected]
