Designing for failure

Assume that everything will fail. Carefully review every aspect of your cloud architecture, and design for failure scenarios in each of them. In particular, assume that hardware will fail, cloud data center outages will happen, databases will fail or degrade in performance, expected transaction volumes will be exceeded, and so on. In addition, in an auto-scaled environment, nodes may be shut down when load returns to normal levels after a spike. Nodes may also be rebooted by the cloud platform, and there can be unexpected application failures. In all these cases, the design goal should be to handle such error conditions gracefully and minimize any impact on the user experience.

There should be a strong preference for minimizing human or manual intervention. Hence, it is better to use the services made available by the cloud platform to reduce the chances of failure and to automate recovery when failures do occur.

The following is a list of key design principles that will help you handle failures in the cloud more effectively:

  • Store no application state on your servers, so that if a server is killed you do not lose any application state. Sessions and log records should never be stored on the local filesystem.
  • Logging should always be to a centralized location, for example, a database or a third-party logging service. If you need to store information temporarily for subsequent processing, use the cloud platform's reliable queuing service. This matters not only when servers fail but also when the environment scales. When instances are terminated during scale-in, anything stored only on their local filesystems is lost.
  • Your log records should contain additional cloud-specific information to help the debugging process, for example, instance ID, region, availability zone, tenant ID, and so on. Centralized logging across multiple tenants (in a shared everything configuration) can get voluminous. Therefore, it helps to use tools for viewing, searching, and filtering log records.
  • A request passes through numerous components (for example, network components) on its way to the server-side processing components. An error can occur anywhere and at any time during the life of the request. These errors typically surface as a server error (that is, a 5xx series error). In such cases, it is normal for the application code to implement retry logic. The cloud provider's SDKs usually provide features that make implementing this retry logic simpler.

Remember to log your retry attempts. If you notice a high number of retries, it’s a good idea to review the sizing of your infrastructure. You will most likely need to provision additional resources to reduce the error rate and the retries that result from it.
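The logging advice in the list above (cloud-specific fields on every record, and a record for every retry) can be illustrated with a minimal sketch using Python's standard logging module. The class and function names, the field names, and the sample values such as i-0example are assumptions made for illustration; a real deployment would attach a handler that ships records to the centralized logging service rather than writing to a local stream.

```python
import json
import logging


class CloudContextFormatter(logging.Formatter):
    """Render each record as one JSON line with cloud context fields attached.

    Hypothetical helper: the field names are illustrative, not a standard.
    """

    def __init__(self, context):
        super().__init__()
        # Fields that are fixed for the lifetime of this instance,
        # for example instance ID, region, and availability zone.
        self.context = dict(context)

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        entry.update(self.context)
        # Per-request fields arrive through the `extra=` argument of the log call.
        for key in ("tenant_id", "retry_attempt"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, sort_keys=True)


def build_logger(name, context, stream=None):
    """Return a logger whose records carry the given cloud context.

    `stream` defaults to stderr; swap the handler for one that forwards
    to your centralized log store.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(CloudContextFormatter(context))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(
        "orders",
        {
            "instance_id": "i-0example",        # assumed sample value
            "region": "region-1",               # assumed sample value
            "availability_zone": "region-1a",   # assumed sample value
        },
    )
    # A retry is logged together with the tenant it affects.
    log.warning("retrying request", extra={"tenant_id": "tenant-42", "retry_attempt": 2})
```

Emitting one JSON object per line keeps the records easy to search and filter by tenant or instance once they reach the central store.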

  • The cloud platform may restrict the number of API requests you can issue in a given time period. Hence, in addition to limiting the total number of retries, you need to ensure you do not exceed the allowed request rates by implementing delays between your retry attempts. This is typically implemented using an exponential back-off algorithm, in which you introduce progressively longer delays between your retry attempts.
  • Avoid single points of failure. Plan to distribute your services across multiple regions and availability zones (zones are separate data centers within the same region). This will minimize the chances of an application outage due to the failure of an individual instance, an availability zone, or a region.

Sometimes running multiple instances is cost-prohibitive for smaller organizations (this is very common for start-ups new to the cloud). If you want to run a single instance, you should still configure it for auto scaling. Set both the minimum and the maximum number of servers to one. This ensures that if your instance becomes unhealthy, the cloud service replaces it with a new instance, limiting the downtime to a few minutes.
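As one possible illustration of the single-instance setup, the sketch below assumes the platform is AWS and the boto3 SDK is used; the text itself does not name a provider. The helper function, group name, launch template name, and subnet IDs are hypothetical. The function only builds the request parameters, and the actual API call is left commented out because it needs credentials and existing resources.

```python
def single_instance_group_params(group_name, launch_template_name, subnet_ids):
    """Build parameters for an auto scaling group pinned to exactly one instance.

    Hypothetical helper. The keys are parameters of boto3's
    create_auto_scaling_group call; the values are illustrative.
    """
    return {
        "AutoScalingGroupName": group_name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # Minimum and maximum both set to one: the group never scales out,
        # but an unhealthy instance is still replaced automatically.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Listing subnets in more than one zone lets the replacement
        # instance start in a different zone if one zone is unavailable.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds to wait before health checks count
    }


params = single_instance_group_params(
    "web-single",                    # assumed group name
    "web-template",                  # assumed launch template name
    ["subnet-aaaa", "subnet-bbbb"],  # assumed subnet IDs
)

# With credentials configured, the group could then be created with:
# import boto3
# boto3.client("autoscaling").create_auto_scaling_group(**params)
```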

In some cases, for example, highly interactive applications, it is best to just display a simple message to the end user to resubmit the transaction or refresh the screen (the resulting retry will likely succeed).
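The retry guidance above (a bounded number of attempts, progressively longer delays, and a log record per retry) can be sketched roughly as follows. This is a minimal illustration and not a provider SDK's built-in retry feature: TransientError and call_with_backoff are hypothetical names, the default values are arbitrary, and the random jitter is a common addition that the text does not itself prescribe.

```python
import logging
import random
import time

logger = logging.getLogger("retry")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure, such as a 5xx response."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, jitter=True, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential back-off.

    The delay before retry n is base_delay * 2**(n - 1), capped at max_delay.
    With jitter enabled, the actual wait is a random value between zero and
    that delay, which spreads out retries from many clients.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            if attempt == max_attempts:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            if jitter:
                delay = random.uniform(0, delay)
            # Log every retry so that a rising retry rate is visible.
            logger.warning("attempt %d failed (%s); retrying in %.2fs",
                           attempt, exc, delay)
            sleep(delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_operation():
        # Fails twice, then succeeds, to exercise the retry path.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("503 Service Unavailable")
        return "ok"

    print(call_with_backoff(flaky_operation, base_delay=0.1))
```

Capping both the number of attempts and the maximum delay keeps a failing dependency from stalling the caller indefinitely; once the attempts are exhausted, the error is re-raised so that the application can fall back to asking the user to resubmit.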