This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 11: How do you design your workload to withstand component failures?

Workloads with a requirement for high availability and low mean time to recovery (MTTR) must be architected for resiliency.

Resources

Static stability in AWS: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS OpsWorks: Using Auto Healing to Replace Failed Instances
What Is Amazon EventBridge?
Amazon Route 53: Choosing a Routing Policy
What Is AWS Global Accelerator?
The Amazon Builders' Library: Static stability using Availability Zones
The Amazon Builders' Library: Implementing health checks
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project
Multiple data center HA network connectivity
AWS Marketplace: products that can be used for fault tolerance
APN Partner: partners that can help with automation of your fault tolerance

Best Practices:

Improvement Plan

Monitor all components of the workload to detect failures

  • Determine the collection interval for your components based on your recovery goals.
  • Configure detailed monitoring for components.
  • Create custom metrics to measure business Key Performance Indicators (KPIs): Workloads implement key business functions. Use these functions as KPIs that help you identify when an indirect problem happens.
    Publishing Custom Metrics
  • Monitor the user experience for failures using user canaries: Synthetic transaction testing (also known as "canary testing", but not to be confused with canary deployments) that can run and simulate customer behavior is among the most important testing processes. Run these tests constantly against your workload endpoints from diverse remote locations.
    Amazon CloudWatch Synthetics enables you to create user canaries
  • Create custom metrics that track the user's experience: If you can instrument the customer's experience, you can determine when it degrades.
    Publishing Custom Metrics
  • Set alarms to detect when any part of your workload is not working properly, and to indicate when to scale resources automatically: Alarms can be displayed on dashboards, send alerts through Amazon SNS or email, and work with Auto Scaling to scale a workload's resources up or down.
    Using Amazon CloudWatch Alarms
  • Create dashboards to visualize your metrics: Dashboards show trends, outliers, and other indicators of potential problems that you may want to investigate.
    Using CloudWatch Dashboards
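
The custom-metric steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the namespace `Example/Orders`, the metric name `OrdersPlaced`, and the helper `build_kpi_metric` are hypothetical. The helper only builds the keyword arguments for the CloudWatch `put_metric_data` call, so it runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_kpi_metric(namespace, metric_name, value, unit="Count", dimensions=None):
    """Build keyword arguments for CloudWatch put_metric_data (hypothetical helper)."""
    datum = {
        "MetricName": metric_name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    if dimensions:
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        datum["Dimensions"] = [
            {"Name": name, "Value": val} for name, val in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


# Hypothetical business KPI: number of orders placed in the last interval.
params = build_kpi_metric(
    "Example/Orders", "OrdersPlaced", 42, dimensions={"Region": "us-east-1"}
)

# Assumption: boto3 is installed and credentials are configured. Then publish with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**params)
```

Publishing one data point per collection interval keeps the metric's granularity aligned with the interval you chose from your recovery goals.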

Fail over to healthy resources

  • Fail over to healthy resources: Ensure that if a resource fails, healthy resources can continue to serve requests. For location failures (such as the loss of an Availability Zone or AWS Region), ensure that you have systems in place to fail over to healthy resources in unimpaired locations.
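
As a rough sketch of the failover idea, the function below picks which endpoints should receive traffic based on health-check results. It is a simplified, hypothetical model (the endpoint names, the `priority` field, and `select_healthy` are assumptions for illustration), not the behavior of any specific AWS service.

```python
def select_healthy(endpoints, is_healthy):
    """Return names of healthy endpoints in the most preferred tier.

    endpoints:  list of dicts with "name" and "priority" (lower is preferred).
    is_healthy: dict mapping endpoint name to the latest health-check result.
    """
    healthy = [e for e in endpoints if is_healthy.get(e["name"], False)]
    if not healthy:
        return []  # Nothing healthy: the caller decides how to degrade.
    best = min(e["priority"] for e in healthy)
    return [e["name"] for e in healthy if e["priority"] == best]


# Hypothetical layout: two primary endpoints in different Availability Zones
# and one standby endpoint in another Region.
endpoints = [
    {"name": "app-az1", "priority": 0},
    {"name": "app-az2", "priority": 0},
    {"name": "app-dr", "priority": 1},
]

# One Availability Zone is impaired: traffic stays on the remaining primary.
one_az_down = select_healthy(
    endpoints, {"app-az1": False, "app-az2": True, "app-dr": True}
)

# Both primaries are impaired: traffic fails over to the standby.
region_down = select_healthy(
    endpoints, {"app-az1": False, "app-az2": False, "app-dr": True}
)
```

An endpoint with no health-check result is treated as unhealthy here, which is a conservative design choice for this sketch.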

Automate healing on all layers

  • Use Auto Scaling groups to deploy tiers in an application: Auto Scaling can perform self-healing on stateless applications, and can add and remove capacity.
    How AWS Auto Scaling Works
  • Implement automatic recovery on EC2 instances whose applications cannot be deployed in multiple locations and can tolerate a reboot after a failure: Automatic recovery can replace failed hardware and restart the instance when the application cannot be deployed in multiple locations. The instance metadata and associated IP addresses are kept, as are the Amazon EBS volumes and the mount points to Amazon Elastic File System (Amazon EFS), Amazon FSx for Lustre, and Amazon FSx for Windows File Server file systems.
    Amazon EC2 Automatic Recovery
    Amazon Elastic Block Store (Amazon EBS)
    Amazon Elastic File System (Amazon EFS)
    What is Amazon FSx for Lustre?
    What is Amazon FSx for Windows File Server?
  • Implement automated recovery using AWS Step Functions and AWS Lambda when you cannot use automatic scaling or automatic recovery, or when automatic recovery fails: In these cases, you can build your own healing automation with AWS Step Functions and AWS Lambda.
    What is AWS Step Functions?
    What is AWS Lambda?
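
One possible shape for such custom healing is a Lambda function that reacts to an EC2 instance state-change event delivered by Amazon EventBridge. The sketch below is an assumption-laden illustration: the mapping from state to action and the names `plan_healing_action` and `lambda_handler` are hypothetical, and the actual EC2 call is left as a comment so the code runs without AWS access.

```python
def plan_healing_action(event):
    """Decide what to do for an EC2 state-change event (hypothetical policy)."""
    if event.get("source") != "aws.ec2":
        return None
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    state = detail.get("state")
    if instance_id is None:
        return None
    if state == "stopped":
        return {"instance_id": instance_id, "action": "start"}
    if state == "terminated":
        return {"instance_id": instance_id, "action": "replace"}
    return None  # Other states need no healing in this sketch.


def lambda_handler(event, context):
    """Lambda entry point: returns the planned action for logging or a state machine."""
    plan = plan_healing_action(event)
    if plan is None:
        return {"action": "none"}
    # Assumption: with boto3 available in the Lambda runtime, a "start" plan
    # could be carried out with:
    #   boto3.client("ec2").start_instances(InstanceIds=[plan["instance_id"]])
    return plan


# Example event in the shape EventBridge uses for EC2 state changes.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
result = lambda_handler(sample_event, None)
```

Returning the plan rather than acting silently makes it easy for a Step Functions state machine to branch on the `action` value and to retry or escalate.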

Use static stability to prevent bimodal behavior

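  • Use static stability to prevent bimodal behavior: Bimodal behavior is when your workload behaves differently under normal and failure modes, for example, by relying on launching new instances if an Availability Zone fails. A statically stable design instead operates in a single mode: it keeps enough capacity provisioned in each Availability Zone to carry the load if one Availability Zone is lost.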
    The Amazon Builders' Library: Static stability using Availability Zones
    Static stability in AWS: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
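
The capacity arithmetic behind static stability can be sketched as follows, assuming load is spread evenly across Availability Zones and every instance has the same capacity. The function name `instances_per_az` and the example numbers are hypothetical.

```python
import math


def instances_per_az(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to run in each AZ so the survivors can carry peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one Availability Zone must survive")
    # Each surviving AZ must hold an equal share of the full peak load.
    return math.ceil(peak_instances / surviving)


# Hypothetical workload: peak load needs 6 instances, deployed across 3 AZs.
per_az = instances_per_az(6, 3)   # 3 instances in each AZ
total = per_az * 3                # 9 instances provisioned in total
after_failure = per_az * (3 - 1)  # 6 instances remain if one AZ is lost
```

With three Availability Zones, the workload runs 9 instances to serve a 6-instance peak, so losing one Availability Zone requires no new launches during the failure.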

Send notifications when events impact availability

  • Alarm on business Key Performance Indicators (KPIs) when they cross a low threshold: A low-threshold alarm on your business KPIs helps you know when your workload is unavailable or non-functional.
    Creating a CloudWatch Alarm Based on a Static Threshold
  • Alarm on events that invoke healing automation: You can call the Amazon SNS API directly from any automation that you create to send notifications.
    What is Amazon Simple Notification Service?
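
A low-threshold KPI alarm that notifies an Amazon SNS topic might be configured along these lines. This is a sketch under stated assumptions: the alarm name, namespace, metric, threshold, and topic ARN are hypothetical, and the helper `build_low_kpi_alarm` only builds the keyword arguments for the CloudWatch `put_metric_alarm` call.

```python
def build_low_kpi_alarm(alarm_name, namespace, metric_name, threshold,
                        sns_topic_arn, period_seconds=60, evaluation_periods=3):
    """Build keyword arguments for CloudWatch put_metric_alarm (hypothetical helper)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        # Fire when the KPI drops below the threshold.
        "ComparisonOperator": "LessThanThreshold",
        # Treat missing data as a breach: no orders reported may mean an outage.
        "TreatMissingData": "breaching",
        # Notify the SNS topic when the alarm enters the ALARM state.
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical values for illustration only.
alarm = build_low_kpi_alarm(
    "orders-placed-low",
    "Example/Orders",
    "OrdersPlaced",
    threshold=5,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
)

# Assumption: boto3 is installed and credentials are configured. Then create it with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Requiring several consecutive breaching periods trades a slower notification for fewer false alarms; tune `evaluation_periods` against your recovery goals.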