This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 11: How do you design your workload to withstand component failures?

Workloads with a requirement for high availability and low mean time to recovery (MTTR) must be architected for resiliency.

Resources

Static stability in AWS: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS OpsWorks: Using Auto Healing to Replace Failed Instances
What Is Amazon EventBridge?
Amazon Route 53: Choosing a Routing Policy
What Is AWS Global Accelerator?
The Amazon Builders' Library: Static stability using Availability Zones
The Amazon Builders' Library: Implementing health checks
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project
Multiple data center HA network connectivity
AWS Marketplace: products that can be used for fault tolerance
APN Partner: partners that can help with automation of your fault tolerance

Best Practices:

Monitor all components of the workload to detect failures: Continuously monitor the health of your workload so that you and your automated systems are aware of degradation or complete failure as soon as they occur. Monitor for key performance indicators (KPIs) based on business value.
Fail over to healthy resources: Ensure that if a resource fails, healthy resources can continue to serve requests. For location failures (such as an Availability Zone or AWS Region), ensure you have systems in place to fail over to healthy resources in unimpaired locations.
Automate healing on all layers: Upon detection of a failure, use automated capabilities to perform actions to remediate.
Use static stability to prevent bimodal behavior: Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails. You should instead build workloads that are statically stable and operate in only one mode. In this case, provision enough instances in each Availability Zone to handle the full workload if one Availability Zone were removed, and then use Elastic Load Balancing or Amazon Route 53 health checks to shift load away from the impaired instances.
Send notifications when events impact availability: Notifications are sent upon the detection of significant events, even if the issue caused by the event was automatically resolved.

Improvement Plan

Monitor all components of the workload to detect failures

Determine the collection interval for your components based on your recovery goals.

Your monitoring interval depends on how quickly you must recover: The time it takes to detect a failure counts toward your total recovery time, so choose a collection frequency that, together with the time needed to recover, fits within your recovery time objective (RTO).
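
As a rough illustration of that relationship, the sketch below estimates the longest collection interval that still fits within an RTO. The function name, the sample numbers, and the assumption that a failure is detected only after a fixed number of consecutive collection intervals are all hypothetical and are not AWS guidance.

```python
def max_collection_interval(rto_seconds: float,
                            recovery_seconds: float,
                            evaluation_periods: int = 1) -> float:
    """Return the longest collection interval (in seconds) that fits the RTO.

    Assumption: a failure is detected only after `evaluation_periods`
    consecutive collection intervals, and recovery then takes
    `recovery_seconds`.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    detection_budget = rto_seconds - recovery_seconds
    if detection_budget <= 0:
        raise ValueError("recovery alone already uses up the RTO")
    return detection_budget / evaluation_periods


if __name__ == "__main__":
    # Hypothetical numbers: 10-minute RTO, 4 minutes to recover,
    # and an alarm that fires after 3 consecutive bad data points.
    print(max_collection_interval(600, 240, evaluation_periods=3))
```

With these hypothetical numbers the result is 120 seconds, so 5-minute metrics would detect the failure too slowly, while 1-minute metrics would fit.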

Configure detailed monitoring for components.

Determine if detailed monitoring for EC2 instances and Auto Scaling is necessary: Detailed monitoring provides metrics at 1-minute intervals, and default monitoring provides metrics at 5-minute intervals.
Enable or Disable Detailed Monitoring for Your Instance
Monitoring Your Auto Scaling Groups and Instances Using Amazon CloudWatch
Determine if enhanced monitoring for RDS is necessary: Enhanced monitoring uses an agent on the RDS instance to collect information about the different processes or threads running on it.
Enhanced Monitoring

Create custom metrics to measure business Key Performance Indicators (KPIs): Workloads implement key business functions. These functions should be used as KPIs that help identify when an indirect problem happens.
Publishing Custom Metrics
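
A minimal sketch of publishing one business KPI as a custom metric follows. The namespace `MyWorkload/Business`, the metric name `OrdersCompleted`, and both function names are hypothetical. The client is passed in so the request can be built and inspected without AWS credentials; in practice it would be something like `boto3.client("cloudwatch")`, whose `put_metric_data` call accepts these keyword arguments.

```python
from datetime import datetime, timezone


def build_kpi_metric(orders_completed: int,
                     namespace: str = "MyWorkload/Business") -> dict:
    """Build keyword arguments for the CloudWatch PutMetricData call.

    The namespace and metric name are placeholders for illustration.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "OrdersCompleted",
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(orders_completed),
                "Unit": "Count",
            }
        ],
    }


def publish_kpi(cloudwatch_client, orders_completed: int) -> None:
    """Send the KPI using a client such as boto3.client("cloudwatch")."""
    cloudwatch_client.put_metric_data(**build_kpi_metric(orders_completed))
```

Separating the request builder from the call keeps the metric definition easy to review and test on its own.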

Monitor the user experience for failures using user canaries: Synthetic transaction testing (also known as "canary testing", but not to be confused with canary deployments) that can run and simulate customer behavior is among the most important testing processes. Run these tests constantly against your workload endpoints from diverse remote locations.
Amazon CloudWatch Synthetics enables you to create user canaries
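
To make the idea of a synthetic transaction concrete, here is a hand-rolled sketch of a single probe. It is not the CloudWatch Synthetics API; the names `run_canary`, `default_fetch`, and `CanaryResult`, and the latency limit, are assumptions for illustration. A real canary would run on a schedule from several locations and publish its results as metrics.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class CanaryResult(NamedTuple):
    success: bool
    latency_seconds: float
    detail: str


def default_fetch(url: str, timeout: float) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_canary(url: str,
               fetch: Callable[[str, float], int] = default_fetch,
               timeout: float = 5.0,
               max_latency_seconds: float = 2.0) -> CanaryResult:
    """Probe `url` once and judge it the way a customer would experience it."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
    except (urllib.error.URLError, OSError) as exc:
        # Connection errors, timeouts, and HTTP error statuses land here.
        return CanaryResult(False, time.monotonic() - started,
                            f"request failed: {exc}")
    latency = time.monotonic() - started
    if status != 200:
        return CanaryResult(False, latency, f"unexpected status {status}")
    if latency > max_latency_seconds:
        return CanaryResult(False, latency, "response too slow")
    return CanaryResult(True, latency, "ok")
```

The `fetch` parameter exists only so the probe logic can be exercised without network access.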

Create custom metrics that track the user's experience: If you can instrument the experience of the customer, you can determine when the consumer experience degrades.
Publishing Custom Metrics

Set alarms to detect when any part of your workload is not working properly, and to indicate when to automatically scale resources: Alarms can be displayed on dashboards, send alerts through Amazon SNS or email, and work with Auto Scaling to scale the resources for a workload up or down.
Using Amazon CloudWatch Alarms
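
The following sketch shows what such an alarm definition might look like for server errors behind an Application Load Balancer. The alarm name, threshold, evaluation settings, and function names are hypothetical; the keyword arguments are those accepted by the `put_metric_alarm` call of a `boto3.client("cloudwatch")` client, which is passed in rather than created here.

```python
def build_5xx_alarm(load_balancer: str,
                    topic_arn: str,
                    threshold: float = 50.0) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm call.

    `load_balancer` is the LoadBalancer dimension value, for example
    "app/my-load-balancer/50dc6c495c0c9188" (placeholder).
    """
    return {
        "AlarmName": f"target-5xx-too-high-{load_balancer.split('/')[1]}",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",
        "Period": 60,             # seconds per data point
        "EvaluationPeriods": 3,   # consecutive breaching periods required
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: missing data means no errors were recorded.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],  # for example, an SNS topic ARN
    }


def create_alarm(cloudwatch_client, load_balancer: str, topic_arn: str) -> None:
    """Create or update the alarm using the supplied CloudWatch client."""
    cloudwatch_client.put_metric_alarm(
        **build_5xx_alarm(load_balancer, topic_arn))
```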

Create dashboards to visualize your metrics: Dashboards can be used to visually see trends, outliers, and other indicators of potential problems, or to provide an indication of problems you may want to investigate.
Using CloudWatch Dashboards

Fail over to healthy resources

Fail over to healthy resources: Ensure that if a resource fails, healthy resources can continue to serve requests. For location failures (such as an Availability Zone or AWS Region), ensure you have systems in place to fail over to healthy resources in unimpaired locations.

If your workload is using AWS services, such as Amazon S3 or Amazon DynamoDB, then they are automatically deployed to multiple Availability Zones. In case of failure, the AWS control plane automatically routes traffic to healthy locations for you.
For Amazon RDS you must choose Multi-AZ as a configuration option, and then on failure AWS automatically directs traffic to the healthy instance.
High Availability (Multi-AZ) for Amazon RDS
For Amazon EC2 instances or Amazon ECS tasks, you choose which Availability Zones to deploy to. Elastic Load Balancing then provides the solution to detect instances in unhealthy zones and route traffic to the healthy ones. Elastic Load Balancing can even route traffic to components in your on-premises data center.
For multi-Region approaches (which might also include on-premises data centers), ensure that data and resources from healthy locations can continue to serve requests.
- For example, cross-region read replicas enable you to deploy your data to multiple AWS Regions, but you still must promote the read replica to master and point your traffic at it in the event of a primary location failure.
  Overview of Amazon RDS Read Replicas
- Amazon Route 53 provides a way to define internet domains and assign routing policies, which might include health checks, to ensure that traffic is routed to healthy Regions. Alternatively, AWS Global Accelerator provides static IP addresses that act as a fixed entry point to your application, then routes to endpoints in AWS Regions of your choosing, using the AWS global network instead of the public internet for better performance and reliability.
  Amazon Route 53: Choosing a Routing Policy
  What Is AWS Global Accelerator?
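
As one possible illustration of a health-check-driven routing policy, the sketch below builds a pair of Route 53 failover records. The domain name, IP addresses, health check ID, and function names are placeholders; the dictionary layout follows the `ChangeBatch` argument of the `change_resource_record_sets` call on a `boto3.client("route53")` client, which is passed in by the caller.

```python
from typing import Optional


def failover_record(name: str, ip_address: str, role: str,
                    health_check_id: Optional[str] = None,
                    ttl: int = 60) -> dict:
    """Build one failover record set; `role` is "PRIMARY" or "SECONDARY"."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id:
        # Route 53 stops answering with this record when the check fails.
        record["HealthCheckId"] = health_check_id
    return record


def build_change_batch(name: str, primary_ip: str, secondary_ip: str,
                       primary_health_check_id: str) -> dict:
    """Build a ChangeBatch that upserts the primary and secondary records."""
    records = [
        failover_record(name, primary_ip, "PRIMARY", primary_health_check_id),
        failover_record(name, secondary_ip, "SECONDARY"),
    ]
    return {
        "Comment": "Failover routing between two locations",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": r}
                    for r in records],
    }


def apply_failover_records(route53_client, hosted_zone_id: str,
                           change_batch: dict) -> None:
    """Submit the change using a client such as boto3.client("route53")."""
    route53_client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch)
```

A short TTL is assumed here so that clients pick up a failover quickly; the right value depends on your recovery goals.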

Automate healing on all layers

Use Auto Scaling groups to deploy tiers in an application: Auto Scaling can perform self-healing on stateless applications, and add and remove capacity.
How AWS Auto Scaling Works

Implement automatic recovery on EC2 instances that run applications that cannot be deployed in multiple locations and that can tolerate rebooting upon failures: Automatic recovery can be used to replace failed hardware and restart the instance when the application is not capable of being deployed in multiple locations. The instance metadata and associated IP addresses are kept, as well as the Amazon EBS volumes and mount points to Amazon EFS, Amazon FSx for Lustre, and Amazon FSx for Windows File Server file systems.
Amazon EC2 Automatic Recovery
Amazon Elastic Block Store (Amazon EBS)
Amazon Elastic File System (Amazon EFS)
What is Amazon FSx for Lustre?
What is Amazon FSx for Windows File Server?

Using AWS OpsWorks, you can configure Auto Healing of EC2 instances at the layer level.
AWS OpsWorks: Using Auto Healing to Replace Failed Instances

Implement automated recovery using AWS Step Functions and AWS Lambda when you cannot use automatic scaling or automatic recovery, or when automatic recovery fails: In these cases, you can automate the healing with your own workflow built on AWS Step Functions and AWS Lambda.
What is AWS Step Functions?
What is AWS Lambda?

Amazon EventBridge can be used to monitor and filter for events such as CloudWatch Alarms or changes in state in other AWS services. Based on event information, it can then trigger AWS Lambda (or other targets) to execute custom remediation logic on your workload.
What Is Amazon EventBridge?
Using Amazon CloudWatch Alarms
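
The sketch below shows one way a Lambda function targeted by an EventBridge rule might decide on a remediation when a CloudWatch alarm changes state. The `REMEDIATIONS` mapping, the alarm names, and the action names are hypothetical; the event fields read here (`source`, `detail-type`, `detail.alarmName`, `detail.state.value`) are those of a CloudWatch Alarm State Change event. The actual remediation call is left out and would be specific to your workload.

```python
from typing import Optional

# Hypothetical mapping from alarm name to the remediation to run.
REMEDIATIONS = {
    "queue-worker-unresponsive": "restart_worker_service",
    "cache-node-unhealthy": "replace_cache_node",
}


def choose_remediation(event: dict) -> Optional[str]:
    """Return the remediation for an alarm event, or None to do nothing."""
    if event.get("source") != "aws.cloudwatch":
        return None
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        return None
    detail = event.get("detail", {})
    # Act only when the alarm has entered the ALARM state.
    if detail.get("state", {}).get("value") != "ALARM":
        return None
    return REMEDIATIONS.get(detail.get("alarmName"))


def lambda_handler(event, context):
    """Lambda entry point: report which remediation was selected."""
    action = choose_remediation(event)
    alarm_name = event.get("detail", {}).get("alarmName")
    if action is None:
        return {"alarm": alarm_name, "action": None, "handled": False}
    # Placeholder: invoke the workload-specific remediation here.
    return {"alarm": alarm_name, "action": action, "handled": True}
```

Keeping the decision logic in a pure function makes it straightforward to test against recorded events before wiring it to real remediation actions.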

Use static stability to prevent bimodal behavior

Use static stability to prevent bimodal behavior: Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails.
The Amazon Builders' Library: Static stability using Availability Zones
Static stability in AWS: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)

You should instead build systems that are statically stable and operate in only one mode. In this case, provision enough instances in each Availability Zone to handle the full workload if one Availability Zone were removed, and then use Elastic Load Balancing or Amazon Route 53 health checks to shift load away from the impaired instances.
Another example of bimodal behavior is allowing clients to bypass your workload cache when failures occur. This might seem to be a solution to accommodate client needs, but should not be allowed since it significantly changes demands on your workload and is likely to result in failures.
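
The provisioning rule above comes down to simple arithmetic, sketched here. The function name and the sample figures are illustrative assumptions, and the sketch assumes load is spread evenly across the surviving Availability Zones.

```python
import math


def instances_per_az(peak_instances_needed: int,
                     az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the survivors can carry peak load.

    Assumption: load is spread evenly across the surviving AZs.
    """
    surviving_azs = az_count - az_failures_tolerated
    if surviving_azs < 1:
        raise ValueError("at least one Availability Zone must survive")
    return math.ceil(peak_instances_needed / surviving_azs)


if __name__ == "__main__":
    # Hypothetical workload that needs 12 instances at peak.
    for azs in (2, 3):
        per_az = instances_per_az(12, azs)
        print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} in total")
```

For a workload that needs 12 instances, this gives 6 per zone (18 in total) across three Availability Zones, or 12 per zone (24 in total) across two. Spreading across more zones lowers the cost of the spare capacity that static stability requires.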

Send notifications when events impact availability

Alarm on business Key Performance Indicators when they fall below a low threshold: Having a low-threshold alarm on your business KPIs helps you know when your workload is unavailable or non-functional.
Creating a CloudWatch Alarm Based on a Static Threshold

Alarm on events that invoke healing automation: You can directly invoke an Amazon SNS API to send notifications from any automation that you create.
What is Amazon Simple Notification Service?
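
A minimal sketch of sending such a notification from healing automation follows. The function names, message fields, and subject format are assumptions; the keyword arguments match the `publish` call of a `boto3.client("sns")` client, which the caller supplies. The notification is sent whether or not the healing succeeded, so that automatically resolved events are still visible.

```python
import json


def build_healing_notification(topic_arn: str, resource_id: str,
                               action: str, succeeded: bool) -> dict:
    """Build keyword arguments for the SNS Publish call."""
    outcome = "succeeded" if succeeded else "FAILED"
    return {
        "TopicArn": topic_arn,
        # SNS subjects are limited to 100 characters.
        "Subject": f"Auto-healing {outcome}: {resource_id}"[:100],
        "Message": json.dumps(
            {"resource": resource_id, "action": action, "outcome": outcome},
            indent=2,
        ),
    }


def notify_healing(sns_client, topic_arn: str, resource_id: str,
                   action: str, succeeded: bool) -> None:
    """Publish the notification using a client such as boto3.client("sns")."""
    sns_client.publish(**build_healing_notification(
        topic_arn, resource_id, action, succeeded))
```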