REL 11: How do you design your workload to withstand component failures?
Workloads with a requirement for high availability and low mean time to recovery (MTTR) must be architected for resiliency.
Resources
Static stability in AWS: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS OpsWorks: Using Auto Healing to Replace Failed Instances
What Is Amazon EventBridge?
Amazon Route 53: Choosing a Routing Policy
What Is AWS Global Accelerator?
The Amazon Builders' Library: Static stability using Availability Zones
The Amazon Builders' Library: Implementing health checks
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project
Multiple data center HA network connectivity
AWS Marketplace: products that can be used for fault tolerance
APN Partner: partners that can help with automation of your fault tolerance
Best Practices:
- Monitor all components of the workload to detect failures: Continuously monitor the health of your workload so that you and your automated systems are aware of degradation or complete failure as soon as they occur. Monitor for key performance indicators (KPIs) based on business value.
- Fail over to healthy resources: Ensure that if a resource fails, healthy resources can continue to serve requests. For location failures (such as an Availability Zone or AWS Region), ensure you have systems in place to fail over to healthy resources in unimpaired locations.
- Automate healing on all layers: Upon detection of a failure, use automated capabilities to perform actions to remediate.
- Use static stability to prevent bimodal behavior: Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails. You should instead build workloads that are statically stable and operate in only one mode. In this case, provision enough instances in each Availability Zone to handle the workload's load if one Availability Zone were removed, and then use Elastic Load Balancing or Amazon Route 53 health checks to shift load away from the impaired instances.
- Send notifications when events impact availability: Send notifications upon the detection of significant events, even if the issue caused by the event was automatically resolved.
Improvement Plan
Monitor all components of the workload to detect failures
- Your monitoring interval is dependent on how quickly you must recover: Your recovery time includes the time it takes to detect a failure, so determine the frequency of metric collection by accounting for that detection time and your recovery time objective (RTO).
- Determine if detailed monitoring for EC2 instances and Auto Scaling is necessary: Detailed monitoring provides 1-minute interval metrics, and default monitoring provides 5-minute interval metrics.
Enable or Disable Detailed Monitoring for Your Instance
Monitoring Your Auto Scaling Groups and Instances Using Amazon CloudWatch
- Determine if enhanced monitoring for RDS is necessary: Enhanced monitoring uses an agent on the RDS instances to get useful information about different processes or threads on an RDS instance.
Enhanced Monitoring
Publishing Custom Metrics
Amazon CloudWatch Synthetics enables you to create user canaries
Using Amazon CloudWatch Alarms
Using CloudWatch Dashboards
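As a rough sketch of how the collection interval feeds into recovery time, the following assumes an alarm fires only after a number of consecutive breaching periods, so detection takes approximately the period multiplied by the number of evaluation periods. The function names, the evaluation period count, the remediation time, and the RTO value are all illustrative assumptions, not figures from AWS documentation.

```python
def approx_detection_seconds(period_seconds: int, evaluation_periods: int) -> int:
    """Approximate time for an alarm to fire once a failure has started.

    Assumes the alarm needs `evaluation_periods` consecutive breaching
    datapoints, each covering `period_seconds`.
    """
    return period_seconds * evaluation_periods


def fits_rto(period_seconds: int, evaluation_periods: int,
             remediation_seconds: int, rto_seconds: int) -> bool:
    """True if detection plus remediation fits inside the RTO."""
    detection = approx_detection_seconds(period_seconds, evaluation_periods)
    return detection + remediation_seconds <= rto_seconds


# Assumed scenario: 3 evaluation periods, 4 minutes to remediate, 10-minute RTO.
default_ok = fits_rto(300, 3, 240, 600)   # 5-minute metrics
detailed_ok = fits_rto(60, 3, 240, 600)   # 1-minute metrics
print(default_ok, detailed_ok)
```

Under these assumed numbers, 5-minute metrics alone consume more than the whole RTO before remediation starts, while 1-minute metrics leave room for it. This is the kind of comparison that can inform whether detailed monitoring is worth enabling.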
Fail over to healthy resources
- If your workload is using AWS services, such as Amazon S3 or Amazon DynamoDB, then they are automatically deployed to multiple Availability Zones. In case of failure, the AWS control plane automatically routes traffic to healthy locations for you.
- For Amazon RDS you must choose Multi-AZ as a configuration option, and then on failure AWS automatically directs traffic to the healthy instance.
High Availability (Multi-AZ) for Amazon RDS
- For Amazon EC2 instances or Amazon ECS tasks, you choose which Availability Zones to deploy to. Elastic Load Balancing then provides the solution to detect instances in unhealthy zones and route traffic to the healthy ones. Elastic Load Balancing can even route traffic to components in your on-premises data center.
- For multi-region approaches (which might also include on-premises data centers), ensure that data and resources from healthy locations can continue to serve requests.
- For example, cross-region read replicas enable you to deploy your data to multiple AWS Regions, but you still must promote the read replica to master and point your traffic at it in the event of a primary location failure.
Overview of Amazon RDS Read Replicas
- Amazon Route 53 provides a way to define internet domains and assign routing policies, which might include health checks, to ensure that traffic is routed to healthy Regions. Alternately, AWS Global Accelerator provides static IP addresses that act as a fixed entry point to your application, then routes to endpoints in AWS Regions of your choosing, using the AWS global network instead of the public internet for better performance and reliability.
Amazon Route 53: Choosing a Routing Policy
What Is AWS Global Accelerator?
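The health-check-driven failover described above can be illustrated with a small simulation. This is a minimal sketch, not how Elastic Load Balancing or Amazon Route 53 is implemented: the `Endpoint` class, the `pick_endpoint` function, the Region names, and the failure threshold of 3 are assumptions made for the example, and a single passing check marks an endpoint healthy again, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    """A location that serves traffic (hypothetical model)."""
    name: str
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> None:
        # A passing check resets the counter; a failing check increments it.
        if check_passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def healthy(self, failure_threshold: int = 3) -> bool:
        return self.consecutive_failures < failure_threshold


def pick_endpoint(endpoints: List[Endpoint],
                  failure_threshold: int = 3) -> Optional[Endpoint]:
    """Return the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if endpoint.healthy(failure_threshold):
            return endpoint
    return None


primary = Endpoint("us-east-1")
secondary = Endpoint("us-west-2")

# Three failed checks in a row take the primary out of rotation.
for _ in range(3):
    primary.record(False)

chosen = pick_endpoint([primary, secondary])
print(chosen.name)
```

The routing decision only selects among endpoints that already exist and are already serving. Data-tier steps such as promoting a read replica still have to happen separately, as noted above.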
Automate healing on all layers
How AWS Auto Scaling Works
Amazon EC2 Automatic Recovery
Amazon Elastic Block Store (Amazon EBS)
Amazon Elastic File System (Amazon EFS)
What is Amazon FSx for Lustre?
What is Amazon FSx for Windows File Server?
- Using AWS OpsWorks, you can configure Auto Healing of EC2 instances at the layer level
AWS OpsWorks: Using Auto Healing to Replace Failed Instances
What is AWS Step Functions?
What is AWS Lambda?
- Amazon EventBridge can be used to monitor and filter for events such as CloudWatch Alarms or changes in state in other AWS services. Based on event information, it can then trigger AWS Lambda (or other targets) to execute custom remediation logic on your workload.
What Is Amazon EventBridge?
Using Amazon CloudWatch Alarms
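One way to picture the EventBridge-to-Lambda remediation path is sketched below. The event pattern matches CloudWatch alarm state changes into the ALARM state, and the handler reads the alarm name and state from the event detail. The `remediate` function is a hypothetical placeholder for your own logic, and the sample event is trimmed to only the fields the handler reads.

```python
# Event pattern for an EventBridge rule: match alarms entering the ALARM state.
EVENT_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def remediate(alarm_name: str) -> str:
    """Hypothetical placeholder for custom remediation logic."""
    return f"remediation requested for {alarm_name}"


def lambda_handler(event, context):
    """Lambda entry point invoked by the EventBridge rule."""
    detail = event.get("detail", {})
    state = detail.get("state", {}).get("value")
    alarm_name = detail.get("alarmName", "unknown")

    # Defensive re-check: ignore anything that is not an alarm in ALARM state.
    if event.get("source") != "aws.cloudwatch" or state != "ALARM":
        return {"action": "ignored", "alarm": alarm_name}

    return {"action": remediate(alarm_name), "alarm": alarm_name}


# Trimmed sample event containing only the fields used above.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "HighErrorRate", "state": {"value": "ALARM"}},
}
result = lambda_handler(sample_event, None)
print(result)
```

Filtering in the rule keeps unrelated events from invoking the function at all. The check inside the handler is an extra safeguard in case the rule is later broadened.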
Use static stability to prevent bimodal behavior
The Amazon Builders' Library: Static stability using Availability Zones
Static stability in AWS: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
- Build systems that are statically stable and operate in only one mode. For example, provision enough instances in each Availability Zone to handle the workload's load if one Availability Zone were removed, and then use Elastic Load Balancing or Amazon Route 53 health checks to shift load away from the impaired instances.
- Another example of bimodal behavior is allowing clients to bypass your workload cache when failures occur. This might seem to be a solution to accommodate client needs, but it should not be allowed, because it significantly changes the demands on your workload and is likely to result in failures.
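The capacity arithmetic behind static stability can be sketched as follows, assuming load is spread evenly across Availability Zones and that the workload must survive the loss of exactly one zone. The function names and the sample figures are illustrative assumptions.

```python
import math


def instances_per_az(peak_instances_needed: int, az_count: int) -> int:
    """Instances to run in each AZ so the remaining AZs can carry peak load
    after one AZ is lost, without launching anything new."""
    if az_count < 2:
        raise ValueError("need at least two Availability Zones to lose one")
    return math.ceil(peak_instances_needed / (az_count - 1))


def total_provisioned(peak_instances_needed: int, az_count: int) -> int:
    """Total instances running at all times across every AZ."""
    return instances_per_az(peak_instances_needed, az_count) * az_count


# Assumed example: peak load needs 12 instances.
print(instances_per_az(12, 3), total_provisioned(12, 3))
print(instances_per_az(12, 2), total_provisioned(12, 2))
```

With three zones, the example runs 6 instances per zone (18 total) to cover a need of 12. With two zones, it runs 12 per zone (24 total). Spreading across more zones reduces the overprovisioning required to stay in a single operating mode.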
Send notifications when events impact availability
Creating a CloudWatch Alarm Based on a Static Threshold
What is Amazon Simple Notification Service?
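A minimal sketch of the "notify even when automatically resolved" practice is shown below. It only builds the notification payloads. The dictionary keys mirror the TopicArn, Subject, and Message parameters of the Amazon SNS Publish action, while the topic ARN, the function names, the event names, and the status labels are placeholders assumed for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder; replace with a real topic ARN.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:availability-events"


def build_notification(event_name: str, auto_resolved: bool) -> Dict[str, str]:
    """Build a publish payload for one significant event."""
    status = "AUTO-RESOLVED" if auto_resolved else "ACTION REQUIRED"
    return {
        "TopicArn": TOPIC_ARN,
        "Subject": f"[{status}] {event_name}",
        "Message": f"Event '{event_name}' was detected. Status: {status}.",
    }


def notifications_for(events: List[Tuple[str, bool]]) -> List[Dict[str, str]]:
    """One notification per event; resolved events are not filtered out."""
    return [build_notification(name, resolved) for name, resolved in events]


notes = notifications_for([
    ("Availability Zone failover", True),
    ("Database storage nearly full", False),
])
for note in notes:
    print(note["Subject"])
```

Automatically resolved events are kept in the output so that operators still see what happened and can look for recurring patterns.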