REL 10: How do you use fault isolation to protect your workload?
Fault-isolated boundaries confine the effect of a failure within a workload to a limited number of components. Components outside of the boundary are unaffected by the failure. Using multiple fault-isolated boundaries, you can limit the impact on your workload.
Resources
AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
AWS re:Invent 2019: Innovation and operation of the AWS global network infrastructure (NET339)
What is AWS Outposts?
Global Tables: Multi-Region Replication with DynamoDB
AWS Local Zones FAQ
AWS Global Infrastructure
The Amazon Builders' Library: Workload isolation using shuffle-sharding
Best Practices:
- Deploy the workload to multiple locations: Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions. These locations can be as diverse as required.
- Automate recovery for components constrained to a single location: If components of the workload can only run in a single Availability Zone or on-premises data center, you must implement the capability to do a complete rebuild of the workload within your defined recovery objectives.
- Use bulkhead architectures: Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests/users so the number of impaired requests is limited, and most can continue without error. Bulkheads for data are usually called partitions or shards, while bulkheads for services are known as cells.
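The bulkhead idea above can be illustrated with a minimal Python sketch. It assumes a fixed set of cells and pins each customer to exactly one cell with a stable hash, so a failed cell impairs only the customers mapped to it. The cell names and the function names `cell_for` and `impaired_customers` are hypothetical and used only for illustration; they are not part of any AWS API.

```python
import hashlib

# Hypothetical cell names; a real workload would load these from configuration.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]


def cell_for(customer_id: str, cells: list[str] = CELLS) -> str:
    """Map a customer to exactly one cell using a stable hash of the ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def impaired_customers(customers: list[str], failed_cell: str) -> list[str]:
    """Return the customers whose requests are routed to the failed cell."""
    return [c for c in customers if cell_for(c) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    hit = impaired_customers(customers, "cell-0")
    print(f"{len(hit)} of {len(customers)} customers are affected by a cell-0 failure")
```

Simple modulo hashing remaps many customers when the number of cells changes. A production router would more likely keep an explicit customer-to-cell mapping table, which is outside the scope of this sketch.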
Improvement Plan
Deploy the workload to multiple locations
- Regional services are inherently deployed across Availability Zones.
- This includes Amazon S3, Amazon DynamoDB, and AWS Lambda (when not connected to a VPC).
- Deploy your container, instance, and function-based workloads into multiple Availability Zones. Use multi-zone datastores, including caches. Use the features of EC2 Auto Scaling, ECS task placement, AWS Lambda function configuration when running in your VPC, and ElastiCache clusters.
- Use subnets in separate Availability Zones when you deploy Auto Scaling groups.
Example: Distributing instances across Availability Zones
- Use ECS task placement parameters, specifying DB subnet groups.
Amazon ECS task placement strategies
- Use subnets in multiple Availability Zones when you configure a function to run in your VPC.
Configuring an AWS Lambda function to access resources in an Amazon VPC
- Use multiple Availability Zones with ElastiCache clusters.
Choosing Regions and Availability Zones
AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
- Backing up to another AWS Region adds another layer of assurance that data will be available when needed.
- Some workloads have regulatory requirements that require a multi-Region strategy.
What is AWS Outposts?
AWS Local Zones FAQ
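The guidance above on placing Auto Scaling groups in subnets from separate Availability Zones can be sketched as follows. This is a minimal illustration, assuming boto3 is available for the final call. The helper names `build_asg_request` and `create_asg`, the group name, the launch template ID, and the subnet IDs are hypothetical placeholders. The request keys follow the EC2 Auto Scaling `CreateAutoScalingGroup` parameters, where `VPCZoneIdentifier` is a comma-separated list of subnet IDs.

```python
def build_asg_request(name, launch_template_id, subnets_by_az, min_size=2, max_size=6):
    """Build keyword arguments for EC2 Auto Scaling's CreateAutoScalingGroup.

    subnets_by_az maps an Availability Zone name to one subnet ID in that zone.
    """
    if len(subnets_by_az) < 2:
        raise ValueError("use subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id, "Version": "$Latest"},
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnet IDs, one per Availability Zone, in zone-name order.
        "VPCZoneIdentifier": ",".join(subnets_by_az[az] for az in sorted(subnets_by_az)),
    }


def create_asg(request):
    """Send the request with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("autoscaling").create_auto_scaling_group(**request)


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    request = build_asg_request(
        "web-tier",
        "lt-0123456789abcdef0",
        {"us-east-1a": "subnet-aaa", "us-east-1b": "subnet-bbb"},
    )
    print(request["VPCZoneIdentifier"])
```

Keeping the request builder separate from the API call lets the multi-zone check run in tests or CI without touching an AWS account.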
Automate recovery for components constrained to a single location
- Use Auto Scaling groups for instances and container workloads that do not require a fixed instance ID, private IP address, Elastic IP address, or instance metadata.
What Is EC2 Auto Scaling?
Service automatic scaling
- The launch configuration user data can be used to implement automation that can self-heal most workloads.
- Use automatic recovery of EC2 instances for workloads that require a single instance ID, private IP address, Elastic IP address, and instance metadata.
Recover your instance
- Automatic recovery sends recovery status alerts to an SNS topic when the instance failure is detected.
- Use EC2 instance lifecycle events, or ECS events, to automate self-healing where automatic scaling or EC2 recovery cannot be used.
EC2 Auto Scaling lifecycle hooks
Amazon ECS events
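The automatic recovery guidance above can be sketched as a CloudWatch alarm on the EC2 system status check, with the EC2 recover action and an SNS topic as alarm actions. This is a minimal illustration, assuming boto3 is available for the final call. The helper names, the instance ID, the topic ARN, and the chosen period and evaluation settings are assumptions for illustration, not recommended values.

```python
def build_recovery_alarm(instance_id, region, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm.

    The alarm fires when the system status check fails, then triggers the
    EC2 recover action and notifies the given SNS topic.
    """
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,              # seconds per evaluation period (illustrative)
        "EvaluationPeriods": 2,    # consecutive failing periods before alarming
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [
            f"arn:aws:automate:{region}:ec2:recover",  # EC2 recover action
            sns_topic_arn,                             # recovery status alerts
        ],
    }


def create_alarm(request, region):
    """Send the request with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**request)


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    alarm = build_recovery_alarm(
        "i-0123456789abcdef0",
        "us-east-1",
        "arn:aws:sns:us-east-1:111122223333:recovery-alerts",
    )
    print(alarm["AlarmActions"])
```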
Use bulkhead architectures
Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
- In his AWS blog post, Colm MacCarthaigh explains how Amazon Route 53 uses the concept of shuffle sharding to isolate customer requests into shards.
Shuffle Sharding: Massive and Magical Fault Isolation
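The shuffle-sharding concept referenced above can be illustrated with a minimal Python sketch. It is not the Route 53 implementation. It assumes a pool of eight workers and gives each customer a deterministic, pseudo-randomly chosen shard of two workers, so two customers rarely share the same full shard. The worker names, the shard size, and the function names `shuffle_shard` and `fully_impaired` are hypothetical.

```python
import hashlib
import random

# Hypothetical worker pool and shard size, chosen only for illustration.
WORKERS = [f"worker-{i}" for i in range(8)]
SHARD_SIZE = 2


def shuffle_shard(customer_id, workers=WORKERS, shard_size=SHARD_SIZE):
    """Pick a deterministic pseudo-random subset of workers for a customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))  # seeded per customer
    return tuple(sorted(rng.sample(workers, shard_size)))


def fully_impaired(customers, failed_workers):
    """Return customers whose entire shard is inside the failed set."""
    failed = set(failed_workers)
    return [c for c in customers if set(shuffle_shard(c)) <= failed]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(200)]
    # Suppose one customer's traffic takes down every worker in its shard.
    poisoned = shuffle_shard("customer-1")
    impaired = fully_impaired(customers, poisoned)
    print(f"shard {poisoned} failed; {len(impaired)} of {len(customers)} customers lost all workers")
```

With eight workers and shards of two there are 28 possible shards, so only customers who happen to draw the same pair lose all of their workers. Customers who share one worker with the failed shard still have a healthy worker to retry against.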