This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 10: How do you use fault isolation to protect your workload?

Fault-isolated boundaries limit the effect of a failure within a workload to a small number of components. Components outside the boundary are unaffected by the failure. By using multiple fault-isolated boundaries, you can limit the impact of a failure on your workload.

Resources

AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
AWS re:Invent 2019: Innovation and operation of the AWS global network infrastructure (NET339)
What is AWS Outposts?
Global Tables: Multi-Region Replication with DynamoDB
AWS Local Zones FAQ
AWS Global Infrastructure
The Amazon Builders' Library: Workload isolation using shuffle-sharding

Best Practices:

Deploy the workload to multiple locations: Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions. These locations can be as diverse as required.
Automate recovery for components constrained to a single location: If components of the workload can only run in a single Availability Zone or on-premises data center, you must implement the capability to do a complete rebuild of the workload within your defined recovery objectives.
Use bulkhead architectures: Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests/users so the number of impaired requests is limited, and most can continue without error. Bulkheads for data are usually called partitions or shards, while bulkheads for services are known as cells.

Improvement Plan

Deploy the workload to multiple locations

Use multiple Availability Zones and AWS Regions: Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions. These locations can be as diverse as required.

Regional services are inherently deployed across Availability Zones.
- This includes Amazon S3, Amazon DynamoDB, and AWS Lambda (when not connected to a VPC).
Deploy your container-, instance-, and function-based workloads into multiple Availability Zones, and use multi-zone data stores, including caches: Use the features of EC2 Auto Scaling, ECS task placement, AWS Lambda function configuration when running in your VPC, and ElastiCache clusters.
- Use subnets in separate Availability Zones when you deploy Auto Scaling groups.
  Example: Distributing instances across Availability Zones
- Use ECS task placement strategies to spread tasks across Availability Zones.
  Amazon ECS task placement strategies
- Use subnets in multiple Availability Zones when you configure a function to run in your VPC.
  Configuring an AWS Lambda function to access resources in an Amazon VPC
- Use multiple Availability Zones with ElastiCache clusters.
  Choosing Regions and Availability Zones
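
The subnet guidance above can be sketched in Python. This is a minimal illustration, not a complete deployment script: it only builds part of the request for an Auto Scaling group and does not call AWS. The group name, subnet IDs, and the helper names `build_asg_request` and `spread_evenly` are hypothetical. The one detail taken from the real API is that `VPCZoneIdentifier` is a comma-separated string of subnet IDs.

```python
def build_asg_request(group_name, subnets_by_az, min_size, max_size):
    """Build partial keyword arguments for an Auto Scaling group.

    subnets_by_az maps an Availability Zone name to one subnet ID in that
    zone. The launch template is omitted here; a real request needs one.
    """
    if len(subnets_by_az) < 2:
        raise ValueError("use subnets in at least two Availability Zones")
    ordered_subnets = [subnets_by_az[az] for az in sorted(subnets_by_az)]
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # The API expects the subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(ordered_subnets),
    }


def spread_evenly(instance_count, zones):
    """Show how many instances land in each zone with an even spread."""
    ordered = sorted(zones)
    base, extra = divmod(instance_count, len(ordered))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(ordered)}


# Hypothetical subnets, one per Availability Zone.
subnets = {
    "us-east-1a": "subnet-aaa111",
    "us-east-1b": "subnet-bbb222",
    "us-east-1c": "subnet-ccc333",
}
request = build_asg_request("web-asg", subnets, min_size=3, max_size=9)
print(request["VPCZoneIdentifier"])
print(spread_evenly(7, subnets))
```

Assuming valid credentials and a launch template, a dictionary like this could be passed to the `create_auto_scaling_group` call of an SDK client. The even-spread helper only illustrates the balance you would expect across zones; it is not the service's own balancing logic.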

If your workload must be deployed to multiple Regions, choose a multi-Region strategy: Most reliability needs can be met within a single AWS Region using a multi-Availability Zone strategy. Use a multi-Region strategy only when it is necessary to meet your business needs.
AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)

Backing up to another AWS Region can add another layer of assurance that data will be available when needed.
Some workloads have regulatory requirements that require a multi-Region strategy.

Evaluate AWS Outposts for your workload: If your workload requires low latency to your on-premises data center or has local data processing requirements, run AWS infrastructure and services on premises using AWS Outposts.
What is AWS Outposts?

Determine if AWS Local Zones helps you provide service to your users: If you have low-latency requirements, check whether an AWS Local Zone is located near your users. If so, use it to deploy workloads closer to those users.
AWS Local Zones FAQ

Automate recovery for components constrained to a single location

Implement self-healing: Deploy your instances or containers using automatic scaling when possible. If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.

Use Auto Scaling groups for instances and container workloads that do not need to keep a specific instance ID, private IP address, Elastic IP address, or instance metadata.
What Is EC2 Auto Scaling?
Service automatic scaling
- The launch configuration user data can be used to implement automation that can self-heal most workloads.
Use automatic recovery of EC2 instances for workloads that must keep the same instance ID, private IP address, Elastic IP address, and instance metadata.
Recover your instance.
- Automatic recovery sends recovery status alerts to an Amazon SNS topic when the instance failure is detected.
Use EC2 instance lifecycle events, or ECS events, to automate self-healing where automatic scaling or EC2 recovery cannot be used.
EC2 Auto Scaling lifecycle hooks
Amazon ECS events
- Use the events to invoke automation that will heal your component according to the process logic you require.
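
The two non-scaling options above can be sketched as plain Python data handling. This is an illustration under stated assumptions, not a definitive implementation: it does not call AWS. `recovery_alarm_params` builds the settings for a CloudWatch alarm on the `StatusCheckFailed_System` metric with the EC2 recover action, and `choose_healing_action` maps an incoming event to a healing step. The function names, the action labels (`restart_task`, `drain_instance`, `replace_instance`), the thresholds, and the ARNs are hypothetical; the event field names follow the event formats published for ECS, EC2, and EC2 Auto Scaling.

```python
def recovery_alarm_params(instance_id, region, topic_arn):
    """Build alarm settings that recover an instance and notify a topic."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # seconds; an assumed value
        "EvaluationPeriods": 2,   # an assumed value
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # First action recovers the instance, second sends the alert.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover", topic_arn],
    }


def choose_healing_action(event):
    """Return a (hypothetical action label, resource identifier) pair."""
    source = event.get("source")
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})

    if source == "aws.ecs" and detail_type == "ECS Task State Change":
        if detail.get("lastStatus") == "STOPPED":
            return ("restart_task", detail.get("taskArn"))
    elif source == "aws.autoscaling":
        if detail_type == "EC2 Instance-terminate Lifecycle Action":
            return ("drain_instance", detail.get("EC2InstanceId"))
    elif source == "aws.ec2":
        if detail_type == "EC2 Instance State-change Notification":
            if detail.get("state") == "stopped":
                return ("replace_instance", detail.get("instance-id"))
    return ("ignore", None)


alarm = recovery_alarm_params(
    "i-0123456789abcdef0",
    "us-east-1",
    "arn:aws:sns:us-east-1:111122223333:recovery-alerts",
)
stopped_task = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "STOPPED", "taskArn": "arn:aws:ecs:example-task"},
}
print(alarm["AlarmActions"])
print(choose_healing_action(stopped_task))
```

In practice the alarm settings would be passed to a CloudWatch client, and the routing function would sit behind an event rule that invokes your automation. The healing steps themselves depend on the process logic your component requires.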

Use bulkhead architectures

Use bulkhead architectures: Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests or users, so the number of impaired requests is limited and most continue without error. Bulkheads for data are usually called partitions or shards, while bulkheads for services are known as cells.
Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)

Evaluate cell-based architecture for your workload: In a cell-based architecture, each cell is a complete, independent instance of the service and has a fixed maximum size. As load increases, workloads grow by adding more cells. A partition key is used on incoming traffic to determine which cell will process the request. Any failure is contained to the single cell it occurs in, so that the number of impaired requests is limited as other cells continue without error. It is important to identify the proper partition key to minimize cross-cell interactions and avoid the need to involve complex mapping services in each request. Services that require complex mapping end up merely shifting the problem to the mapping services, while services that require cross-cell interactions reduce the independence of cells (and thus the assumed availability improvements of doing so).
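
The partition-key routing described above can be sketched in a few lines of Python. This is a simplified illustration, assuming a fixed number of cells and a hash-modulo mapping; the function names and the customer IDs used as partition keys are hypothetical. A production router would also need a plan for adding cells, because a plain modulo remaps many keys when the cell count changes.

```python
import hashlib


def cell_for(partition_key, cell_count):
    """Map a partition key to a cell index with a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def impaired_fraction(keys, cell_count, failed_cell):
    """Fraction of keys whose requests would hit the failed cell."""
    hit = sum(1 for key in keys if cell_for(key, cell_count) == failed_cell)
    return hit / len(keys)


# Hypothetical customer IDs used as the partition key.
customers = [f"customer-{n}" for n in range(1000)]
print(cell_for("customer-42", cell_count=10))
print(impaired_fraction(customers, cell_count=10, failed_cell=3))
```

Because the mapping depends only on the key, no lookup service is involved in each request, and a failure in one cell impairs roughly one tenth of the keys in this ten-cell example while the other cells keep serving.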

In his AWS blog post, Colm MacCarthaigh explains how Amazon Route 53 uses the concept of shuffle sharding to isolate customer requests into shards.
Shuffle Sharding: Massive and Magical Fault Isolation
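
The idea behind shuffle sharding can be illustrated with a short Python sketch. It is a simplified model and not a description of how Route 53 implements it: each customer gets a pseudo-random combination of workers, so two customers rarely share their entire shard. The function names, worker counts, and customer IDs are assumptions made for the example.

```python
import hashlib
import random
from math import comb


def shuffle_shard(customer_id, worker_count, shard_size):
    """Pick a stable pseudo-random set of workers for one customer."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(range(worker_count), shard_size))


def fully_overlapping(customers, worker_count, shard_size, victim):
    """List customers whose shard is identical to the victim's shard."""
    target = shuffle_shard(victim, worker_count, shard_size)
    return [
        customer
        for customer in customers
        if customer != victim
        and shuffle_shard(customer, worker_count, shard_size) == target
    ]


# Hypothetical fleet: 8 workers, each customer assigned 2 of them.
customers = [f"customer-{n}" for n in range(200)]
print(sorted(shuffle_shard("customer-0", 8, 2)))
print(comb(8, 2))  # number of distinct two-worker shards: 28
print(len(fully_overlapping(customers, 8, 2, victim="customer-0")))
```

With 8 workers split into 4 fixed pairs, a customer that takes down its pair affects every customer on that pair, about one quarter of the total. With shuffle sharding there are 28 possible pairs, so only the customers that drew the same pair are fully affected; the rest still have at least one healthy worker.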