This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 10: How do you use fault isolation to protect your workload?

Fault isolated boundaries limit the effect of a failure within a workload to a limited number of components. Components outside of the boundary are unaffected by the failure. Using multiple fault isolated boundaries, you can limit the impact on your workload.

Resources

AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
AWS re:Invent 2019: Innovation and operation of the AWS global network infrastructure (NET339)
What is AWS Outposts?
Global Tables: Multi-Region Replication with DynamoDB
AWS Local Zones FAQ
AWS Global Infrastructure
The Amazon Builders' Library: Workload isolation using shuffle-sharding

Best Practices:

Improvement Plan

Deploy the workload to multiple locations

  • Use multiple Availability Zones and AWS Regions: Distribute workload data and resources across multiple Availability Zones or, where necessary, across AWS Regions. These locations can be as diverse as required.
  • If your workload must be deployed to multiple Regions, choose a multi-region strategy: Most reliability needs can be met within a single AWS region using a multi Availability Zone strategy. Use a multi-region strategy when necessary to meet your business needs.
    AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
  • Evaluate AWS Outposts for your workload: If your workload requires low latency to your on-premises data center or has local data processing requirements. Then run AWS infrastructure and services on premises using AWS Outposts
    What is AWS Outposts?
  • Determine if AWS Local Zones helps you provide service to your users: o If you have low-latency requirements, see if AWS Local Zones is located near your users. If yes, then use it to deploy workloads closer to those users
    AWS Local Zones FAQ
  • Automate recovery for components constrained to a single location

  • Implement self-healing: Deploy your instances or containers using automatic scaling when possible. If you cannot use automatic scaling, use automatic recovery for EC2 instances or implement self-healing automation based on Amazon EC2 or ECS container lifecycle events.
  • Use automatic recovery of EC2 instances for workloads that require a single instance ID address, private IP address, Elastic IP address, and instance metadata.
    Recover your instance.
  • Use EC2 instance lifecycle events or ECS events to automate self-healing where automatic scaling or EC2 recovery cannot be used.
    EC2 Auto Scaling lifecycle hooks
    Amazon ECS events
  • Use bulkhead architectures

  • Use bulkhead architectures: Like the bulkheads on a ship, this pattern ensures that a failure is contained to a small subset of requests/users so the number of impaired requests is limited, and most will continue without error. Bulkheads for data are usually called partitions or shards, while bulkheads for services are known as cells
    Shuffle-sharding: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
    AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)
  • Evaluate cell-based architecture for your workload: In a cell-based architecture, each cell is a complete, independent instance of the service and has a fixed maximum size. As load increases, workloads grow by adding more cells. A partition key is used on incoming traffic to determine which cell will process the request. Any failure is contained to the single cell it occurs in, so that the number of impaired requests is limited as other cells continue without error. It is important to identify the proper partition key to minimize cross-cell interactions and avoid the need to involve complex mapping services in each request. Services that require complex mapping end up merely shifting the problem to the mapping services, while services that require cross-cell interactions reduce the independence of cells (and thus the assumed availability improvements of doing so).