This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 5: How do you design interactions in a distributed system to mitigate or withstand failures?

Distributed systems rely on communications networks to interconnect components (such as servers or services). Your workload must operate reliably despite data loss or latency over these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices enable workloads to withstand stresses or failures, more quickly recover from them, and mitigate the impact of such impairments. The result is improved mean time to recovery (MTTR).

Resources

Retry, backoff, and jitter: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
Error Retries and Exponential Backoff in AWS
Amazon API Gateway: Throttle API Requests for Better Throughput
The Amazon Builders' Library: Timeouts, retries, and backoff with jitter
The Amazon Builders' Library: Avoiding fallback in distributed systems
The Amazon Builders' Library: Avoiding insurmountable queue backlogs
The Amazon Builders' Library: Caching challenges and strategies
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
CircuitBreaker (summarizes Circuit Breaker from “Release It!” book)
Michael Nygard “Release It! Design and Deploy Production-Ready Software”

Best Practices:

Implement graceful degradation to transform applicable hard dependencies into soft dependencies: When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner. For example, when a dependency call fails, failover to a predetermined static response.
Throttle requests: This is a mitigation pattern to respond to an unexpected increase in demand. Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled. The expectation on clients is that they will back off and abandon the request or try again at a slower rate.
Control and limit retry calls: Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
Fail fast and limit queues: If the workload is unable to respond successfully to a request, then fail fast. This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources. If the workload is able to respond successfully but the rate of requests is too high, then use a queue to buffer requests instead. However, do not allow long queues that can result in serving stale requests that the client has already given up on.
Set client timeouts: Set timeouts appropriately, verify them systematically, and do not rely on default values as they are generally set too high
Make services stateless where possible: Services should either not require state, or should offload state such that between different client requests, there is no dependence on locally stored data on disk or in memory. This enables servers to be replaced at will without causing an availability impact. Amazon ElastiCache or Amazon DynamoDB are good destinations for offloaded state.
Implement emergency levers: These are rapid processes that may mitigate availability impact on your workload. They can be operated in the absence of a root cause. An ideal emergency lever reduces the cognitive burden on the resolvers to zero by providing fully deterministic activation and deactivation criteria. Example levers include blocking all robot traffic or serving a static response. Levers are often manual, but they can also be automated.

Improvement Plan

Implement graceful degradation to transform applicable hard dependencies into soft dependencies

Implement graceful degradation to transform applicable hard dependencies into soft dependencies: When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner. For example, when a dependency call fails, failover to a predetermined static response.

By returning a static response, your workload mitigates failures that occur in its dependencies
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
Detect when the retry operation is likely to fail, and prevent your client from making failed calls with the circuit breaker pattern
CircuitBreaker

Throttle requests

Throttle requests: This is a mitigation pattern to respond to an unexpected increase in demand. Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled. The expectation on clients is that they will back off and abandon the request or try again at a slower rate.

Use Amazon API Gateway
Throttle API Requests for Better Throughput

Control and limit retry calls

Control and limit retry calls: Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
Error Retries and Exponential Backoff in AWS

Amazon SDKs implement this by default. Implement similar logic in your dependency layer when calling your own dependent services. Decide what the timeouts are and when to stop retrying based on your use case.

Fail fast and limit queues

Fail fast and limit queues: If the workload is unable to respond successfully to a request, then fail fast. This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources. If the workload is able to respond successfully but the rate of requests is too high, then use a queue to buffer requests instead. However, do not allow long queues that can result in serving stale requests that the client has already given up on.

Implement fail fast when service is under stress
Fail Fast
Limit queues: In a queue-based system, when processing stops but messages keep arriving, the message debt can accumulate into a large backlog, driving up processing time. Work can be completed too late for the results to be useful, essentially causing the availability hit that queueing was meant to guard against.
The Amazon Builders' Library: Avoiding insurmountable queue backlogs

Set client timeouts

Set both a connection timeout and a request timeout on any remote call, and generally on any call across processes: Many frameworks offer built-in timeout capabilities, but be careful as many have default values that are infinite or too high. A value that is too high reduces the usefulness of the timeout because resources continue to be consumed while the client waits for the timeout to occur. A too low value can generate increased traffic on the backend and increased latency because too many requests are retried. In some cases, this can lead to complete outages because all requests are being retried.
AWS SDK: Retries and Timeouts

Make services stateless where possible

Make your applications stateless: Stateless applications enable horizontal scaling and are tolerant to the failure of an individual node.

Remove state that could actually be stored in request parameters.
- Some data (like cookies) can be passed in headers or query parameters.
- Refactor to remove state that can be quickly passed in requests.
After examining whether the state is required, move any state tracking to a resilient Multi-zone cache or data store like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, or a third-party distributed data solution: Store a state that could not be moved to resilient data stores.
- Some data may not actually be needed per request and can be retrieved on demand.
- Remove data that can be asynchronously retrieved.
- Decide on a data store that meets the requirements for a required state.
- Consider a NoSQL database for non-relational data.

Implement emergency levers

Implement emergency levers: These are rapid processes that may mitigate availability impact on your workload. They can be operated in the absence of a root cause. An ideal emergency lever reduces the cognitive burden on the resolvers to zero by providing fully deterministic activation and deactivation criteria. Example levers include blocking all robot traffic or serving a static response. Levers are often manual, but they can also be automated

Tips for implementing and using emergency levers
- When levers are activated, do LESS, not more
- Keep it simple, avoid bimodal behavior
- Test your levers periodically
These are examples of actions that are NOT emergency levers
- Add capacity
- Call up service owners of clients that depend on your service and ask them to reduce calls
- Making a change to code and releasing it