REL 5: How do you design interactions in a distributed system to mitigate or withstand failures?
Distributed systems rely on communications networks to interconnect components (such as servers or services). Your workload must operate reliably despite data loss or latency over these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices enable workloads to withstand stresses or failures, more quickly recover from them, and mitigate the impact of such impairments. The result is improved mean time to recovery (MTTR).
Resources
Retry, backoff, and jitter: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)
Error Retries and Exponential Backoff in AWS
Amazon API Gateway: Throttle API Requests for Better Throughput
The Amazon Builders' Library: Timeouts, retries, and backoff with jitter
The Amazon Builders' Library: Avoiding fallback in distributed systems
The Amazon Builders' Library: Avoiding insurmountable queue backlogs
The Amazon Builders' Library: Caching challenges and strategies
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
CircuitBreaker (summarizes Circuit Breaker from “Release It!” book)
Michael Nygard “Release It! Design and Deploy Production-Ready Software”
Best Practices:
- Implement graceful degradation to transform applicable hard dependencies into soft dependencies: When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner. For example, when a dependency call fails, fail over to a predetermined static response.
- Throttle requests: This is a mitigation pattern to respond to an unexpected increase in demand. Some requests are honored but those over a defined limit are rejected and return a message indicating they have been throttled. The expectation on clients is that they will back off and abandon the request or try again at a slower rate.
- Control and limit retry calls: Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
- Fail fast and limit queues: If the workload is unable to respond successfully to a request, then fail fast. This allows the releasing of resources associated with a request, and permits the service to recover if it’s running out of resources. If the workload is able to respond successfully but the rate of requests is too high, then use a queue to buffer requests instead. However, do not allow long queues that can result in serving stale requests that the client has already given up on.
- Set client timeouts: Set timeouts appropriately, verify them systematically, and do not rely on default values, as they are generally set too high.
- Make services stateless where possible: Services should either not require state, or should offload state such that between different client requests, there is no dependence on locally stored data on disk or in memory. This enables servers to be replaced at will without causing an availability impact. Amazon ElastiCache or Amazon DynamoDB are good destinations for offloaded state.
- Implement emergency levers: These are rapid processes that may mitigate availability impact on your workload. They can be operated in the absence of a root cause. An ideal emergency lever reduces the cognitive burden on the resolvers to zero by providing fully deterministic activation and deactivation criteria. Example levers include blocking all robot traffic or serving a static response. Levers are often manual, but they can also be automated.
Improvement Plan
Implement graceful degradation to transform applicable hard dependencies into soft dependencies
- By returning a static response, your workload mitigates failures that occur in its dependencies.
Well-Architected lab: Level 300: Implementing Health Checks and Managing Dependencies to Improve Reliability
- Detect when the retry operation is likely to fail, and prevent your client from making failed calls with the circuit breaker pattern.
CircuitBreaker
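The two points above (return a static response when a dependency fails, and stop calling a dependency that keeps failing) can be combined in a small circuit breaker. The following is a minimal sketch, not a production implementation: the class name, the thresholds, and the `fetch_recommendations` dependency are all hypothetical, and it is not thread-safe.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after consecutive failures,
    then allows a trial call once a cool-down period has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_timeout = reset_timeout          # seconds, assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        """Call func(); on failure, or while the circuit is open, return fallback."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback  # open: do not touch the dependency at all
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Hypothetical usage: a recommendations dependency with a static fallback.
STATIC_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]


def fetch_recommendations():
    raise ConnectionError("recommendation service unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
for _ in range(3):
    recommendations = breaker.call(fetch_recommendations, STATIC_RECOMMENDATIONS)
```

In this sketch the third call never reaches the failing dependency, because the circuit opened after the second failure. The caller still receives the static list each time, so the hard dependency behaves as a soft one.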
Throttle requests
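One common way to throttle is a token bucket, sketched below. This is an illustration under assumptions: the rate and capacity are arbitrary, `handle_request` is a hypothetical wrapper, and using HTTP status 429 as the "you have been throttled" message assumes an HTTP API.

```python
import time


class TokenBucket:
    """Token-bucket throttle: allows bursts of up to `capacity` requests,
    refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self):
        """Return True if the request is within the limit, False if it must be rejected."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill for the time elapsed, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket, handler):
    """Return (status, body). Requests over the limit are rejected immediately."""
    if not bucket.allow():
        return 429, "Too Many Requests: slow down and retry later"
    return 200, handler()
```

Requests within the limit are honored as usual. A rejected request costs almost nothing to answer, which is what lets the service keep serving the traffic it has admitted.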
Control and limit retry calls
Error Retries and Exponential Backoff in AWS
- AWS SDKs implement retries with exponential backoff by default. Implement similar logic in your dependency layer when calling your own dependent services. Decide what the timeouts are and when to stop retrying based on your use case.
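For calls to your own dependent services, the retry logic might look like the sketch below. The function name, default values, and the choice of retryable exception types are assumptions for illustration. The jitter variant shown picks a random delay between zero and the capped exponential value.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts calls have been made."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the error to the caller
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: randomize the wait so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Only errors listed in `retryable` are retried. Anything else propagates at once, since retrying a request that can never succeed only adds load.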
Fail fast and limit queues
- Implement fail fast when the service is under stress.
Fail Fast
- Limit queues: In a queue-based system, when processing stops but messages keep arriving, the message debt can accumulate into a large backlog, driving up processing time. Work can be completed too late for the results to be useful, essentially causing the availability hit that queueing was meant to guard against.
The Amazon Builders' Library: Avoiding insurmountable queue backlogs
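Both ideas can be sketched with a bounded in-process queue: reject new work immediately when the queue is full, and discard queued work that is already older than the client is likely to wait. The depth limit, the age limit, and the function names are assumptions for illustration.

```python
import queue
import time

MAX_QUEUE_DEPTH = 100   # assumed bound on buffered requests
MAX_REQUEST_AGE = 2.0   # seconds; assumed to match the client's timeout

work = queue.Queue(maxsize=MAX_QUEUE_DEPTH)


def submit(request, q=work, clock=time.monotonic):
    """Enqueue a request with its arrival time, or fail fast if the queue is full."""
    try:
        q.put_nowait((clock(), request))
        return True
    except queue.Full:
        return False  # the caller should return an error to the client right away


def next_fresh(q=work, max_age=MAX_REQUEST_AGE, clock=time.monotonic):
    """Return the next request that is still fresh, or None if the queue is empty.
    Stale requests are dropped because the client has probably given up on them."""
    while True:
        try:
            enqueued_at, request = q.get_nowait()
        except queue.Empty:
            return None
        if clock() - enqueued_at <= max_age:
            return request
```

Dropping stale entries on the consumer side keeps a backlog from being worked through in arrival order long after the results have stopped being useful.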
Set client timeouts
AWS SDK: Retries and Timeouts
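As a language-level illustration of setting explicit timeouts rather than relying on defaults, the sketch below bounds both the connection attempt and the wait for a response on a plain TCP socket. The function name and the timeout values are assumptions. Appropriate values depend on the measured latency of the dependency.

```python
import socket


def fetch_with_timeout(host, port, payload, connect_timeout=1.0, read_timeout=2.0):
    """Send payload and return the first chunk of the reply.

    Raises socket.timeout if connecting, sending, or waiting for the reply
    takes longer than the configured limits."""
    # Bound the time spent establishing the connection.
    with socket.create_connection((host, port), timeout=connect_timeout) as conn:
        # Bound each subsequent blocking operation on the socket.
        conn.settimeout(read_timeout)
        conn.sendall(payload)
        return conn.recv(4096)
```

A caller would typically catch `socket.timeout` and then fail fast, retry within its retry limit, or fall back to a static response.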
Make services stateless where possible
- Remove state that could actually be stored in request parameters.
- Some data (like cookies) can be passed in headers or query parameters.
- Refactor to remove state that can be quickly passed in requests.
- After examining whether the state is required, move any state tracking to a resilient multi-zone cache or data store like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, or a third-party distributed data solution.
- Store a state that could not be moved to resilient data stores.
- Some data may not actually be needed per request and can be retrieved on demand.
- Remove data that can be asynchronously retrieved.
- Decide on a data store that meets the requirements for a required state.
- Consider a NoSQL database for non-relational data.
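The refactoring above can be sketched as a request handler that keeps nothing in local memory between requests. `SessionStore` is a hypothetical stand-in for an external store such as Amazon ElastiCache or Amazon DynamoDB, backed here by a dictionary only so the example runs locally. `add_to_cart` is likewise an illustrative name.

```python
import json


class SessionStore:
    """Stand-in for an external, resilient data store (assumption: a dict
    replaces the real service so the sketch is runnable on its own)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        raw = self._items.get(key)
        return json.loads(raw) if raw is not None else None

    def put(self, key, value):
        # Serialize, as a real remote store would require.
        self._items[key] = json.dumps(value)


def add_to_cart(store, session_id, item):
    """Stateless handler: all state is read from and written back to the store,
    so any server instance can process any request for this session."""
    cart = store.get(session_id) or []
    cart.append(item)
    store.put(session_id, cart)
    return cart
```

Because the handler owns no state, a server can be replaced between two requests from the same client without losing the cart. A real store would also need an atomic or conditional update to avoid lost writes when two requests for one session arrive concurrently.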
Implement emergency levers
- Tips for implementing and using emergency levers
- When levers are activated, do LESS, not more
- Keep it simple and avoid bimodal behavior
- Test your levers periodically
- These are examples of actions that are NOT emergency levers
- Add capacity
- Call up service owners of clients that depend on your service and ask them to reduce calls
- Make a change to code and release it
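The two example levers named earlier (blocking robot traffic and serving a static response) can be sketched as simple flags checked at the top of a request handler. The class, the flag names, the `is_robot` request field, and the status codes are hypothetical. In practice the flags would live in shared configuration that every server reads, so that pulling a lever needs no code release.

```python
class EmergencyLevers:
    """Hypothetical lever flags; each has a single on/off state."""

    def __init__(self):
        self.block_robots = False
        self.serve_static = False


STATIC_PAGE = "Service is running in reduced mode."


def handle(request, levers, render):
    """Return (status, body). Active levers only remove work from the request path."""
    if levers.block_robots and request.get("is_robot"):
        return 403, "Automated traffic is temporarily blocked"
    if levers.serve_static:
        return 200, STATIC_PAGE  # skip all dynamic rendering and dependency calls
    return 200, render(request)
```

Each lever makes the handler do less, in line with the tips above, and the lever path is simple enough to exercise in a periodic test.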