REL 4: How do you design interactions in a distributed system to prevent failures?
Distributed systems rely on communications networks to interconnect components, such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices prevent failures and improve mean time between failures (MTBF).
Resources
AWS re:Invent 2019: Moving to event-driven architectures (SVS308)
AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes loose coupling, constant work, static stability)
AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)
What Is Amazon EventBridge?
What Is Amazon Simple Queue Service?
Amazon EC2: Ensuring Idempotency
The Amazon Builders' Library: Challenges with distributed systems
Best Practices:
- Identify which kind of distributed system is required: Hard real-time distributed systems require responses to be given synchronously and rapidly, while soft real-time systems have a more generous time window of minutes or more for response. Offline systems handle responses through batch or asynchronous processing. Hard real-time distributed systems have the most stringent reliability requirements.
- Implement loosely coupled dependencies: Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
- Make all responses idempotent: An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request. An idempotent service makes it easier for a client to implement retries without fear that a request will be erroneously processed multiple times. To do this, clients can issue API requests with an idempotency token; the same token is used whenever the request is repeated. An idempotent service API uses the token to return a response identical to the response that was returned the first time that the request was completed.
- Do constant work: Systems can fail when there are large, rapid changes in load. For example, a health check system that monitors the health of thousands of servers should send the same size payload (a full snapshot of the current state) each time. Whether no servers are failing, or all of them, the health check system is doing constant work with no large, rapid changes.
Improvement Plan
Identify which kind of distributed system is required
The Amazon Builders' Library: Challenges with distributed systems
- Hard real-time distributed systems require responses to be given synchronously and rapidly.
- Soft real-time systems have a more generous time window of minutes or more for response.
- Offline systems handle responses through batch or asynchronous processing.
- Hard real-time distributed systems have the most stringent reliability requirements.
Implement loosely coupled dependencies
AWS re:Invent 2019: Moving to event-driven architectures (SVS308)
What Is Amazon EventBridge?
What Is Amazon Simple Queue Service?
- Amazon EventBridge allows you to build event-driven architectures, which are loosely coupled and distributed.
AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)
- If changes to one component force other components that rely on it to also change, then they are tightly coupled. Loose coupling breaks this dependency so that dependent components only need to know the versioned and published interface.
- Make component interactions asynchronous where possible. This model is suitable for any interaction that does not need an immediate response and where an acknowledgement that a request has been registered will suffice.
AWS re:Invent 2019: Scalable serverless event-driven applications using Amazon SQS and Lambda (API304)
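The asynchronous model described above can be sketched in a few lines. This is a minimal illustration only: it uses Python's standard `queue.Queue` as an in-process stand-in for a managed queue such as Amazon SQS, and the names `submit_order`, `worker`, and `order_queue` are hypothetical. The producer receives an acknowledgement as soon as the request is registered, and the consumer processes messages at its own pace.

```python
import queue
import threading
import uuid

# Assumption: an in-process queue stands in for a managed queue service.
order_queue = queue.Queue()
processed = []


def submit_order(item):
    """Register the request and return an acknowledgement immediately."""
    request_id = str(uuid.uuid4())
    order_queue.put({"id": request_id, "item": item})
    return request_id  # the caller does not wait for processing


def worker():
    """Consume messages independently of the producer."""
    while True:
        message = order_queue.get()
        if message is None:  # sentinel value: stop the worker
            order_queue.task_done()
            break
        processed.append(message["item"])
        order_queue.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# The producer only needs the queue interface, not the consumer's internals.
ack_ids = [submit_order(item) for item in ("book", "lamp", "desk")]

order_queue.put(None)
order_queue.join()  # wait until every queued message has been handled
consumer.join()
print(processed)
```

Because the producer depends only on the queue, the consumer can be slowed down, restarted, or replaced without changing the producer.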
Make all responses idempotent
- Clients can issue API requests with an idempotency token; the same token is used whenever the request is repeated. An idempotent service API uses the token to return a response identical to the response that was returned the first time that the request was completed.
Amazon EC2: Ensuring Idempotency
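The token mechanism can be illustrated with a small sketch. It assumes a hypothetical `OrderService` that keeps responses in an in-memory dictionary keyed by token; a real service would need durable, shared storage and a retention policy for tokens, which this example leaves out.

```python
import uuid


class OrderService:
    """Hypothetical service that deduplicates requests by idempotency token."""

    def __init__(self):
        self._responses = {}  # token -> response returned the first time
        self.orders_created = 0

    def create_order(self, token, item):
        # A repeated token returns the stored response without redoing work.
        if token in self._responses:
            return self._responses[token]
        self.orders_created += 1
        response = {"order_id": f"order-{self.orders_created}", "item": item}
        self._responses[token] = response
        return response


service = OrderService()
token = str(uuid.uuid4())  # the client generates one token per logical request

first = service.create_order(token, "book")
retry = service.create_order(token, "book")  # e.g. a retry after a timeout

print(first == retry, service.orders_created)
```

The retry returns the same response as the first call, and only one order is created, so the client can retry safely after a timeout or network error.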
Do constant work
AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes constant work)
- Engineer workloads so that payload sizes remain constant regardless of number of successes or failures.
- For example, if the health check system is monitoring 100,000 servers, the load on it is nominal under the normally light server failure rate. However, if a major event makes half those servers unhealthy, then the health check system would be overwhelmed trying to update notification systems and communicate state to its clients. Instead, the health check system should send the full snapshot of the current state each time. 100,000 server health states, each represented by a bit, would only be a 12.5-KB payload. Whether no servers are failing, or all of them, the health check system is doing constant work, and large, rapid changes are not a threat to the system stability. This is actually how the control plane is designed for Amazon Route 53 health checks.
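The payload arithmetic in the example can be checked with a short sketch. The `snapshot` function and its bit layout are assumptions made for illustration, not a description of how Route 53 encodes health state. It packs one bit per server, so 100,000 servers always produce 100,000 / 8 = 12,500 bytes (12.5 KB), whatever the failure rate.

```python
def snapshot(healthy):
    """Pack one bit per server into a fixed-size payload (1 = healthy)."""
    payload = bytearray((len(healthy) + 7) // 8)
    for index, is_healthy in enumerate(healthy):
        if is_healthy:
            payload[index // 8] |= 1 << (index % 8)
    return bytes(payload)


SERVER_COUNT = 100_000

all_healthy = snapshot([True] * SERVER_COUNT)
half_failed = snapshot([i % 2 == 0 for i in range(SERVER_COUNT)])
all_failed = snapshot([False] * SERVER_COUNT)

# The payload size depends only on the number of servers, not on their state.
print(len(all_healthy), len(half_failed), len(all_failed))  # 12500 12500 12500
```

Sending the full snapshot costs slightly more in the common case than sending only changes, but the cost is identical during a major event, which is the point of doing constant work.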