This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 4: How do you design interactions in a distributed system to prevent failures?

Distributed systems rely on communications networks to interconnect components, such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices prevent failures and improve mean time between failures (MTBF).

Resources

AWS re:Invent 2019: Moving to event-driven architectures (SVS308)
AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes loose coupling, constant work, static stability)
AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)
What Is Amazon EventBridge?
What Is Amazon Simple Queue Service?
Amazon EC2: Ensuring Idempotency
The Amazon Builders' Library: Challenges with distributed systems

Best Practices:

Identify which kind of distributed system is required: Hard real-time distributed systems require responses to be given synchronously and rapidly, while soft real-time systems have a more generous time window of minutes or more for response. Offline systems handle responses through batch or asynchronous processing. Hard real-time distributed systems have the most stringent reliability requirements.
Implement loosely coupled dependencies: Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility
Make all responses idempotent: An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request. An idempotent service makes it easier for a client to implement retries without fear that a request will be erroneously processed multiple times. To do this, clients can issue API requests with an idempotency token—the same token is used whenever the request is repeated. An idempotent service API uses the token to return a response identical to the response that was returned the first time that the request was completed.
Do constant work: Systems can fail when there are large, rapid changes in load. For example, a health check system that monitors the health of thousands of servers should send the same size payload (a full snapshot of the current state) each time. Whether no servers are failing, or all of them, the health check system is doing constant work with no large, rapid changes.

Improvement Plan

Identify which kind of distributed system is required

Identify which kind of distributed system is required: Challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems grow larger and more distributed, what had been theoretical edge cases turn into regular occurrences.
The Amazon Builders' Library: Challenges with distributed systems

Hard real-time distributed systems require responses to be given synchronously and rapidly.
Soft real-time systems have a more generous time window of minutes or greater for response
Offline systems handle responses through batch or asynchronous processing.
Hard real-time distributed systems have the most stringent reliability requirements.

Implement loosely coupled dependencies

Implement loosely coupled dependencies: Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
AWS re:Invent 2019: Moving to event-driven architectures (SVS308)
What Is Amazon EventBridge?
What Is Amazon Simple Queue Service?

Amazon EventBridge allows you to build event driven architectures, which are loosely coupled and distributed
AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)
If changes to one component force other components that rely on it to also change, then they are tightly coupled. Loose coupling breaks this dependency so that dependency components only need to know the versioned and published interface.
Make component interactions asynchronous where possible. This model is suitable for any interaction that does not need an immediate response and where an acknowledgement that a request has been registered will suffice.
AWS re:Invent 2019: Scalable serverless event-driven applications using Amazon SQS and Lambda (API304)

Make all responses idempotent

Make all responses idempotent: An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request.

Clients can issue API requests with an idempotency token—the same token is used whenever the request is repeated. An idempotent service API uses the token to return a response identical to the response that was returned the first time that the request was completed.
Amazon EC2: Ensuring Idempotency

Do constant work

Do constant work: Systems can fail when there are large, rapid changes in load
AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes constant work)

Engineer workloads so that payload sizes remain constant regardless of number of successes or failures.
- For example, if the health check system is monitoring 100,000 servers, the load on it is nominal under the normally light server failure rate. However, if a major event makes half those servers unhealthy, then the health check system would be overwhelmed trying to update notification systems and communicate state to its clients. Instead, the health check system should send the full snapshot of the current state each time. 100,000 server health states, each represented by a bit, would only be a 12.5-KB payload. Whether no servers or failing, or all of them, the health check system is doing constant work, and large, rapid changes are not a threat to the system stability. This is actually how the control plane is designed for Amazon Route 53 health checks.