REL 4: How do you design interactions in a distributed system to prevent failures?

Distributed systems rely on communications networks to interconnect components, such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices prevent failures and improve mean time between failures (MTBF).

Resources

AWS re:Invent 2019: Moving to event-driven architectures (SVS308)
AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes loose coupling, constant work, static stability)
AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)
What Is Amazon EventBridge?
What Is Amazon Simple Queue Service?
Amazon EC2: Ensuring Idempotency
The Amazon Builders' Library: Challenges with distributed systems

Best Practices:

Improvement Plan

Identify which kind of distributed system is required

  • Identify which kind of distributed system is required: Challenges with distributed systems involved latency, scaling, understanding networking APIs, marshalling and unmarshalling data, and the complexity of algorithms such as Paxos. As the systems grow larger and more distributed, what had been theoretical edge cases turn into regular occurrences.
    The Amazon Builders' Library: Challenges with distributed systems
  • Implement loosely coupled dependencies

  • Implement loosely coupled dependencies: Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility.
    AWS re:Invent 2019: Moving to event-driven architectures (SVS308)
    What Is Amazon EventBridge?
    What Is Amazon Simple Queue Service?
  • Make all responses idempotent

  • Make all responses idempotent: An idempotent service promises that each request is completed exactly once, such that making multiple identical requests has the same effect as making a single request.
  • Do constant work

  • Do constant work: Systems can fail when there are large, rapid changes in load
    AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes constant work)