This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 12: How do you test reliability?

After you have designed your workload to be resilient to the stresses of production, testing is the only way to ensure that it will operate as designed, and deliver the resiliency you expect.

Resources

Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
Resilience Engineering: Learning to Embrace Failure
AWS re:Invent 2019: Improving resiliency with chaos engineering (DOP309-R1)
Continuous Delivery and Continuous Integration
Using Canaries (Amazon CloudWatch Synthetics)
Use CodePipeline with AWS CodeBuild to test code and run builds
Automate your operational playbooks with AWS Systems Manager
Principles of Chaos Engineering
Apache JMeter
Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri. “Chaos Engineering” (August 2017)
AWS Marketplace: products that can be used for continuous integration
APN Partner: partners that can help with implementation of a continuous integration pipeline

Best Practices:

Improvement Plan

Use playbooks to investigate failures

  • Use playbooks to identify issues: Playbooks are documented processes to investigate issues. Enable consistent and prompt responses to failure scenarios by documenting processes in playbooks. Playbooks must contain the information and guidance necessary for an adequately skilled person to gather applicable information, identify potential sources of failure, isolate faults, and determine contributing factors (i.e. perform post-incident analysis).
  • Perform post-incident analysis

  • Establish a standard for your post-incident analysis: Good post-incident analysis provides opportunities to propose common solutions for problems with architecture patterns that are used in other places in your systems.
  • Use a process to determine contributing factors: Have a process to identify and document the contributing factors of an event so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses. Communicate contributing factors as appropriate, tailored to target audiences.
    What is log analytics?
  • Test functional requirements

  • Test functional requirements: These include unit tests and integration tests that validate required functionality.
    Use CodePipeline with AWS CodeBuild to test code and run builds
    AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild
    Continuous Delivery and Continuous Integration
    Using Canaries (Amazon CloudWatch Synthetics)
    Software test automation
  • Test scaling and performance requirements

  • Test scaling and performance requirements: Perform load testing to validate that the workload meets scaling and performance requirements.
    Distributed Load Testing on AWS: simulate thousands of connected users
    Apache JMeter
  • Test resiliency using chaos engineering

  • Test resiliency using chaos engineering: Run tests that inject failures regularly into pre-production and production environments. Hypothesize how your workload will react to the failure, then compare your hypothesis to the testing results and iterate if they do not match. Ensure that production testing does not impact users.
    Principles of Chaos Engineering
    Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
    Injecting Chaos to Amazon EC2 using AWS Systems Manager
    AWS re:Invent 2019: Improving resiliency with chaos engineering (DOP309-R1)
  • Conduct game days regularly

  • Schedule game days to regularly exercise your runbooks and playbooks.: Game days should involve everyone who would be involved in a production interruption: business owner, development staff, operational staff, and incident response teams.