This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 12: How do you test reliability?

After you have designed your workload to be resilient to the stresses of production, testing is the only way to ensure that it will operate as designed, and deliver the resiliency you expect.

Resources

Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
Resilience Engineering: Learning to Embrace Failure
AWS re:Invent 2019: Improving resiliency with chaos engineering (DOP309-R1)
Continuous Delivery and Continuous Integration
Using Canaries (Amazon CloudWatch Synthetics)
Use CodePipeline with AWS CodeBuild to test code and run builds
Automate your operational playbooks with AWS Systems Manager
Principles of Chaos Engineering
Apache JMeter
Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri. “Chaos Engineering” (August 2017)
AWS Marketplace: products that can be used for continuous integration
APN Partner: partners that can help with implementation of a continuous integration pipeline

Best Practices:

Use playbooks to investigate failures: Enable consistent and prompt responses to failure scenarios that are not well understood, by documenting the investigation process in playbooks. Playbooks are the predefined steps performed to identify the factors contributing to a failure scenario. The results from any process step are used to determine the next steps to take until the issue is identified or escalated.
Perform post-incident analysis: Review customer-impacting events, and identify the contributing factors and preventative action items. Use this information to develop mitigations to limit or prevent recurrence. Develop procedures for prompt and effective responses. Communicate contributing factors and corrective actions as appropriate, tailored to target audiences. Have a method to communicate these causes to others as needed.
Test functional requirements: These include unit tests and integration tests that validate required functionality.
Test scaling and performance requirements: This includes load testing to validate that the workload meets scaling and performance requirements.
Test resiliency using chaos engineering: Run tests that inject failures regularly into pre-production and production environments. Hypothesize how your workload will react to the failure, then compare your hypothesis to the testing results and iterate if they do not match. Ensure that production testing does not impact users.
Conduct game days regularly: Use game days to regularly exercise your failure procedures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios. Game days enforce measures to ensure that production testing does not impact users.

Improvement Plan

Use playbooks to investigate failures

Use playbooks to identify issues: Playbooks are documented processes to investigate issues. Enable consistent and prompt responses to failure scenarios by documenting processes in playbooks. Playbooks must contain the information and guidance necessary for an adequately skilled person to gather applicable information, identify potential sources of failure, isolate faults, and determine contributing factors (i.e. perform post-incident analysis).

Implement playbooks as code: Perform your operations as code by scripting your playbooks to ensure consistency and limit reduce errors caused by manual processes. Playbooks can be composed of multiple scripts representing the different steps that might be necessary to identify the contributing factors to an issue. Runbook activities can be triggered or performed as part of playbook activities, or may prompt for execution of a playbook in response to identified events.
Automate your operational playbooks with AWS Systems Manager
AWS Systems Manager Run Command
AWS Systems Manager Automation
What is AWS Lambda?
What Is Amazon EventBridge?
Using Amazon CloudWatch Alarms

Perform post-incident analysis

Establish a standard for your post-incident analysis: Good post-incident analysis provides opportunities to propose common solutions for problems with architecture patterns that are used in other places in your systems.

Ensure that the contributing factors are honest and blame free.
If you do not document your problems, you cannot correct them.
- Ensure post-incident analysis is blame free so you can be dispassionate about the proposed corrective actions and promote honest self-assessment and collaboration on your application teams.

Use a process to determine contributing factors: Have a process to identify and document the contributing factors of an event so that you can develop mitigations to limit or prevent recurrence and you can develop procedures for prompt and effective responses. Communicate contributing factors as appropriate, tailored to target audiences.
What is log analytics?

Test functional requirements

Test functional requirements: These include unit tests and integration tests that validate required functionality.
Use CodePipeline with AWS CodeBuild to test code and run builds
AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild
Continuous Delivery and Continuous Integration
Using Canaries (Amazon CloudWatch Synthetics)
Software test automation

Test scaling and performance requirements

Test scaling and performance requirements: Perform load testing to validate that the workload meets scaling and performance requirements.
Distributed Load Testing on AWS: simulate thousands of connected users
Apache JMeter

Deploy your application in an environment identical to your production environment and execute a load test.
- Use infrastructure as code concepts to create an environment as similar to your production environment as possible.

Test resiliency using chaos engineering

Test resiliency using chaos engineering: Run tests that inject failures regularly into pre-production and production environments. Hypothesize how your workload will react to the failure, then compare your hypothesis to the testing results and iterate if they do not match. Ensure that production testing does not impact users.
Principles of Chaos Engineering
Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
AWS re:Invent 2019: Improving resiliency with chaos engineering (DOP309-R1)

To inject fault into your workload use open source software
The Chaos ToolKit
Shopify Toxiproxy
Netflix Chaos Monkey
Or use commercial software available through AWS Marketplace
Gremlin
Or create your own failure injection code
Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
Test the failure of all components and external dependencies.
- Simulate conditions that can produce brownouts using extensions to common proxies to introduce latency and dropped messages. You can also create your own implementations to create brownout conditions.

Conduct game days regularly

Schedule game days to regularly exercise your runbooks and playbooks.: Game days should involve everyone who would be involved in a production interruption: business owner, development staff, operational staff, and incident response teams.

Execute your load or performance tests and then execute your failure injection.
Look for anomalies in your runbooks and opportunities to exercise your playbooks.
- If you deviate from your runbooks, refine the runbook or correct the behavior. If you exercise your playbook, identify the runbook that should have been used, or create a new one.