REL 12: How do you test reliability?
After you have designed your workload to be resilient to the stresses of production, testing is the only way to ensure that it will operate as designed, and deliver the resiliency you expect.
Resources
Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
Resilience Engineering: Learning to Embrace Failure
AWS re:Invent 2019: Improving resiliency with chaos engineering (DOP309-R1)
Continuous Delivery and Continuous Integration
Using Canaries (Amazon CloudWatch Synthetics)
Use CodePipeline with AWS CodeBuild to test code and run builds
Automate your operational playbooks with AWS Systems Manager
Principles of Chaos Engineering
Apache JMeter
Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, Ali Basiri. “Chaos Engineering” (August 2017)
AWS Marketplace: products that can be used for continuous integration
APN Partner: partners that can help with implementation of a continuous integration pipeline
Best Practices:
- Use playbooks to investigate failures: Enable consistent and prompt responses to failure scenarios that are not well understood, by documenting the investigation process in playbooks. Playbooks are the predefined steps performed to identify the factors contributing to a failure scenario. The results from any process step are used to determine the next steps to take until the issue is identified or escalated.
- Perform post-incident analysis: Review customer-impacting events, and identify the contributing factors and preventative action items. Use this information to develop mitigations to limit or prevent recurrence. Develop procedures for prompt and effective responses. Have a method to communicate contributing factors and corrective actions to others as needed, tailored to target audiences.
- Test functional requirements: These include unit tests and integration tests that validate required functionality.
- Test scaling and performance requirements: This includes load testing to validate that the workload meets scaling and performance requirements.
- Test resiliency using chaos engineering: Run tests that inject failures regularly into pre-production and production environments. Hypothesize how your workload will react to the failure, then compare your hypothesis to the testing results and iterate if they do not match. Ensure that production testing does not impact users.
- Conduct game days regularly: Use game days to regularly exercise your failure procedures as close to production as possible (including in production environments) with the people who will be involved in actual failure scenarios. During game days, enforce measures to ensure that production testing does not impact users.
Improvement Plan
Use playbooks to investigate failures
- Implement playbooks as code: Perform your operations as code by scripting your playbooks to ensure consistency and reduce errors caused by manual processes. Playbooks can be composed of multiple scripts representing the different steps that might be necessary to identify the contributing factors to an issue. Runbook activities can be triggered or performed as part of playbook activities, or a runbook may prompt execution of a playbook in response to identified events.
Automate your operational playbooks with AWS Systems Manager
AWS Systems Manager Run Command
AWS Systems Manager Automation
What is AWS Lambda?
What Is Amazon EventBridge?
Using Amazon CloudWatch Alarms
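As an illustration of a playbook implemented as code, the following is a minimal sketch in Python, not an AWS Systems Manager document. Each step is a function that records what it found and names the next step, so the result of one step determines the next until the issue is identified or escalated. The step names, thresholds, and the `findings` keys are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A step inspects the shared findings and returns the name of the next
# step, or None when the investigation is finished.
Step = Callable[[dict], Optional[str]]


@dataclass
class Playbook:
    steps: Dict[str, Step]
    start: str
    max_steps: int = 20  # guard against steps that loop back to each other

    def run(self, findings: dict) -> List[str]:
        """Run steps until one returns None; return the step names visited."""
        visited: List[str] = []
        current: Optional[str] = self.start
        while current is not None:
            if len(visited) >= self.max_steps:
                findings["outcome"] = "escalate: step limit reached"
                break
            visited.append(current)
            current = self.steps[current](findings)
        return visited


# Hypothetical steps for investigating an elevated error rate.
def check_error_rate(findings: dict) -> Optional[str]:
    if findings["error_rate"] < 0.01:  # assumed threshold
        findings["outcome"] = "no action: error rate within threshold"
        return None
    return "check_recent_deploy"


def check_recent_deploy(findings: dict) -> Optional[str]:
    if findings["minutes_since_deploy"] < 30:  # assumed window
        findings["outcome"] = "identified: recent deployment"
        return None
    return "escalate"


def escalate(findings: dict) -> Optional[str]:
    findings["outcome"] = "escalate: contributing factor not identified"
    return None


playbook = Playbook(
    steps={
        "check_error_rate": check_error_rate,
        "check_recent_deploy": check_recent_deploy,
        "escalate": escalate,
    },
    start="check_error_rate",
)

findings = {"error_rate": 0.05, "minutes_since_deploy": 12}
path = playbook.run(findings)
print(path, findings["outcome"])
```

In practice each step would query real telemetry instead of reading values from a dictionary, and the same structure could be triggered by an alarm or event rather than run by hand.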
Perform post-incident analysis
- Ensure that contributing factors are reported honestly and without blame.
- If you do not document your problems, you cannot correct them.
- Ensure post-incident analysis is blame free so you can be dispassionate about the proposed corrective actions and promote honest self-assessment and collaboration on your application teams.
What is log analytics?
Test functional requirements
Use CodePipeline with AWS CodeBuild to test code and run builds
AWS CodePipeline Adds Support for Unit and Custom Integration Testing with AWS CodeBuild
Continuous Delivery and Continuous Integration
Using Canaries (Amazon CloudWatch Synthetics)
Software test automation
Test scaling and performance requirements
Distributed Load Testing on AWS: simulate thousands of connected users
Apache JMeter
- Deploy your application in an environment identical to your production environment and execute a load test.
- Use infrastructure as code concepts to create an environment as similar to your production environment as possible.
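To make the idea of a load test concrete, here is a small Python sketch that issues concurrent requests and reports the error rate and 95th-percentile latency. It is not a substitute for Distributed Load Testing on AWS or Apache JMeter. The function `run_load_test` and the stand-in `fake_request` are hypothetical; a real test would replace `fake_request` with a call to the workload deployed in the production-like environment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def run_load_test(request_fn, total_requests, concurrency):
    """Call request_fn total_requests times using `concurrency` threads."""

    def timed_call(_):
        start = time.perf_counter()
        try:
            request_fn()
            ok = True
        except Exception:
            ok = False  # count any exception as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [elapsed for _, elapsed in results]
    errors = sum(1 for ok, _ in results if not ok)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(latencies, n=100)[94]
    return {
        "requests": total_requests,
        "errors": errors,
        "error_rate": errors / total_requests,
        "p95_seconds": p95,
    }


def fake_request():
    """Hypothetical stand-in for a request to the workload under test."""
    time.sleep(0.01)


report = run_load_test(fake_request, total_requests=50, concurrency=10)
print(report)
```

Comparing the reported error rate and latency against the workload's stated scaling and performance requirements turns the run into a pass or fail check.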
Test resiliency using chaos engineering
Principles of Chaos Engineering
Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
AWS re:Invent 2019: Improving resiliency with chaos engineering (DOP309-R1)
- To inject faults into your workload, use open source software
The Chaos ToolKit
Shopify Toxiproxy
Netflix Chaos Monkey
- Or use commercial software available through AWS Marketplace
Gremlin
- Or create your own failure injection code
Well-Architected lab: Level 300: Testing for Resiliency of EC2 RDS and S3
Injecting Chaos to Amazon EC2 using AWS Systems Manager
- Test the failure of all components and external dependencies.
- Simulate conditions that can produce brownouts using extensions to common proxies to introduce latency and dropped messages. You can also create your own implementations to create brownout conditions.
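One way to write your own failure injection code is to wrap a dependency call so that it adds latency and drops a fraction of calls, then check a stated hypothesis against the observed result. The Python sketch below does this in-process. It does not use any of the tools listed above, and the names `with_faults`, `call_with_retries`, and `lookup_price`, along with the drop rate and the 90% threshold, are assumptions made for the example.

```python
import random
import time


class InjectedFault(Exception):
    """Raised instead of making the real call, to simulate a dropped message."""


def with_faults(fn, latency_seconds=0.0, drop_rate=0.0, rng=None):
    """Wrap fn so each call may be dropped or delayed (brownout conditions)."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < drop_rate:
            raise InjectedFault("injected drop")
        if latency_seconds > 0:
            time.sleep(latency_seconds)  # injected latency
        return fn(*args, **kwargs)

    return wrapper


def call_with_retries(fn, attempts=3):
    """The resiliency mechanism under test: retry on a dropped call."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except InjectedFault as err:
            last_error = err
    raise last_error


def lookup_price():
    """Hypothetical dependency call."""
    return 42


# Hypothesis: with 30% of calls dropped, three attempts keep at least
# 90% of lookups successful.
flaky_lookup = with_faults(
    lookup_price, latency_seconds=0.001, drop_rate=0.3, rng=random.Random(7)
)

trials = 200
successes = 0
for _ in range(trials):
    try:
        call_with_retries(flaky_lookup)
        successes += 1
    except InjectedFault:
        pass

success_rate = successes / trials
hypothesis_holds = success_rate >= 0.90
print(f"success rate {success_rate:.3f}, hypothesis holds: {hypothesis_holds}")
```

If the hypothesis does not hold, the mismatch is the finding: adjust the workload or the hypothesis and run the experiment again.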
Conduct game days regularly
- Execute your load or performance tests and then execute your failure injection.
- Look for anomalies in your runbooks and opportunities to exercise your playbooks.