Enable consistent and prompt responses to failure scenarios by documenting the investigation process in playbooks. A playbook is a predefined set of steps to perform to identify an issue. The result of each step determines the next step to take, until the issue is identified or escalated.
For example, you could define a playbook for network connectivity issues to an application. The initial step might be to determine if the client is able to resolve DNS for the site. The second step might be to determine if the client can reach the host of the site. If the client is able to reach the host, the next step might be to investigate the application web service. If the client is not able to reach the host, the next step might be to determine the path the client takes to the host. Each step should help isolate the source of the issue so that it can ultimately be identified and then addressed.
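The network connectivity example above can be sketched in code. This is a minimal illustration, not a definitive implementation: the function names, the default port, and the returned recommendation strings are assumptions made for the example, and the recommendation for a DNS failure is also an assumption because the text does not specify one.

```python
import socket
from typing import Callable, Optional


def resolve_dns(hostname: str) -> Optional[str]:
    """Step 1: return an IP address for hostname, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def can_reach_host(address: str, port: int, timeout: float = 3.0) -> bool:
    """Step 2: return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_connectivity_playbook(
    hostname: str,
    port: int = 443,  # assumed default; use the port your application serves
    resolver: Callable[[str], Optional[str]] = resolve_dns,
    reachability_check: Callable[[str, int], bool] = can_reach_host,
) -> str:
    """Run the steps in order and return the recommended next step."""
    address = resolver(hostname)
    if address is None:
        # Assumption: the text does not say what follows a DNS failure.
        return "investigate DNS resolution"
    if reachability_check(address, port):
        return "investigate the application web service"
    return "trace the network path from the client to the host"
```

The checks are passed in as parameters so that each step can be replaced or exercised on its own, for example with stubbed results when rehearsing the playbook.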
Playbooks give adequately skilled team members who are unfamiliar with the workload the guidance they need to gather applicable information, identify potential sources of failure, isolate faults, and determine the root cause of issues. Playbooks preserve the institutional knowledge of your organization. They ease the burden on key personnel by sharing their knowledge and enabling more team members to achieve the same outcomes.
Where to start building playbooks
- Prioritize frequently occurring issues to reduce the negative impacts on the business and operations.
- Prioritize issues with significant potential harmful impact to the business or workload to mitigate risk.
What to include in playbooks
- Document the requirements for executing the playbook.
  - Identify required permissions.
  - Identify required tools and configurations.
  - Identify required network connectivity and access.
- Document constraints on the execution of the playbook.
  - Identify conflicts with other business or operations activities.
- Document process steps and expected results.
  - Identify process steps.
  - Identify expected results.
  - Identify follow-on steps based on results from process execution.
- Document escalation processes.
  - Identify to whom the issue should be escalated if the active team member is unable to identify its source.
  - Identify how much time may pass without the source of the issue being identified before the issue is escalated.
  - Identify any third parties to whom escalation may occur and under what circumstances.
  - Identify the support information required to escalate to third parties (for example, serial numbers, support contact information, support contract information).
  - Identify any decision makers and under what circumstances they should be contacted or notified as part of the execution of the process.
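One way to keep the elements listed above together is a small structured record. The following is a sketch under stated assumptions: every class name, field name, and sample value is hypothetical and chosen only to mirror the list, and a real playbook might equally live in a wiki page or a runbook tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    name: str
    expected_result: str
    # Maps an observed result to the name of the follow-on step.
    follow_on: Dict[str, str] = field(default_factory=dict)


@dataclass
class Escalation:
    contact: str
    after_minutes: int
    third_parties: List[str] = field(default_factory=list)
    support_information: Dict[str, str] = field(default_factory=dict)


@dataclass
class Playbook:
    title: str
    required_permissions: List[str]
    required_tools: List[str]
    required_connectivity: List[str]
    constraints: List[str]
    steps: List[Step]
    escalation: Escalation

    def next_step(self, current: str, observed_result: str) -> Optional[str]:
        """Return the follow-on step name, or None if none is documented."""
        for step in self.steps:
            if step.name == current:
                return step.follow_on.get(observed_result)
        raise KeyError(f"unknown step: {current}")

    def should_escalate(self, minutes_elapsed: int) -> bool:
        """Return True once the documented escalation time has passed."""
        return minutes_elapsed >= self.escalation.after_minutes


# Hypothetical sample values, for illustration only.
playbook = Playbook(
    title="Network connectivity to an application",
    required_permissions=["read-only access to DNS records"],
    required_tools=["DNS lookup utility", "route tracing utility"],
    required_connectivity=["access to the client network"],
    constraints=["do not run during a scheduled maintenance window"],
    steps=[
        Step("resolve DNS", "hostname resolves", {"resolved": "reach host"}),
        Step(
            "reach host",
            "host accepts connections",
            {"reachable": "check web service", "unreachable": "trace path"},
        ),
    ],
    escalation=Escalation(contact="on-call network engineer", after_minutes=30),
)
```

In this sketch, a result with no documented follow-on step returns `None`, which the active team member can treat as a signal to escalate.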
Implement your playbooks in code and trigger their execution
- Implement your playbooks in code where appropriate.
  - Identify process steps.
  - Identify expected outcomes.
  - Implement your playbooks in code.
  - Related documentation: AWS Systems Manager Automation; What is AWS Lambda?
- Trigger your playbooks to execute automatically where appropriate.
  - Identify monitoring tests that detect the triggering events.
  - Implement monitoring tests to trigger the automated playbook execution.
  - Related documentation: Creating a CloudWatch Events rule that triggers on an event; Creating a CloudWatch Events rule that triggers on an AWS API call using AWS CloudTrail; CloudWatch Events event examples from supported services
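As an illustration of triggering a coded playbook, the sketch below shows an AWS Lambda handler that starts a Systems Manager Automation execution when it receives a CloudWatch alarm state change event. It assumes a rule already routes those events to the function. The document name `Example-NetworkConnectivityPlaybook` and its `AlarmName` parameter are hypothetical, and the optional `ssm_client` argument is an addition for this example so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, Optional

# Hypothetical name of an Automation document that implements the playbook.
PLAYBOOK_DOCUMENT = "Example-NetworkConnectivityPlaybook"


def should_trigger(event: Dict[str, Any]) -> bool:
    """Return True when the event reports an alarm entering the ALARM state."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "CloudWatch Alarm State Change"
        and detail.get("state", {}).get("value") == "ALARM"
    )


def handler(
    event: Dict[str, Any],
    context: Any = None,
    ssm_client: Any = None,
) -> Optional[str]:
    """Start the playbook automation and return its execution ID, if triggered."""
    if not should_trigger(event):
        return None
    if ssm_client is None:
        # Imported lazily so the filtering logic can be tested without boto3.
        import boto3

        ssm_client = boto3.client("ssm")
    response = ssm_client.start_automation_execution(
        DocumentName=PLAYBOOK_DOCUMENT,
        # Assumption: the hypothetical document accepts an AlarmName parameter.
        Parameters={"AlarmName": [event["detail"]["alarmName"]]},
    )
    return response["AutomationExecutionId"]
```

Keeping the trigger condition in its own function makes it straightforward to review which events start the playbook, separately from what the playbook does.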
Create and revise playbooks as appropriate
- Create playbooks for newly identified failure scenarios.
- Review the execution of playbooks.
  - Identify appropriate optimizations.
  - Identify required revisions.
- Update playbooks, scripts, and automation as appropriate.
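Reviewing playbook executions can start from simple records of past runs. The sketch below assumes each record is a pair of playbook name and whether that run ended in escalation. The record shape, the function names, and the 0.5 threshold are hypothetical choices for illustration, and escalation rate is only one possible signal that a playbook needs revision.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ExecutionRecord = Tuple[str, bool]  # (playbook name, run ended in escalation)


def escalation_rates(records: Iterable[ExecutionRecord]) -> Dict[str, float]:
    """Return the fraction of runs that ended in escalation, per playbook."""
    totals: Dict[str, int] = defaultdict(int)
    escalated: Dict[str, int] = defaultdict(int)
    for playbook_name, was_escalated in records:
        totals[playbook_name] += 1
        if was_escalated:
            escalated[playbook_name] += 1
    return {name: escalated[name] / totals[name] for name in totals}


def needs_revision(
    records: Iterable[ExecutionRecord], threshold: float = 0.5
) -> List[str]:
    """List playbooks whose escalation rate meets or exceeds the threshold."""
    rates = escalation_rates(records)
    return sorted(name for name, rate in rates.items() if rate >= threshold)
```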