OPS 10: How do you manage workload and operations events?

Prepare and validate procedures for responding to events to minimize their disruption to your workload.

Resources

Build a Monitoring Plan
Amazon CloudWatch Features
What is Amazon CloudWatch Events?

Best Practices:

Improvement Plan

Use processes for event, incident, and problem management

  • Use processes for event, incident, and problem management: Have processes to address observed events, events that require intervention (incidents), and events that require intervention and either recur or cannot currently be resolved (problems). Use these processes to mitigate the impact of these events on the business and your customers by ensuring timely and appropriate responses.

Have a process per alert

  • Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, an individual, team, or role) accountable for successful execution. The response may be automated or carried out by another team, but the owner remains accountable for ensuring that the process delivers the expected outcomes. Having these processes ensures effective and prompt responses to operations events and prevents actionable events from being obscured by less valuable notifications. For example, automatic scaling might be applied to scale a web front end, but the operations team might be accountable for ensuring that the automatic scaling rules and limits are appropriate for workload needs.
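
One way to keep every alert tied to a response and an owner is a small registry that can be checked automatically. The sketch below is illustrative only: the alert names, runbook paths, and owners are hypothetical, and a real setup would load them from wherever alerts are actually defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertResponse:
    """Response process for one alert (all values below are hypothetical)."""
    runbook: str     # where the documented response lives
    owner: str       # individual, team, or role accountable for the outcome
    automated: bool  # True if the response runs without human action


# Hypothetical alert names mapped to their response process.
RESPONSES = {
    "web-frontend-high-latency": AlertResponse("runbooks/scale-web.md", "ops-team", True),
    "db-replica-lag": AlertResponse("runbooks/replica-lag.md", "dba-on-call", False),
}


def alerts_without_process(configured_alerts, responses=RESPONSES):
    """Return configured alerts that lack a runbook or an accountable owner."""
    missing = []
    for name in configured_alerts:
        response = responses.get(name)
        if response is None or not response.runbook or not response.owner:
            missing.append(name)
    return missing
```

Running a check like this whenever alerts change would flag any alert that has been added without a defined response.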

Prioritize operational events based on business impact

  • Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. For example, impacts can include loss of life or injury, financial loss, regulatory violations, or damage to reputation or trust.
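
As a minimal sketch of impact-based ordering, the example below sorts open events by an impact level and then by age. The `Impact` scale and its ranking are assumptions made for illustration (they simply follow the order of the examples above); each business needs to define its own.

```python
from enum import IntEnum


class Impact(IntEnum):
    """Hypothetical impact scale; a lower value is handled first."""
    SAFETY = 0      # loss of life or injury
    FINANCIAL = 1   # financial loss
    REGULATORY = 2  # regulatory violations
    REPUTATION = 3  # damage to reputation or trust


def prioritize(events):
    """Order events by impact, then oldest first within the same impact level.

    Each event is a dict with "id", "impact" (an Impact), and "raised_at"
    (seconds since some fixed reference point).
    """
    return sorted(events, key=lambda e: (e["impact"], e["raised_at"]))
```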

Define escalation paths

  • Define escalation paths: Define escalation paths in your runbooks and playbooks, including what triggers an escalation and the procedures for escalating. For example, escalate an issue from support engineers to senior support engineers when runbooks cannot resolve it or when a predefined period of time has elapsed. Similarly, escalate from senior support engineers to the development team for a workload when the playbooks cannot identify a path to remediation or when a predefined period of time has elapsed. Specifically identify owners for each action to ensure effective and prompt responses to operations events. Escalations can include third parties (for example, a network connectivity provider or a software vendor) and identified authorized decision makers for impacted systems.
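
The two triggers described above (the procedure could not resolve the issue, or a predefined time has elapsed) can be sketched as a small escalation table. The owners and time limits below are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    owner: str        # who holds the issue at this level
    max_minutes: int  # time allowed at this level before escalating


# Hypothetical path; owners and limits are illustrative only.
PATH = [
    EscalationStep("support-engineer", 30),
    EscalationStep("senior-support-engineer", 60),
    EscalationStep("workload-dev-team", 120),
]


def should_escalate(step, minutes_at_level, procedure_failed):
    """Escalate when the runbook/playbook failed or the time limit is reached."""
    return procedure_failed or minutes_at_level >= step.max_minutes


def next_owner(level, minutes_at_level, procedure_failed, path=PATH):
    """Return (level, owner) after applying the escalation triggers once."""
    if should_escalate(path[level], minutes_at_level, procedure_failed):
        level = min(level + 1, len(path) - 1)  # stay at the top once reached
    return level, path[level].owner
```

Third parties or authorized decision makers could be modeled as further steps in the same table.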

Enable push notifications

  • Enable push notifications: Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and when the services return to normal operating conditions, to enable users to take appropriate action.
    Amazon SES features
    What is Amazon SES?
    Set up Amazon SNS notifications
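
A minimal sketch of sending impact and recovery notices through an Amazon SNS topic follows. It assumes users are already subscribed to the topic by email or SMS, and that the client passed in behaves like `boto3.client("sns")`, whose `publish` call accepts `TopicArn`, `Subject`, and `Message`. The service name and message wording are hypothetical.

```python
def build_status_message(service, impacted, detail=""):
    """Build the subject and body for an impact or recovery notice."""
    state = "impacted" if impacted else "operating normally"
    subject = f"[{service}] Service {state}"
    body = f"{service} is currently {state}."
    if detail:
        body += f" {detail}"
    return subject, body


def notify_users(sns_client, topic_arn, service, impacted, detail=""):
    """Publish the notice to an SNS topic.

    sns_client is injected (for example, boto3.client("sns")) so the
    function can be exercised without AWS credentials.
    """
    subject, body = build_status_message(service, impacted, detail)
    return sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=body)
```

Calling `notify_users` once when an event begins and again with `impacted=False` on recovery covers both notifications described above.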

Communicate status through dashboards

  • Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. A self-service option for status information reduces the disruption to the operations team of fielding status requests. Examples include Amazon CloudWatch dashboards and AWS Personal Health Dashboard.
    CloudWatch dashboards: create and use customized metrics views
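
As an illustration, a CloudWatch dashboard can be defined as a JSON body and published programmatically. The sketch below only builds the body for a single metric widget. The metric, load balancer identifier, region, and dashboard name are hypothetical, and publishing is shown as a comment because it requires AWS credentials.

```python
import json


def dashboard_body(namespace, metric, dimension_name, dimension_value, region, title):
    """Return a CloudWatch dashboard body (JSON string) with one metric widget."""
    widget = {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "metrics": [[namespace, metric, dimension_name, dimension_value]],
            "period": 300,      # seconds per data point
            "stat": "Average",
            "region": region,
            "title": title,
        },
    }
    return json.dumps({"widgets": [widget]})


# Hypothetical values; publishing would look something like:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="workload-status", DashboardBody=body)
body = dashboard_body("AWS/ApplicationELB", "TargetResponseTime",
                      "LoadBalancer", "app/my-lb/1234567890abcdef",
                      "us-east-1", "Front-end response time")
```

Separate bodies with different widgets could be generated for each audience, such as technical teams and leadership.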

Automate responses to events

  • Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
    What is Amazon CloudWatch Events?
    Creating a CloudWatch Events rule that triggers on an event
    Creating a CloudWatch Events rule that triggers on an AWS API call using AWS CloudTrail
    CloudWatch Events event examples from supported services
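
A rough sketch of an automated response: a CloudWatch Events rule matches an event pattern and invokes a Lambda-style handler. The pattern below matches EC2 instances entering the stopped state. The restart policy and the `start_instance` callable are hypothetical and stand in for whatever remediation a runbook prescribes.

```python
import json

# Event pattern for a rule that matches EC2 instances entering "stopped".
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# The rule itself would be created with something like:
#   boto3.client("events").put_rule(Name="restart-stopped", EventPattern=PATTERN_JSON)
PATTERN_JSON = json.dumps(EVENT_PATTERN)


def handler(event, context, start_instance=None):
    """Lambda-style handler that reacts to a matched event.

    start_instance is a hypothetical callable (for example, a thin wrapper
    around the EC2 StartInstances API), injected so the logic can be tested.
    """
    detail = event.get("detail", {})
    if event.get("source") != "aws.ec2" or detail.get("state") != "stopped":
        return {"action": "ignored"}
    instance_id = detail["instance-id"]
    if start_instance is not None:
        start_instance(instance_id)
    return {"action": "restart-requested", "instance_id": instance_id}
```

Keeping the remediation behind an injected callable makes it possible to validate the response procedure in tests before it runs against a live workload.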