OPS 10: How do you manage workload and operations events?
Prepare and validate procedures for responding to events to minimize their disruption to your workload.
Resources
Build a Monitoring Plan
Amazon CloudWatch Features
What is Amazon CloudWatch Events?
- Use processes for event, incident, and problem management: Have processes to address observed events, events that require intervention (incidents), and events that require intervention and either recur or cannot currently be resolved (problems). Use these processes to mitigate the impact of these events on the business and your customers by ensuring timely and appropriate responses.
- Have a process per alert: Have a well-defined response (runbook or playbook), with a specifically identified owner, for any event for which you raise an alert. This ensures effective and prompt responses to operations events and prevents actionable events from being obscured by less valuable notifications.
- Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. For example, impacts can include loss of life or injury, financial loss, or damage to reputation or trust.
- Define escalation paths: Define escalation paths in your runbooks and playbooks, including what triggers escalation, and procedures for escalation. Specifically identify owners for each action to ensure effective and prompt responses to operations events.
- Enable push notifications: Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and again when the services return to normal operating conditions, to enable users to take appropriate action.
- Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest.
- Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
Improvement Plan
Use processes for event, incident, and problem management
Use processes for event, incident, and problem management: Have processes to address observed events, events that require intervention (incidents), and events that require intervention and either recur or cannot currently be resolved (problems). Use these processes to mitigate the impact of these events on the business and your customers by ensuring timely and appropriate responses.
Have a process per alert
Process per alert: Any event for which you raise an alert should have a well-defined response (runbook or playbook) with a specifically identified owner (for example, an individual, team, or role) accountable for successful execution. The response may be automated or performed by another team, but the owner remains accountable for ensuring that the process delivers the expected outcomes. For example, automatic scaling might be applied to scale a web front end, but the operations team might be accountable for ensuring that the automatic scaling rules and limits are appropriate for workload needs. With these processes in place, you ensure effective and prompt responses to operations events and prevent actionable events from being obscured by less valuable notifications.
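As a rough illustration of "a process per alert", the sketch below keeps a registry that maps each alert to a runbook and an accountable owner, and reports any alert that lacks one. The alert names, runbook paths, and owner names are hypothetical placeholders, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertProcess:
    runbook: str  # location of the documented response (hypothetical path)
    owner: str    # individual, team, or role accountable for the outcome


# Hypothetical alert names, runbook locations, and owners.
ALERT_PROCESSES = {
    "web-frontend-high-latency": AlertProcess("runbooks/web-latency.md", "web-ops-team"),
    "autoscaling-at-max-capacity": AlertProcess("runbooks/scaling-limits.md", "operations-team"),
}


def alerts_without_process(configured_alerts, processes=ALERT_PROCESSES):
    """Return, in sorted order, the alerts that lack a runbook or an owner."""
    missing = []
    for name in configured_alerts:
        process = processes.get(name)
        if process is None or not process.runbook or not process.owner:
            missing.append(name)
    return sorted(missing)
```

A check like this could run whenever alert definitions change, so that no alert is raised without a documented response and an accountable owner.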
Prioritize operational events based on business impact
Prioritize operational events based on business impact: Ensure that when multiple events require intervention, those that are most significant to the business are addressed first. For example, impacts can include loss of life or injury, financial loss, regulatory violations, or damage to reputation or trust.
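One way to make this prioritization explicit is to rank impact categories and sort open events by that rank. The sketch below is only an assumption about how such a ranking might look: the category order follows the examples above, but the relative ranking, the class names, and the event names are illustrative, and each business should define its own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    # Lower value is handled first. This ranking is an illustrative assumption.
    LIFE_OR_INJURY = 0
    FINANCIAL_LOSS = 1
    REGULATORY_VIOLATION = 2
    REPUTATION_OR_TRUST = 3


@dataclass(frozen=True)
class OperationalEvent:
    name: str
    impact: Impact
    raised_at: float  # seconds since the epoch


def prioritize(events):
    """Order events by business impact, then by age (oldest first)."""
    return sorted(events, key=lambda e: (e.impact, e.raised_at))
```

For example, an event tagged `Impact.LIFE_OR_INJURY` is returned ahead of an older event tagged `Impact.REPUTATION_OR_TRUST`.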
Define escalation paths
Define escalation paths: Define escalation paths in your runbooks and playbooks, including what triggers escalation and the procedures for escalation. For example, escalate an issue from support engineers to senior support engineers when the runbooks cannot resolve it or when a predefined period of time has elapsed. Similarly, escalate from senior support engineers to the development team for a workload when the playbooks cannot identify a path to remediation or when a predefined period of time has elapsed. Specifically identify owners for each action to ensure effective and prompt responses to operations events. Escalations can include third parties, such as a network connectivity provider or a software vendor, and identified authorized decision makers for impacted systems.
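The escalation triggers described above (a runbook or playbook that cannot resolve the issue, or a predefined period of time elapsing) can be sketched as a small lookup table. The owner names and time limits below are hypothetical values chosen for illustration only.

```python
# Maps each owner to (time limit in minutes, next owner in the path).
# Owner names and time limits are hypothetical.
ESCALATION_PATH = {
    "support-engineer": (30, "senior-support-engineer"),
    "senior-support-engineer": (60, "workload-development-team"),
    "workload-development-team": (120, None),  # end of this path
}


def next_owner(current_owner, minutes_elapsed, procedure_exhausted):
    """Return who should own the issue now.

    Escalate when the runbook or playbook could not resolve the issue
    (procedure_exhausted) or when the owner's time limit has elapsed.
    """
    time_limit, escalate_to = ESCALATION_PATH[current_owner]
    should_escalate = procedure_exhausted or minutes_elapsed >= time_limit
    if should_escalate and escalate_to is not None:
        return escalate_to
    return current_owner
```

A fuller version would add entries for third parties and authorized decision makers; they are omitted here to keep the sketch short.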
Enable push notifications
Enable push notifications: Communicate directly with your users (for example, with email or SMS) when the services they use are impacted, and again when the services return to normal operating conditions, so that users can take appropriate action.
Amazon SES features
What is Amazon SES?
Set up Amazon SNS notifications
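The impact and recovery notifications described in this section can be sketched as a pair of helper functions. This is a minimal sketch under stated assumptions: the message wording and the function names are hypothetical, and delivery is delegated to a caller-supplied `send(subject, body)` callable, which in practice might wrap an email or SMS service such as Amazon SES or Amazon SNS.

```python
def build_status_message(service, impacted, detail=""):
    """Build the subject and body for an impact or recovery notice."""
    if impacted:
        subject = f"[{service}] Service impact"
        body = f"{service} is currently impacted."
    else:
        subject = f"[{service}] Service restored"
        body = f"{service} has returned to normal operation."
    if detail:
        body += f" {detail}"
    return {"subject": subject, "body": body}


def notify_users(send, service, impacted, detail=""):
    """Send one notice through the supplied delivery function.

    `send` is any callable taking (subject, body); it is an assumption
    of this sketch, not a specific AWS API.
    """
    message = build_status_message(service, impacted, detail)
    send(message["subject"], message["body"])
    return message
```

Calling `notify_users` once with `impacted=True` and again with `impacted=False` covers both halves of the practice: telling users when a service is impacted and when it has recovered.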
Communicate status through dashboards
Communicate status through dashboards: Provide dashboards tailored to their target audiences (for example, internal technical teams, leadership, and customers) to communicate the current operating status of the business and provide metrics of interest. Providing a self-service option for status information reduces the disruption to the operations team of fielding requests for status. Examples include Amazon CloudWatch dashboards and AWS Personal Health Dashboard.
CloudWatch dashboards create and use customized metrics views
Automate responses to events
Automate responses to events: Automate responses to events to reduce errors caused by manual processes, and to ensure prompt and consistent responses.
What is Amazon CloudWatch Events?
Creating a CloudWatch Events rule that triggers on an event
Creating a CloudWatch Events rule that triggers on an AWS API call using AWS CloudTrail
CloudWatch Events event examples from supported services
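To illustrate an automated response, the sketch below shows a handler that reacts to an EC2 instance state-change event of the kind CloudWatch Events delivers. The matcher is a deliberately simplified stand-in that supports only exact-value lists and nested objects, not the full event pattern syntax, and the returned action names are hypothetical. A real handler would call the remediation or ticketing system your runbook specifies.

```python
# Pattern in the style of a CloudWatch Events rule: react when an
# EC2 instance enters the "stopped" or "terminated" state.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Simplified matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def handler(event, context=None):
    """Decide on a response for an incoming event.

    The returned action names are hypothetical placeholders.
    """
    if not matches(EVENT_PATTERN, event):
        return {"action": "ignored"}
    detail = event["detail"]
    return {
        "action": "open-incident",
        "instance_id": detail["instance-id"],
        "state": detail["state"],
    }
```

Keeping the decision logic in a pure function like `handler` makes the automated response easy to test before it is attached to a live rule.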