OPS 9: How do you understand the health of your operations?
Define, capture, and analyze operations metrics to gain visibility into operations events so that you can take appropriate action.
Resources
Build a Monitoring Plan
Detect and React to Changes in Pipeline State with Amazon CloudWatch Events
AWS Answers: Centralized Logging
Best Practices:
- Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business outcomes (for example, new features delivered) and customer outcomes (for example, customer support cases). Evaluate KPIs to determine operations success.
- Define operations metrics: Define operations metrics to measure the achievement of KPIs (for example, successful deployments and failed deployments). Define operations metrics to measure the health of operations activities (for example, mean time to detect (MTTD) an incident and mean time to recovery (MTTR) from an incident). Evaluate metrics to determine if operations are achieving desired outcomes, and to understand the health of your operations activities.
- Collect and analyze operations metrics: Perform regular, proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
- Establish operations metrics baselines: Establish baselines for metrics to provide expected values as the basis for comparison and for identifying underperforming and overperforming operations activities.
- Learn the expected patterns of activity for operations: Establish patterns of operations activities to identify anomalous activity so that you can respond appropriately if necessary.
- Alert when operations outcomes are at risk: Raise an alert when operations outcomes are at risk so that you can respond appropriately if necessary.
- Alert when operations anomalies are detected: Raise an alert when operations anomalies are detected so that you can respond appropriately if necessary.
- Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business-level view of your operations activities to help you determine if you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary.
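The operations metrics named in the best practices above (successful and failed deployments, MTTD, and MTTR) can be computed from basic records. The following is a minimal sketch, not an AWS API: the `Incident` record, its field names, and the helper functions are hypothetical, and it assumes MTTR is measured from the start of the incident rather than from detection.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    started: datetime    # when the fault began
    detected: datetime   # when operations became aware of it
    recovered: datetime  # when normal service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average delay between an incident starting and being detected.

    Raises statistics.StatisticsError if the list is empty.
    """
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average time from incident start to recovery (assumed definition)."""
    return timedelta(seconds=mean(
        (i.recovered - i.started).total_seconds() for i in incidents))


def deployment_success_rate(successful: int, failed: int) -> float:
    """Fraction of deployments that succeeded."""
    total = successful + failed
    if total == 0:
        raise ValueError("no deployments recorded")
    return successful / total


if __name__ == "__main__":
    incidents = [
        Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 10),
                 datetime(2024, 1, 5, 11, 0)),
        Incident(datetime(2024, 1, 9, 12, 0), datetime(2024, 1, 9, 12, 20),
                 datetime(2024, 1, 9, 12, 40)),
    ]
    print("MTTD:", mean_time_to_detect(incidents))
    print("MTTR:", mean_time_to_recovery(incidents))
    print("Deployment success rate:", deployment_success_rate(18, 2))
```

Values like these could then be published as custom metrics so that they can be reviewed and alarmed on alongside other operations data.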
Improvement Plan
Identify key performance indicators
Define operations metrics
- Publish custom metrics
- Searching and filtering log data
- Amazon CloudWatch metrics and dimensions reference
Collect and analyze operations metrics
- Using Amazon CloudWatch metrics
- Amazon CloudWatch metrics and dimensions reference
- Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent
Establish operations metrics baselines
Learn the expected patterns of activity for operations
Alert when operations outcomes are at risk
- What is Amazon CloudWatch Events?
- Creating Amazon CloudWatch alarms
- Invoking Lambda functions using Amazon SNS notifications
Alert when operations anomalies are detected
- What is Amazon CloudWatch Events?
- Creating Amazon CloudWatch alarms
- Invoking Lambda functions using Amazon SNS notifications
Validate the achievement of outcomes and the effectiveness of KPIs and metrics
- Using Amazon CloudWatch dashboards
- What is log analytics?
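The baseline and anomaly-alerting practices described earlier can be illustrated with a small, service-independent sketch. It assumes a simple statistical baseline (mean and sample standard deviation of past values) and a hypothetical `tolerance` of three standard deviations. This is one possible approach chosen for illustration, not the method used by Amazon CloudWatch, and the function names are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(history: list[float]) -> tuple[float, float]:
    """Return (expected value, sample standard deviation) for past values."""
    if len(history) < 2:
        raise ValueError("need at least two data points for a baseline")
    return mean(history), stdev(history)


def classify(value: float, baseline: tuple[float, float],
             tolerance: float = 3.0) -> str:
    """Compare a new observation with the baseline.

    `tolerance` is an assumed threshold, expressed in standard deviations.
    """
    expected, spread = baseline
    if value > expected + tolerance * spread:
        return "above baseline"
    if value < expected - tolerance * spread:
        return "below baseline"
    return "within baseline"


def should_alert(value: float, baseline: tuple[float, float],
                 tolerance: float = 3.0) -> bool:
    """Raise an alert for any observation outside the expected range."""
    return classify(value, baseline, tolerance) != "within baseline"


if __name__ == "__main__":
    # Hypothetical history, for example deployments completed per week.
    baseline = build_baseline([100, 102, 98, 101, 99])
    for observed in (103, 110, 90):
        print(observed, classify(observed, baseline),
              "alert" if should_alert(observed, baseline) else "ok")
```

A fixed statistical threshold is easy to reason about but ignores daily or weekly cycles. Where activity follows such patterns, a baseline built per time window, or a managed anomaly detection feature, may be a better fit.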