This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/operational-excellence.html

OPS 9: How do you understand the health of your operations?

Define, capture, and analyze operations metrics to gain visibility into operations events so that you can take appropriate action.
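
As a rough illustration of what defining and capturing an operations metric might look like, the sketch below builds a single data point in the shape accepted by the CloudWatch `PutMetricData` API. The namespace `Example/Operations`, the metric name `DeploymentDuration`, the dimension values, and the helper `build_metric_datum` are all hypothetical and chosen only for illustration. The publish call itself is left as a comment because it needs the boto3 package and AWS credentials.

```python
from datetime import datetime, timezone

# Hypothetical namespace for illustration; not taken from the framework text.
NAMESPACE = "Example/Operations"


def build_metric_datum(name, value, unit, dimensions):
    """Return one entry for the MetricData list of a PutMetricData request."""
    return {
        "MetricName": name,
        # Sort dimensions by name so the output is deterministic.
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


# Hypothetical operations metric: how long a deployment took.
datum = build_metric_datum(
    name="DeploymentDuration",
    value=312.5,
    unit="Seconds",
    dimensions={"Team": "payments", "Environment": "prod"},
)

# With boto3 installed and credentials configured, the datum could be published with:
#   boto3.client("cloudwatch").put_metric_data(Namespace=NAMESPACE, MetricData=[datum])
print(datum["MetricName"], datum["Value"], datum["Unit"])
```

Dimensions such as team or environment make it possible to analyze the same metric per operations activity later on.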

Resources

Build a Monitoring Plan
Detect and React to Changes in Pipeline State with Amazon CloudWatch Events
AWS Answers: Centralized Logging

Improvement Plan

  • Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine operations success.

  • Define operations metrics: Define operations metrics to measure the achievement of KPIs and the health of operations and their activities. Evaluate these metrics to determine whether operations are achieving desired outcomes and to understand the health of the operations.
    Publish custom metrics
    Searching and filtering log data
    Amazon CloudWatch metrics and dimensions reference

  • Collect and analyze operations metrics: Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
    Using Amazon CloudWatch metrics
    Amazon CloudWatch metrics and dimensions reference
    Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent

  • Establish operations metrics baselines: Establish baselines for operations metrics to provide expected values as a basis for comparison, so that under- and over-performing operations activities can be identified.

  • Learn expected patterns of activity for operations: Establish patterns of operations activity to determine when behavior is outside of the expected values so that you can respond appropriately if required.

  • Alert when operations outcomes are at risk: Raise an alert when operations outcomes are at risk so that you can respond appropriately if required.
    What is Amazon CloudWatch Events?
    Creating Amazon CloudWatch alarms
    Invoking Lambda functions using Amazon SNS notifications

  • Alert when operations anomalies are detected: Raise an alert when operations anomalies are detected so that you can respond appropriately if required.
    What is Amazon CloudWatch Events?
    Creating Amazon CloudWatch alarms
    Invoking Lambda functions using Amazon SNS notifications

  • Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business-level view of your operations activities to help you determine whether you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary.
    Using Amazon CloudWatch dashboards
    What is log analytics?
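
The baseline and alerting practices in the improvement plan above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how CloudWatch alarms work internally: the sample values, the three-standard-deviation rule, the fixed threshold of 150, and the function names `build_baseline`, `is_anomaly`, and `outcome_at_risk` are all hypothetical.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as a mean and a sample standard deviation."""
    return {"mean": mean(samples), "stdev": stdev(samples)}


def is_anomaly(value, baseline, max_deviations=3.0):
    """Flag a value more than max_deviations standard deviations from the mean."""
    if baseline["stdev"] == 0:
        # A flat history means any different value is outside the expected pattern.
        return value != baseline["mean"]
    return abs(value - baseline["mean"]) / baseline["stdev"] > max_deviations


def outcome_at_risk(value, threshold):
    """Flag a value that breaches a fixed threshold derived from a KPI."""
    return value > threshold


# Illustrative history of an operations metric, e.g. minutes to resolve a ticket.
history = [120, 118, 125, 122, 119, 121, 124, 120]
baseline = build_baseline(history)

for latest in (123, 180):
    alerts = []
    if is_anomaly(latest, baseline):
        alerts.append("anomaly detected")
    if outcome_at_risk(latest, threshold=150):
        alerts.append("outcome at risk")
    print(latest, alerts or ["within expected range"])
```

The two checks are kept separate on purpose: a fixed threshold reflects a business outcome, while the baseline comparison catches unexpected behavior even when no threshold has been breached.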