This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/operational-excellence.html

OPS 8: How do you understand the health of your workload?

Define, capture, and analyze workload metrics to gain visibility into workload events so that you can take appropriate action.

Resources

Build a Monitoring Plan
Creating Amazon CloudWatch Alarms
AWS Answers: Centralized Logging

Improvement Plan

Identify key performance indicators

  • Identify key performance indicators: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine workload success.
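As a rough illustration of evaluating a KPI against a business target, the sketch below computes an order success rate and checks it against a minimum acceptable value. The `Kpi` class, the KPI name, the 99% target, and the counts are all hypothetical and not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    """A hypothetical KPI with a minimum acceptable value (as a fraction)."""
    name: str
    target: float

    def evaluate(self, successes: int, attempts: int):
        """Return (observed value, whether the target is met)."""
        if attempts == 0:
            # No traffic means the KPI cannot be shown to be met.
            return 0.0, False
        value = successes / attempts
        return value, value >= self.target


# Illustrative numbers only.
order_success = Kpi(name="order_success_rate", target=0.99)
value, met = order_success.evaluate(successes=9_950, attempts=10_000)
print(f"{order_success.name}: {value:.2%} (target met: {met})")
```

Tying the KPI to a stated target keeps the evaluation of workload success explicit rather than a matter of judgment.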

Define workload metrics

  • Define workload metrics: Define workload metrics to measure the achievement of KPIs and the health of the workload and its individual components. Evaluate these metrics to determine whether the workload is achieving desired outcomes and to understand its health.
    Publish custom metrics
    Searching and filtering log data
    Amazon CloudWatch metrics and dimensions reference
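One way to define a custom workload metric is to describe it as a namespace, a metric name, a unit, and a set of dimensions. The sketch below only builds a dictionary in the shape accepted by the CloudWatch `PutMetricData` API; it does not call AWS. The namespace, metric name, dimension names, and value are hypothetical.

```python
from datetime import datetime, timezone


def build_metric_payload(namespace, name, value, unit, dimensions):
    """Build a dict shaped like the keyword arguments of PutMetricData."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                # Dimensions identify which component the datapoint belongs to.
                "Dimensions": [
                    {"Name": key, "Value": val}
                    for key, val in sorted(dimensions.items())
                ],
                "Timestamp": datetime.now(timezone.utc),
                "Value": float(value),
                "Unit": unit,
            }
        ],
    }


# Hypothetical metric for a checkout component.
payload = build_metric_payload(
    namespace="ExampleShop/Checkout",
    name="CheckoutLatency",
    value=182.4,
    unit="Milliseconds",
    dimensions={"Service": "checkout", "Stage": "prod"},
)
print(payload["MetricData"][0]["MetricName"])
```

With credentials configured, a payload like this could be passed to `boto3.client("cloudwatch").put_metric_data(**payload)`; that call is left out here so the sketch runs without an AWS account.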

Collect and analyze workload metrics

  • Collect and analyze workload metrics: Perform regular proactive reviews of metrics to identify trends and determine where appropriate responses are needed.
    Using Amazon CloudWatch metrics
    Amazon CloudWatch metrics and dimensions reference
    Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent
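A proactive review often comes down to asking whether a metric is drifting over time. The following sketch fits a least-squares line to a series of daily values and reports the slope; the latency figures are made-up sample data, and a real review would read them from your metrics store.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (change per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two datapoints to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical daily p95 latency in milliseconds over one week.
daily_p95_latency_ms = [210, 214, 219, 225, 228, 236, 241]
slope = trend_slope(daily_p95_latency_ms)
print(f"p95 latency is changing by about {slope:.1f} ms per day")
```

A persistently positive slope on a latency metric is the kind of trend that may warrant a response before any threshold is breached.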

Establish workload metrics baselines

  • Establish baselines for workload metrics: Establish baselines for workload metrics to provide expected values as the basis for comparison.
    Creating Amazon CloudWatch alarms
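A baseline can be as simple as an expected range derived from historical samples. This sketch uses the sample mean plus or minus a multiple of the standard deviation; the choice of two standard deviations and the sample values are assumptions for illustration, not a recommendation from the framework.

```python
import statistics


def baseline_band(samples, width=2.0):
    """Return (lower, upper) expected values: mean +/- width * stdev."""
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)  # requires at least two samples
    return mean - width * spread, mean + width * spread


# Hypothetical requests-per-second samples from a normal period.
history = [100, 102, 98, 101, 99]
lower, upper = baseline_band(history)
print(f"expected range: {lower:.1f} to {upper:.1f}")
```

The resulting band gives later measurements something concrete to be compared against.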

Learn expected patterns of activity for workload

  • Learn expected patterns of activity for workload: Establish patterns of workload activity to determine when behavior is outside of the expected values so that you can respond appropriately if required.
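Workload activity often follows a daily cycle, so one simple way to capture an expected pattern is to average historical values per hour of day. The sketch below assumes samples arrive as `(hour_of_day, value)` pairs; that input shape and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def hourly_profile(samples):
    """Map each hour of day to the mean of the values observed in that hour."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: fmean(values) for hour, values in sorted(by_hour.items())}


# Hypothetical request counts tagged with the hour they were observed.
observations = [(9, 120), (9, 130), (14, 300), (14, 320), (3, 10), (3, 14)]
profile = hourly_profile(observations)
print(profile)
```

Comparing a new reading with the profile for the same hour avoids flagging normal peak traffic as unusual.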

Alert when workload outcomes are at risk

  • Alert when workload outcomes are at risk: Raise an alert when workload outcomes are at risk so that you can respond appropriately if required.
    What is Amazon CloudWatch Events?
    Creating Amazon CloudWatch alarms
    Invoking Lambda functions using Amazon SNS notifications
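To make threshold alerting concrete, the sketch below evaluates a metric in the spirit of a CloudWatch alarm: it looks at the most recent evaluation periods and reports an alarm when enough of them breach the threshold. This is a simplified model written for illustration; it is not the CloudWatch implementation and ignores details such as missing-data handling. The function name, parameters, and sample values are assumptions.

```python
def alarm_state(datapoints, threshold, evaluation_periods=3, datapoints_to_alarm=2):
    """Return 'ALARM', 'OK', or 'INSUFFICIENT_DATA' for the latest datapoints."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical error-rate percentages, one per period.
error_rates = [1, 2, 9, 9, 1]
state = alarm_state(error_rates, threshold=5)
print(state)
```

Requiring several breaching datapoints rather than one reduces alerts caused by a single noisy reading, at the cost of a slightly slower response.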

Alert when workload anomalies are detected

  • Alert when workload anomalies are detected: Raise an alert when workload anomalies are detected so that you can respond appropriately if required.
    What is Amazon CloudWatch Events?
    Creating Amazon CloudWatch alarms
    Invoking Lambda functions using Amazon SNS notifications
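Anomaly alerting differs from threshold alerting in that the comparison is against learned behavior rather than a fixed limit. A minimal sketch, assuming a z-score test against historical samples, is shown below. The z-score approach, the cutoff of 3.0, and the data are illustrative assumptions, not a method prescribed by the framework.

```python
import statistics


def find_anomalies(history, recent, z_threshold=3.0):
    """Return the recent values that deviate strongly from historical behavior."""
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)  # requires at least two samples
    if spread == 0:
        # History is perfectly flat, so any different value is unexpected.
        return [value for value in recent if value != mean]
    return [value for value in recent if abs(value - mean) / spread > z_threshold]


# Hypothetical requests-per-second history and new readings.
history = [100, 102, 98, 101, 99]
anomalies = find_anomalies(history, recent=[100, 104, 120])
print(anomalies)
```

Any value returned here would be a candidate for raising an alert, for example through a notification topic.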

Validate the achievement of outcomes and the effectiveness of KPIs and metrics

  • Validate the achievement of outcomes and the effectiveness of KPIs and metrics: Create a business-level view of your workload operations to help you determine whether you are satisfying needs and to identify areas that need improvement to reach business goals. Validate the effectiveness of KPIs and metrics and revise them if necessary.
    Using Amazon CloudWatch dashboards
    What is log analytics?