This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 6: How do you monitor workload resources?

Logs and metrics are powerful tools to gain insight into the health of your workload. You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur. Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.

Resources

Using Amazon CloudWatch Metrics
Publishing Custom Metrics
Using Amazon CloudWatch Dashboards
Using Canaries (Amazon CloudWatch Synthetics)
Amazon CloudWatch Logs Insights Sample Queries
AWS Systems Manager Automation
What is AWS X-Ray?
Debugging with Amazon CloudWatch Synthetics and AWS X-Ray
The Amazon Builders' Library: Instrumenting distributed systems for operational visibility

Improvement Plan

Monitor all components for the workload (Generation)

  • Enable logging where available: AWS has monitoring and log information available for consumption. Monitoring and logs can be used to define alerts, change, and recovery processes
  • Consume all default metrics: Every service generates default metrics. Evaluate the metrics to decide which metrics on each service need alerts.
    AWS Services That Publish CloudWatch Metrics
  • CloudWatch Synthetics enables you to set up canary tests
    Amazon CloudWatch Logs Insights Sample Queries
  • Create custom metrics for your own use: AWS won't generate some metrics and combinations of metrics, but you can create them using custom metrics
    Publish custom metrics
  • Aggregate your logs: Log aggregation gives you a single place where you can look at log data and set alerts
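
The custom-metric step above can be illustrated with a minimal sketch. It assumes Python with boto3 available at call time; the namespace, metric name, and dimension are hypothetical values chosen for illustration. The request is built as a plain dictionary so it can be inspected before anything is sent to CloudWatch.

```python
from datetime import datetime, timezone


def build_custom_metric(namespace, name, value, unit="Count", dimensions=None):
    """Build keyword arguments for the CloudWatch PutMetricData API."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
    }
    if dimensions:
        # Sort so the generated list is deterministic.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(request):
    """Send the request with boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_data(**request)


# Hypothetical namespace, metric name, and dimension.
request = build_custom_metric(
    "ExampleApp/Orders",
    "OrdersPlaced",
    3,
    dimensions={"Environment": "prod"},
)
```

Calling `publish(request)` would send the data point; keeping the builder separate from the API call makes the payload easy to unit test.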

Define and calculate metrics (Aggregation)

  • Define and calculate metrics (Aggregation): Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps
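
As a rough illustration of the two calculations named above (a count of a specific log event, and latency derived from log event timestamps), the following sketch works on a small in-memory list. The log line format and the START/END/ERROR event names are assumptions made for this example, not a format that AWS prescribes.

```python
from datetime import datetime, timedelta

# Hypothetical log lines: ISO timestamp, request id, event name.
LOG_LINES = [
    "2022-03-31T12:00:00.000 req-1 START",
    "2022-03-31T12:00:00.250 req-1 END",
    "2022-03-31T12:00:01.000 req-2 START",
    "2022-03-31T12:00:01.100 req-2 ERROR",
    "2022-03-31T12:00:01.400 req-2 END",
]


def parse(line):
    """Split one log line into (timestamp, request id, event name)."""
    stamp, request_id, event = line.split()
    return datetime.fromisoformat(stamp), request_id, event


def count_events(lines, event_name):
    """Count log lines whose event name matches event_name."""
    return sum(1 for line in lines if parse(line)[2] == event_name)


def latencies_ms(lines):
    """Return whole milliseconds between START and END for each request id."""
    starts, result = {}, {}
    for line in lines:
        stamp, request_id, event = parse(line)
        if event == "START":
            starts[request_id] = stamp
        elif event == "END" and request_id in starts:
            elapsed = stamp - starts.pop(request_id)
            result[request_id] = elapsed // timedelta(milliseconds=1)
    return result


error_count = count_events(LOG_LINES, "ERROR")
latency_by_request = latencies_ms(LOG_LINES)
```

In a real workload the same calculations would more likely be expressed as filters applied to stored log data; the sketch only shows the arithmetic involved.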

Send notifications (Real-time processing and alarming)

  • Perform real-time processing and alarming: Send notifications to the organizations that need to know when significant events occur
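
One way to send such notifications is a CloudWatch alarm whose action is an Amazon SNS topic. The sketch below builds the parameters for the PutMetricAlarm API; the alarm name, metric, threshold, evaluation settings, and topic ARN are hypothetical and would need tuning for a real workload.

```python
def build_alarm(alarm_name, namespace, metric_name, threshold, topic_arn):
    """Build keyword arguments for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 60,                # seconds per evaluated data point
        "EvaluationPeriods": 3,      # consecutive periods that must breach
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],  # SNS topic notified on ALARM
    }


def create_alarm(request):
    """Create the alarm with boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**request)


# Hypothetical names and ARN, for illustration only.
alarm_request = build_alarm(
    "orders-error-rate",
    "ExampleApp/Orders",
    "OrderErrors",
    5,
    "arn:aws:sns:us-east-1:123456789012:ops-notifications",
)
```

With these assumed settings the alarm fires after three consecutive one-minute periods at or above the threshold, which trades a little detection delay for fewer false notifications.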

Automate responses (Real-time processing and alarming)

  • Use AWS Systems Manager to perform automated actions: AWS Config continuously monitors and records your AWS resource configurations, and can trigger AWS Systems Manager Automation to remediate issues
    AWS Systems Manager Automation
  • Amazon CloudWatch sends alarm state change events to Amazon EventBridge. Create EventBridge rules to automate responses
    Creating an EventBridge Rule That Triggers on an Event from an AWS Resource
  • Create and execute a plan to automate responses
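
The EventBridge step above could be sketched as follows. The event pattern matches CloudWatch alarm state change events that enter the ALARM state; the rule name, target ID, and target ARN are hypothetical, and the target might be, for example, a Lambda function that performs the automated response.

```python
import json


def build_alarm_rule(rule_name):
    """Build keyword arguments for the EventBridge PutRule API."""
    pattern = {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"value": ["ALARM"]}},
    }
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }


def build_targets(rule_name, target_id, target_arn):
    """Build keyword arguments for the EventBridge PutTargets API."""
    return {"Rule": rule_name, "Targets": [{"Id": target_id, "Arn": target_arn}]}


def apply(rule_request, targets_request):
    """Create the rule and attach its target (boto3 imported lazily)."""
    import boto3

    events = boto3.client("events")
    events.put_rule(**rule_request)
    events.put_targets(**targets_request)


# Hypothetical rule name and target, for illustration only.
rule_request = build_alarm_rule("respond-to-alarms")
targets_request = build_targets(
    "respond-to-alarms",
    "remediation-function",
    "arn:aws:lambda:us-east-1:123456789012:function:remediate",
)
```

A Lambda target also needs a resource-based permission that allows EventBridge to invoke it; that step is omitted here.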

Storage and Analytics

  • CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs
    Analyzing Log Data with CloudWatch Logs Insights
    Amazon CloudWatch Logs Insights Sample Queries
  • Use Amazon CloudWatch Logs to send logs to Amazon S3, where you can use Amazon Athena to query the data
    How do I analyze my Amazon S3 server access logs using Athena?
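
A CloudWatch Logs Insights search can also be started programmatically. The sketch below builds the parameters for the StartQuery API and polls for results; the log group name, the one-hour window, and the ERROR pattern are assumptions chosen for illustration.

```python
import time


def build_error_query(log_group, end_epoch, window_seconds=3600, pattern="ERROR"):
    """Build keyword arguments for the CloudWatch Logs StartQuery API."""
    query = (
        "fields @timestamp, @message "
        f"| filter @message like /{pattern}/ "
        "| sort @timestamp desc "
        "| limit 20"
    )
    return {
        "logGroupName": log_group,
        "startTime": end_epoch - window_seconds,  # epoch seconds
        "endTime": end_epoch,
        "queryString": query,
    }


def run_query(request, poll_seconds=1.0):
    """Start the query and wait for it to finish (boto3 imported lazily)."""
    import boto3

    logs = boto3.client("logs")
    query_id = logs.start_query(**request)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


# Hypothetical log group and a fixed end time so the example is repeatable.
query_request = build_error_query("/example/app", end_epoch=1_648_728_000)
```

Passing the end time explicitly keeps the builder deterministic; in practice it would usually be `int(time.time())`.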

Conduct reviews regularly

  • Create multiple dashboards for the workload: You must have a top-level dashboard that contains the key business metrics, as well as the technical metrics you have identified to be the most relevant to the projected health of the workload as usage varies. You should also have dashboards for various application tiers and dependencies that can be inspected
    Using Amazon CloudWatch Dashboards
  • Schedule and conduct regular reviews of the workload dashboards: Conduct regular inspection of the dashboards. You may have different cadences for the depth at which you inspect
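
A top-level dashboard that combines a business metric with a technical metric might be defined as in the sketch below, which builds the parameters for the CloudWatch PutDashboard API. The dashboard name, metric names, dimension values, and Region are hypothetical.

```python
import json


def metric_widget(title, namespace, metric, dim_name, dim_value, region, y):
    """Describe one metric widget in the CloudWatch dashboard body format."""
    return {
        "type": "metric",
        "x": 0,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": [[namespace, metric, dim_name, dim_value]],
            "stat": "Average",
            "period": 300,
            "region": region,
        },
    }


def build_dashboard(name, widgets):
    """Build keyword arguments for the CloudWatch PutDashboard API."""
    return {"DashboardName": name, "DashboardBody": json.dumps({"widgets": widgets})}


def put_dashboard(request):
    """Create or update the dashboard (boto3 imported lazily)."""
    import boto3

    boto3.client("cloudwatch").put_dashboard(**request)


# Hypothetical business and technical metrics for a top-level view.
dashboard_request = build_dashboard(
    "workload-overview",
    [
        metric_widget("Orders placed", "ExampleApp/Orders", "OrdersPlaced",
                      "Environment", "prod", "us-east-1", y=0),
        metric_widget("Web tier CPU", "AWS/EC2", "CPUUtilization",
                      "InstanceId", "i-0123456789abcdef0", "us-east-1", y=6),
    ],
)
```

Dashboards for individual application tiers and dependencies could reuse `metric_widget` with different metrics, so each view is defined in code and can be reviewed alongside the workload.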

Monitor end-to-end tracing of requests through your system

  • Monitor end-to-end tracing of requests through your system: AWS X-Ray is a service that collects data about requests that your application serves, and provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization. For any traced request to your application, you can see detailed information not only about the request and response, but also about calls that your application makes to downstream AWS resources, microservices, databases and HTTP web APIs
    What is AWS X-Ray?
    Debugging with Amazon CloudWatch Synthetics and AWS X-Ray