This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 6: How do you monitor workload resources?

Logs and metrics are powerful tools to gain insight into the health of your workload. You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur. Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.

Resources

Using Amazon CloudWatch Metrics
Publishing Custom Metrics
Using Amazon CloudWatch Dashboards
Using Canaries (Amazon CloudWatch Synthetics)
Amazon CloudWatch Logs Insights Sample Queries
AWS Systems Manager Automation
What is AWS X-Ray?
Debugging with Amazon CloudWatch Synthetics and AWS X-Ray
The Amazon Builders' Library: Instrumenting distributed systems for operational visibility

Best Practices:

Monitor all components for the workload (Generation): Monitor the components of the workload with Amazon CloudWatch or third-party tools. Monitor AWS services with Personal Health Dashboard
Define and calculate metrics (Aggregation): Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps
Send notifications (Real-time processing and alarming): Organizations that need to know receive notifications when significant events occur
Automate responses (Real-time processing and alarming): Use automation to take action when an event is detected, for example, to replace failed components
Storage and Analytics: Collect log files and metrics histories and analyze these for broader trends and workload insights
Conduct reviews regularly: Frequently review how workload monitoring is implemented and update it based on significant events and changes
Monitor end-to-end tracing of requests through your system: Use AWS X-Ray or third-party tools so that developers can more easily analyze and debug distributed systems to understand how their applications and underlying services are performing

Improvement Plan

Monitor all components for the workload (Generation)

Enable logging where available: AWS has monitoring and log information available for consumption. Monitoring data and logs can be used to define alerts, change processes, and recovery processes

Define all the AWS services you are using
Enable logging for all services: AWS has logging for many services. If a service does not log at the level you need, you can add logging from your workloads
- Enable logging of Amazon S3
  Amazon S3 Server Access Logging
- Enable logging of Elastic Load Balancing
  Access logs for your application load balancer
  Access Logs for Your Network Load Balancer
  Enable Access Logs for Your Classic Load Balancer
- Enable VPC Flow Logs
  VPC Flow Logs
- Enable CloudTrail logs
  Creating a trail
- Use the Amazon CloudWatch Agent to stream log data from instances to CloudWatch Logs
  Install the CloudWatch agent on an Amazon EC2 instance
- Use the awslogs log driver with Amazon ECS to stream log data to CloudWatch Logs
  Using CloudWatch Logs with container instances
- AWS Lambda automatically streams log data to CloudWatch Logs
  Accessing Amazon CloudWatch Logs for AWS Lambda

Consume all default metrics: Every service generates default metrics. Evaluate the metrics to decide which metrics on each service need alerts.
AWS Services That Publish CloudWatch Metrics

Metrics can be evaluated individually or in aggregate
- Go to the CloudWatch console and explore the metrics collected
- Refer to the documentation for which metrics and dimensions are collected

CloudWatch Synthetics enables you to set up canary tests
Amazon CloudWatch Logs Insights Sample Queries

Create custom metrics for your own use: AWS won't generate some metrics and combinations of metrics, but you can create them using custom metrics
Publish custom metrics

If you need memory usage or disk consumption, use the CloudWatch Agent or PutMetricData API
Monitoring memory and disk metrics for Amazon EC2 Linux instances
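
As a rough illustration of the PutMetricData route, the sketch below measures disk usage with the Python standard library and shapes it as one metric datum. The metric name, namespace, dimension names, and instance ID are placeholders chosen for this example, not values from this guide. The call that would actually publish the datum appears only as a comment, because it assumes boto3 is installed and credentials are configured.

```python
import shutil
from datetime import datetime, timezone


def build_disk_metric(path: str, instance_id: str) -> dict:
    """Build one PutMetricData datum describing disk usage for `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_percent = 100.0 * usage.used / usage.total
    return {
        "MetricName": "DiskUsedPercent",  # hypothetical metric name
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "Path", "Value": path},
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": round(used_percent, 2),
        "Unit": "Percent",
    }


# Placeholder instance ID; on EC2 you would look up the real one.
datum = build_disk_metric("/", "i-0123456789abcdef0")
print(datum["MetricName"], datum["Value"], datum["Unit"])

# Assumption: with boto3 available, the datum could be published like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Custom/Workload",  # hypothetical namespace
#       MetricData=[datum],
#   )
```

The CloudWatch Agent is usually the simpler choice for standard memory and disk metrics. A hand-built datum like this is more useful for application-specific measurements.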

Aggregate your logs: Log aggregation gives you a single place where you can look at log data and set alerts

Use CloudWatch Logs for common log files
- You can use CloudWatch Logs for most common log aggregation use cases
  What are Amazon CloudWatch Logs?
Store all logs in Amazon S3, or in Amazon S3 Glacier for longer term storage
- You can export CloudWatch Logs to Amazon S3. CloudTrail and Elastic Load Balancing logs are sent to Amazon S3
  Exporting log data to Amazon S3

Define and calculate metrics (Aggregation)

Define and calculate metrics (Aggregation): Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps

Metric filters define the terms and patterns to look for in log data as it is sent to CloudWatch Logs. CloudWatch Logs uses these metric filters to turn log data into numerical CloudWatch metrics that you can graph or set an alarm on
Searching and Filtering Log Data
Use a trusted third party to aggregate logs
- Follow the instructions of the third party. Most third-party products integrate with CloudWatch and Amazon S3
Some AWS services can publish logs directly to Amazon S3. This way, if your main requirement for logs is storage in Amazon S3, you can easily have the service producing the logs send them directly to Amazon S3 without setting up additional infrastructure
Sending Logs Directly to Amazon S3
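
To make the two calculations named in this section concrete (a count of a specific log event, and latency derived from log event timestamps), here is a small local sketch. The log format, the event names START, END, and ERROR, and the request IDs are invented for illustration. This is plain Python over sample lines, not the CloudWatch Logs metric filter syntax.

```python
from datetime import datetime

# Made-up log lines: "<ISO timestamp> <EVENT> <request id> [message]"
LOG_LINES = [
    "2024-01-01T12:00:00.000 START req-1",
    "2024-01-01T12:00:00.250 END req-1",
    "2024-01-01T12:00:01.000 START req-2",
    "2024-01-01T12:00:01.100 ERROR req-2 upstream timeout",
    "2024-01-01T12:00:01.900 END req-2",
]


def count_events(lines, term):
    """Count log lines whose event field equals `term`."""
    return sum(1 for line in lines if line.split()[1] == term)


def latencies_ms(lines):
    """Pair START/END events by request ID and return latency per request."""
    started, result = {}, {}
    for line in lines:
        timestamp, event, request_id = line.split()[:3]
        when = datetime.fromisoformat(timestamp)
        if event == "START":
            started[request_id] = when
        elif event == "END" and request_id in started:
            delta = when - started.pop(request_id)
            result[request_id] = round(delta.total_seconds() * 1000)
    return result


print(count_events(LOG_LINES, "ERROR"))  # 1
print(latencies_ms(LOG_LINES))           # {'req-1': 250, 'req-2': 900}
```

A metric filter performs the counting step on the service side as log events arrive, so the resulting numbers can be graphed and alarmed on without a separate processing job.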

Send notifications (Real-time processing and alarming)

Perform real-time processing and alarming: Organizations that need to know receive notifications when significant events occur

Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different Regions
Using Amazon CloudWatch Dashboards
Create an alarm when the metric surpasses a limit
Using Amazon CloudWatch Alarms
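
The idea of alarming when a metric surpasses a limit can be sketched as a small function. This is a simplified model written for this example, not CloudWatch's actual evaluation logic (which also handles missing data and other cases). The parameter names loosely mirror the alarm settings for evaluation periods and datapoints to alarm, and the sample values are made up.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM" if enough of the most recent datapoints exceed threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu_percent = [42.0, 55.0, 91.0, 88.0, 95.0]  # made-up samples, one per period

# Last three periods are 91, 88, 95: all above 80, so the alarm fires.
print(alarm_state(cpu_percent, 80.0, evaluation_periods=3, datapoints_to_alarm=2))

# First three periods are 42, 55, 91: only one breach, so the state stays OK.
print(alarm_state(cpu_percent[:3], 80.0, evaluation_periods=3, datapoints_to_alarm=2))
```

Requiring several breaching datapoints rather than one is a common way to avoid notifications for brief spikes.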

Automate responses (Real-time processing and alarming)

Use AWS Systems Manager to perform automated actions: AWS Config continuously monitors and records your AWS resource configurations, and can trigger AWS Systems Manager Automation to remediate issues
AWS Systems Manager Automation

Create and use Systems Manager Automation documents. These define the actions that Systems Manager performs on your managed instances and other AWS resources when an automation execution runs
Working with Automation Documents (Playbooks)

Amazon CloudWatch sends alarm state change events to Amazon EventBridge. Create EventBridge rules to automate responses
Creating an EventBridge Rule That Triggers on an Event from an AWS Resource
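
As a sketch of what such a rule selects, the pattern below targets CloudWatch alarm state change events whose new state is ALARM. The `matches` helper is a deliberately simplified stand-in written for this example; real EventBridge matching supports more operators than exact values. The sample event is abbreviated and its alarm name is made up.

```python
import json

ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}


def matches(pattern, event):
    """Simplified matcher: every key in the pattern must match the event.

    A list means "the event value is one of these"; a dict recurses into
    the nested object. Keys absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Abbreviated sample event; the alarm name is hypothetical.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "example-high-cpu",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

print(json.dumps(ALARM_PATTERN, indent=2))
print(matches(ALARM_PATTERN, sample_event))  # True
```

The JSON printed by this sketch is the kind of document supplied as a rule's event pattern. The rule's target, for example a Systems Manager Automation document, then performs the response.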

Create and execute a plan to automate responses

Inventory all your alert response procedures: You must plan your alert responses before you rank the tasks
Inventory all the tasks with specific actions that must be taken: Most of these actions are documented in runbooks. You must also have playbooks for alerts of unexpected events
Examine the runbooks and playbooks for all automatable actions: In general, if an action can be defined, it most likely can be automated
Rank the error-prone or time-consuming activities first: It is most beneficial to remove sources of errors and reduce time to resolution
Establish a plan to complete automation: Maintain an active plan to automate and update the automation
Examine manual requirements for opportunities for automation: Challenge your manual process for opportunities to automate

Storage and Analytics

CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs
Analyzing Log Data with CloudWatch Logs Insights
Amazon CloudWatch Logs Insights Sample Queries
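
For illustration, the sketch below assembles the parameters for a Logs Insights query that counts log lines containing "ERROR" in five-minute bins over the last hour. The log group name, the search term, and the one-hour window are assumptions made for this example. The commented call assumes boto3 is installed and credentials are configured.

```python
import time

# Count matching log events per 5-minute bin, busiest bins first.
QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc
""".strip()


def build_start_query_params(log_group, window_seconds, now=None):
    """Build StartQuery parameters covering the last `window_seconds`."""
    end_time = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end_time - window_seconds,  # epoch seconds
        "endTime": end_time,
        "queryString": QUERY,
        "limit": 20,
    }


params = build_start_query_params("/hypothetical/app", 3600)
print(params["queryString"])

# Assumption: with boto3 available, the query could be run like this:
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)
```

Queries run asynchronously, so a real caller would poll the results until the query reports that it has completed.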

Use Amazon CloudWatch Logs to send logs to Amazon S3, where you can use Amazon Athena to query the data
How do I analyze my Amazon S3 server access logs using Athena?

Create an S3 lifecycle policy for your server access logs bucket. Configure the lifecycle policy to periodically remove log files. Doing so reduces the amount of data that Athena analyzes for each query
How Do I Create a Lifecycle Policy for an S3 Bucket?
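
A lifecycle policy of this kind can be sketched as a configuration document. The prefix, the 90-day retention period, and the bucket name in the comment are assumptions for illustration; pick values that match your own retention requirements. The commented call assumes boto3 is installed and credentials are configured.

```python
def build_log_expiration_policy(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"expire-logs-after-{days}-days",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical prefix and retention period.
policy = build_log_expiration_policy("access-logs/", 90)
print(policy)

# Assumption: with boto3 available, the policy could be applied like this:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-log-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=policy,
#   )
```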

Conduct reviews regularly

Create multiple dashboards for the workload: You must have a top-level dashboard that contains the key business metrics, as well as the technical metrics you have identified to be the most relevant to the projected health of the workload as usage varies. You should also have dashboards for various application tiers and dependencies that can be inspected
Using Amazon CloudWatch Dashboards

Schedule and conduct regular reviews of the workload dashboards: Conduct regular inspection of the dashboards. You may have different cadences for the depth at which you inspect

Inspect for trends in the metrics: Compare the metric values to historic values to see if there are trends that may indicate that something needs investigation. Examples of this include: increasing latency, decreasing primary business function, and increasing failure responses
Inspect for outliers/anomalies in your metrics: Averages or medians can mask outliers. Look at the highest and lowest values during the time frame and investigate the causes of extreme scores. As you continue to eliminate these causes, lowering your definition of extreme allows you to continue to improve the consistency of your workload performance
Look for sharp changes in behavior: An immediate change in the quantity or direction of a metric may indicate a change in the application, or external factors that you may need to track with additional metrics
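
The three inspections above can be tried on a small series. The latency values, the factor of 3 for "extreme", and the window and ratio used for change detection are all made up for this sketch. They are one simple way to do these checks, not a prescribed method.

```python
from statistics import mean, median

# Made-up latency samples (ms): one spike, then a sustained shift upward.
latency_ms = [102, 98, 105, 101, 99, 970, 103, 100, 181, 186, 179, 184]


def outliers(values, factor=3.0):
    """Values more than `factor` times the median of the series."""
    limit = factor * median(values)
    return [value for value in values if value > limit]


def trend_ratio(values):
    """Median of the newer half divided by the median of the older half."""
    half = len(values) // 2
    return median(values[half:]) / median(values[:half])


def sharp_changes(values, window=3, ratio=1.5):
    """Indexes where the median of the next window jumps or drops by `ratio`."""
    hits = []
    for i in range(window, len(values) - window + 1):
        before = median(values[i - window:i])
        after = median(values[i:i + window])
        if after > ratio * before or after < before / ratio:
            hits.append(i)
    return hits


# The mean is pulled up by the spike; the median hides it entirely.
print(mean(latency_ms), median(latency_ms), max(latency_ms))
print(outliers(latency_ms))       # [970]
print(trend_ratio(latency_ms))    # above 1 means the newer half is slower
print(sharp_changes(latency_ms))  # indexes near the shift at position 8
```

Medians are used in the window comparisons so that a single spike does not register as a sustained change in behavior.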

Monitor end-to-end tracing of requests through your system

Monitor end-to-end tracing of requests through your system: AWS X-Ray is a service that collects data about requests that your application serves, and provides tools you can use to view, filter, and gain insights into that data to identify issues and opportunities for optimization. For any traced request to your application, you can see detailed information not only about the request and response, but also about calls that your application makes to downstream AWS resources, microservices, databases and HTTP web APIs
What is AWS X-Ray?
Debugging with Amazon CloudWatch Synthetics and AWS X-Ray
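
To give a feel for the data behind a traced request, the sketch below builds a segment document by hand, with one subsegment representing a downstream call. In practice the X-Ray SDK or an instrumentation agent generates these for you. The service name, the downstream call name, and the timings are invented for illustration, and only a minimal set of fields is shown.

```python
import json
import secrets
import time


def new_trace_id(now=None):
    """Trace ID: version 1, 8 hex digits of epoch seconds, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{secrets.token_hex(12)}"


def new_segment(name, trace_id, start, end, subsegments=()):
    """Build a minimal segment document for one service handling a request."""
    segment = {
        "name": name,
        "id": secrets.token_hex(8),  # 16 hex digits
        "trace_id": trace_id,
        "start_time": start,
        "end_time": end,
    }
    if subsegments:
        segment["subsegments"] = list(subsegments)
    return segment


start = time.time()

# Hypothetical downstream call made while serving the request.
downstream = {
    "id": secrets.token_hex(8),
    "name": "DynamoDB",
    "namespace": "aws",
    "start_time": start + 0.010,
    "end_time": start + 0.045,
}

segment = new_segment(
    "example-api",  # hypothetical service name
    new_trace_id(start),
    start,
    start + 0.060,
    subsegments=[downstream],
)
print(json.dumps(segment, indent=2))
```

Because every segment for a request carries the same trace ID, the service can stitch the work done by separate components into a single end-to-end view.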