REL 6: How do you monitor workload resources?
Logs and metrics are powerful tools to gain insight into the health of your workload. You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur. Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.
Resources
Using Amazon CloudWatch Metrics
Publishing Custom Metrics
Using Amazon CloudWatch Dashboards
Using Canaries (Amazon CloudWatch Synthetics)
Amazon CloudWatch Logs Insights Sample Queries
AWS Systems Manager Automation
What is AWS X-Ray?
Debugging with Amazon CloudWatch Synthetics and AWS X-Ray
The Amazon Builders' Library: Instrumenting distributed systems for operational visibility
Best Practices:
- Monitor all components for the workload (Generation): Monitor the components of the workload with Amazon CloudWatch or third-party tools. Monitor AWS services with Personal Health Dashboard
- Define and calculate metrics (Aggregation): Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps
- Send notifications (Real-time processing and alarming): Organizations that need to know receive notifications when significant events occur
- Automate responses (Real-time processing and alarming): Use automation to take action when an event is detected, for example, to replace failed components
- Storage and Analytics: Collect log files and metrics histories and analyze these for broader trends and workload insights
- Conduct reviews regularly: Frequently review how workload monitoring is implemented and update it based on significant events and changes
- Monitor end-to-end tracing of requests through your system: Use AWS X-Ray or third-party tools so that developers can more easily analyze and debug distributed systems and understand how their applications and underlying services are performing
Improvement Plan
Monitor all components for the workload (Generation)
- Define all the AWS services you are using
- Enable logging for all services: AWS has logging for many services. If a service doesn't provide logging at the level you need, you can add logging from your workloads
- Enable logging of Amazon S3
Amazon S3 Server Access Logging
- Enable logging of Elastic Load Balancing
Access logs for your application load balancer
Access Logs for Your Network Load Balancer
Enable Access Logs for Your Classic Load Balancer
- Enable VPC Flow Logs
VPC Flow Logs
- Enable CloudTrail logs
Creating a trail
- Use the Amazon CloudWatch Agent to stream log data from instances to CloudWatch Logs
Install the CloudWatch agent on an Amazon EC2 instance
- Use the awslogs log driver with Amazon ECS to stream log data to CloudWatch Logs
Using CloudWatch Logs with container instances
- AWS Lambda automatically streams log data to CloudWatch Logs
Accessing Amazon CloudWatch Logs for AWS Lambda
AWS Services That Publish CloudWatch Metrics
- Metrics can be evaluated individually or in aggregate
- Go to the CloudWatch console and explore the metrics collected
- Refer to the documentation for which metrics and dimensions are collected
Amazon CloudWatch Logs Insights Sample Queries
Publish custom metrics
- If you need memory usage or disk consumption, use the CloudWatch Agent or PutMetricData API
Monitoring memory and disk metrics for Amazon EC2 Linux instances
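As a rough illustration of publishing a custom metric through the PutMetricData API, the sketch below builds the request arguments in plain Python and sends them with boto3 only when `publish` is called. The namespace `Custom/Workload`, the metric name `MemoryUsedPercent`, and both function names are assumptions made for this example; they do not come from the guidance above. Sending the data requires boto3 and valid AWS credentials.

```python
from datetime import datetime, timezone


def build_memory_metric(instance_id: str, used_percent: float) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricData API.

    The namespace and metric name are hypothetical placeholders.
    """
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    return {
        "Namespace": "Custom/Workload",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "MemoryUsedPercent",  # hypothetical metric name
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": used_percent,
                "Unit": "Percent",
            }
        ],
    }


def publish(payload: dict) -> None:
    """Send the payload to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("cloudwatch").put_metric_data(**payload)
```

A caller might run `publish(build_memory_metric("i-0123456789abcdef0", 72.5))` on a schedule. In many cases the CloudWatch Agent is the simpler option, since it collects memory and disk metrics without custom code.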
- Use CloudWatch Logs for common log files
- You can use CloudWatch Logs for most common log aggregation use cases
What are Amazon CloudWatch Logs?
- Store all logs in Amazon S3, or in Amazon S3 Glacier for longer term storage
- You can export CloudWatch Logs to Amazon S3. CloudTrail and Elastic Load Balancing logs are sent to Amazon S3
Exporting log data to Amazon S3
Define and calculate metrics (Aggregation)
- Metric filters define the terms and patterns to look for in log data as it is sent to CloudWatch Logs. CloudWatch Logs uses these metric filters to turn log data into numerical CloudWatch metrics that you can graph or set an alarm on
Searching and Filtering Log Data
- Use a trusted third party to aggregate logs
- Follow the instructions of the third party. Most third-party products integrate with CloudWatch and Amazon S3
- Some AWS services can publish logs directly to Amazon S3. This way, if your main requirement for logs is storage in Amazon S3, you can easily have the service producing the logs send them directly to Amazon S3 without setting up additional infrastructure
Sending Logs Directly to Amazon S3
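The metric filter described in this section can be sketched with the CloudWatch Logs `put_metric_filter` call in boto3. The example below counts log events that contain the term `ERROR`. The filter name, metric name, and namespace are hypothetical, and the log group is supplied by the caller; treat this as one possible setup rather than a recommended configuration.

```python
def build_error_count_filter(log_group: str) -> dict:
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter API.

    All names below are hypothetical placeholders for illustration.
    """
    return {
        "logGroupName": log_group,
        "filterName": "ErrorCount",  # hypothetical filter name
        "filterPattern": "ERROR",  # matches log events containing ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",  # hypothetical metric name
                "metricNamespace": "Custom/Workload",  # hypothetical namespace
                "metricValue": "1",  # add 1 for every matching log event
                "defaultValue": 0.0,  # report 0 when nothing matches
            }
        ],
    }


def create_filter(payload: dict) -> None:
    """Create the metric filter. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("logs").put_metric_filter(**payload)
```

Setting `defaultValue` to 0 means periods without matching events still report a data point, which tends to make alarms on the resulting metric easier to reason about.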
Send notifications (Real-time processing and alarming)
- Amazon CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different Regions
Using Amazon CloudWatch Dashboards
- Create an alarm when the metric surpasses a limit
Using Amazon CloudWatch Alarms
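To make the alarm step concrete, here is a minimal sketch that builds arguments for the CloudWatch `put_metric_alarm` call and notifies an Amazon SNS topic when the threshold is exceeded. The alarm name, the `Custom/Workload` namespace, the `ErrorCount` metric, and the default threshold of 5 are all assumptions chosen for the example; the topic ARN must be supplied by the caller.

```python
def build_high_error_alarm(topic_arn: str, threshold: float = 5.0) -> dict:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API.

    Metric, namespace, and alarm names are hypothetical placeholders.
    """
    return {
        "AlarmName": "workload-high-error-count",  # hypothetical alarm name
        "Namespace": "Custom/Workload",  # hypothetical namespace
        "MetricName": "ErrorCount",  # hypothetical metric name
        "Statistic": "Sum",
        "Period": 300,  # evaluate over 5-minute periods
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data is not an alarm state
        "AlarmActions": [topic_arn],  # SNS topic notified on ALARM
    }


def create_alarm(payload: dict) -> None:
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**payload)
```

Whether missing data should count as breaching depends on the workload; for a heartbeat-style metric, `breaching` may be the better choice.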
Automate responses (Real-time processing and alarming)
AWS Systems Manager Automation
- Create and use Systems Manager Automation documents. These define the actions that Systems Manager performs on your managed instances and other AWS resources when an automation execution runs
Working with Automation Documents (Playbooks)
Creating an EventBridge Rule That Triggers on an Event from an AWS Resource
- Inventory all your alert response procedures: You must plan your alert responses before you rank the tasks
- Inventory all the tasks with specific actions that must be taken: Most of these actions are documented in runbooks. You must also have playbooks for alerts of unexpected events
- Examine the runbooks and playbooks for all automatable actions: In general, if an action can be defined, it most likely can be automated
- Rank the error-prone or time-consuming activities first: It is most beneficial to remove sources of errors and reduce time to resolution
- Establish a plan to complete automation: Maintain an active plan to automate and update the automation
- Examine manual requirements for opportunities for automation: Challenge your manual process for opportunities to automate
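One way to connect a detected event to an automated response is an Amazon EventBridge rule. The sketch below builds arguments for the EventBridge `put_rule` call with an event pattern that matches EC2 instances entering the `stopped` state. The rule name is supplied by the caller and the choice of event is an assumption for illustration; attaching a target, such as a Systems Manager Automation document, would be a separate `put_targets` call that is not shown here.

```python
import json


def build_stopped_instance_rule(rule_name: str) -> dict:
    """Build keyword arguments for the EventBridge PutRule API.

    The pattern matches EC2 state-change events for stopped instances.
    """
    pattern = {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["stopped"]},
    }
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),  # the API expects a JSON string
        "State": "ENABLED",
    }


def create_rule(payload: dict) -> None:
    """Create or update the rule. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("events").put_rule(**payload)
```

Starting with a narrow pattern like this one keeps the automation from firing on events the runbook was never written for.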
Storage and Analytics
Analyzing Log Data with CloudWatch Logs Insights
Amazon CloudWatch Logs Insights Sample Queries
How do I analyze my Amazon S3 server access logs using Athena?
- Create an S3 lifecycle policy for your server access logs bucket. Configure the lifecycle policy to periodically remove log files. Doing so reduces the amount of data that Athena analyzes for each query
How Do I Create a Lifecycle Policy for an S3 Bucket?
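A lifecycle policy that expires old access logs could look like the following sketch, which builds a configuration for the S3 `put_bucket_lifecycle_configuration` call in boto3. The rule ID format, the prefix, and the retention period are assumptions supplied by the caller; choose a retention period that satisfies your own audit and analysis requirements.

```python
def build_log_expiration_policy(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under a prefix.

    The rule ID format is a hypothetical naming choice.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"expire-access-logs-after-{days}-days",
                "Filter": {"Prefix": prefix},  # only objects under this prefix
                "Status": "Enabled",
                "Expiration": {"Days": days},  # delete objects older than this
            }
        ]
    }


def apply_policy(bucket: str, policy: dict) -> None:
    """Apply the policy to a bucket. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder runs without boto3 installed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=policy
    )
```

Note that applying a lifecycle configuration replaces any existing configuration on the bucket, so existing rules need to be included in the same call.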
Conduct reviews regularly
Using Amazon CloudWatch Dashboards
- Inspect for trends in the metrics: Compare the metric values to historic values to see if there are trends that may indicate something that needs investigation. Examples of this include: increasing latency, declining primary business function, and increasing failure responses
- Inspect for outliers/anomalies in your metrics: Averages or medians can mask outliers. Look at the highest and lowest values during the time frame and investigate the causes of extreme scores. As you continue to eliminate these causes, lowering your definition of extreme allows you to continue to improve the consistency of your workload performance
- Look for sharp changes in behavior: An immediate change in quantity or direction of a metric may indicate that there has been a change in the application, or that external factors are at work that you may need to track with additional metrics
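The point that averages or medians can mask outliers is easy to see with a small calculation. The sketch below is plain Python with made-up latency values; the function names and the "three times the median" cutoff are illustrative assumptions, not a recommended threshold.

```python
from statistics import mean, median


def summarize(values: list[float]) -> dict:
    """Return central and extreme statistics for a metric series."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def find_outliers(values: list[float], factor: float = 3.0) -> list[float]:
    """Return values greater than factor times the median (arbitrary cutoff)."""
    cutoff = factor * median(values)
    return [v for v in values if v > cutoff]


# Made-up sample: 99 requests at 100 ms and one request at 5000 ms.
latencies_ms = [100] * 99 + [5000]
summary = summarize(latencies_ms)
slow_requests = find_outliers(latencies_ms)
```

For this sample the median stays at 100 ms and the mean is 149 ms, while the maximum is 5000 ms, so only the extreme value reveals the slow request. Lowering `factor` over time corresponds to tightening the definition of extreme as causes are eliminated.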
Monitor end-to-end tracing of requests through your system
What is AWS X-Ray?
Debugging with Amazon CloudWatch Synthetics and AWS X-Ray