This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/operational-excellence.html

OPS 4: How do you design your workload so that you can understand its state?

Design your workload so that it provides the information necessary across all components (for example, metrics, logs, and traces) for you to understand its internal state. This enables you to provide effective responses when appropriate.

Resources

Gaining Better Observability of Your VMs with Amazon CloudWatch
Application Performance Management on AWS
Amazon CloudWatch Documentation

Best Practices:

Implement application telemetry: Instrument your application code to emit information about its internal state, status, and achievement of business outcomes (for example, queue depth, error messages, and response times). Use this information to determine when a response is required.
Implement and configure workload telemetry: Design and configure your workload to emit information about its internal state and current status (for example, API call volume, HTTP status codes, and scaling events). Use this information to help determine when a response is required.
Implement user activity telemetry: Instrument your application code to emit information about user activity, for example, click streams, or started, abandoned, and completed transactions. Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required.
Implement dependency telemetry: Design and configure your workload to emit information about the status (for example, reachability or response time) of resources it depends on. Examples of external dependencies include external databases, DNS, and network connectivity. Use this information to determine when a response is required.
Implement transaction traceability: Instrument your application code and configure your workload components to emit information about the flow of transactions across the workload. Use this information to determine when a response is required and to help you identify the factors contributing to an issue.

Improvement Plan

Implement application telemetry

Implement log and metric telemetry: Instrument your application code to emit information about its internal state, status, and the achievement of business outcomes. Use this information to determine when a response is required.
Gaining better observability of your VMs with Amazon CloudWatch - AWS Online Tech Talks
How Amazon CloudWatch works
What is Amazon CloudWatch?
Using Amazon CloudWatch metrics
What is Amazon CloudWatch Logs?

Implement application telemetry: Design your application code to emit information about its internal state, status, and achievement of business outcomes (for example, queue depth, error messages, and response times).
Collect metrics and logs from Amazon EC2 Instances and on-premises servers with the CloudWatch Agent
Using CloudWatch Logs with container instances
Accessing Amazon CloudWatch Logs for AWS Lambda
Publish custom metrics
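
As one way to picture application telemetry, the following minimal sketch builds a structured log line in the CloudWatch embedded metric format that carries queue depth and response time. The namespace ExampleApp, the service name checkout, and the helper names emf_record and emit are assumptions made for illustration. The sketch also assumes that standard output is forwarded to CloudWatch Logs (for example, by the CloudWatch agent or by AWS Lambda); it does not call any AWS API itself.

```python
import json
import time


def emf_record(namespace, service, metrics, timestamp_ms=None):
    """Build one embedded metric format (EMF) log record.

    `metrics` maps a metric name to a (value, unit) pair.
    All names used here are hypothetical examples.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["Service"]],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_value, unit) in metrics.items()
                    ],
                }
            ],
        },
        # The dimension value and the metric values live at the top level.
        "Service": service,
    }
    for name, (value, _unit) in metrics.items():
        record[name] = value
    return record


def emit(record):
    """Write the record to stdout as a single JSON line and return that line."""
    line = json.dumps(record)
    print(line)
    return line


if __name__ == "__main__":
    emit(emf_record("ExampleApp", "checkout", {
        "QueueDepth": (12, "Count"),
        "ResponseTime": (183.5, "Milliseconds"),
    }))
```

Writing metrics as log lines keeps the application free of network calls on the request path; publishing through the CloudWatch API instead is an equally valid choice.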

Implement and configure workload telemetry

Implement log and metric telemetry: Instrument your workload to emit information about its internal state, status, and the achievement of business outcomes. Use this information to determine when a response is required.
Gaining better observability of your VMs with Amazon CloudWatch - AWS Online Tech Talks
How Amazon CloudWatch works
What is Amazon CloudWatch?
Using Amazon CloudWatch metrics
What is Amazon CloudWatch Logs?

Implement and configure workload telemetry: Design and configure your workload to emit information about its internal state and current status (for example, API call volume, HTTP status codes, and scaling events).
Amazon CloudWatch metrics and dimensions reference
AWS CloudTrail
What Is AWS CloudTrail?
VPC Flow Logs
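
To make the workload telemetry examples concrete, the following sketch summarizes a batch of HTTP status codes into API call volume, counts per status class, and a 5xx error rate. The function names and the choice of the 5xx rate as the signal are assumptions for illustration; in practice these values often come from service-emitted metrics rather than from your own code.

```python
from collections import Counter


def status_class(status_code):
    """Map an HTTP status code such as 503 to its class label, '5xx'."""
    return f"{status_code // 100}xx"


def summarize(status_codes):
    """Return call volume, per-class counts, and the 5xx error rate."""
    counts = Counter(status_class(code) for code in status_codes)
    total = sum(counts.values())
    error_rate = counts.get("5xx", 0) / total if total else 0.0
    return {
        "call_volume": total,
        "by_class": dict(counts),
        "error_rate_5xx": error_rate,
    }


if __name__ == "__main__":
    print(summarize([200, 200, 404, 500]))
```

A summary like this can be compared against a threshold to decide when a response is required.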

Implement user activity telemetry

Implement user activity telemetry: Design your application code to emit information about user activity (for example, click streams, or started, abandoned, and completed transactions). Use this information to help understand how the application is used, patterns of usage, and to determine when a response is required.
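
The started, abandoned, and completed transaction states can be sketched as structured events, with a simple completion rate derived from them. The event fields and the function names activity_event and completion_rate are hypothetical; this is an illustration of the idea, not a prescribed schema.

```python
import time
from collections import Counter

VALID_STATES = {"started", "abandoned", "completed"}


def activity_event(user_id, transaction_id, state, timestamp=None):
    """Build one user activity event for a transaction state change."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown transaction state: {state}")
    return {
        "event_type": "transaction",
        "user_id": user_id,
        "transaction_id": transaction_id,
        "state": state,
        "timestamp": time.time() if timestamp is None else timestamp,
    }


def completion_rate(events):
    """Fraction of started transactions that were completed."""
    states = Counter(event["state"] for event in events)
    started = states["started"]
    return states["completed"] / started if started else 0.0


if __name__ == "__main__":
    events = [
        activity_event("user-1", "txn-1", "started"),
        activity_event("user-2", "txn-2", "started"),
        activity_event("user-1", "txn-1", "completed"),
        activity_event("user-2", "txn-2", "abandoned"),
    ]
    print(completion_rate(events))
```

A sudden drop in the completion rate is one example of a usage pattern that may call for a response.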

Implement dependency telemetry

Implement dependency telemetry: Design and configure your workload to emit information about the state and status of systems it depends on. Examples include external databases, DNS, network connectivity, and external credit card processing services.
Amazon CloudWatch Agent with AWS Systems Manager integration - unified metrics & log collection for Linux & Windows
Collect metrics and logs from Amazon EC2 instances and on-premises servers with the CloudWatch Agent
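
Reachability and response time of a dependency can be measured with a simple TCP connection probe, sketched below using only the Python standard library. The function name probe_tcp, the result fields, and the host db.example.internal are assumptions for illustration. A TCP connect confirms only that the endpoint accepts connections, not that the dependency is healthy at the application level.

```python
import socket
import time


def probe_tcp(host, port, timeout=2.0):
    """Attempt a TCP connection and report reachability and connect time."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
            error = None
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS resolution failures.
        reachable = False
        error = str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "dependency": f"{host}:{port}",
        "reachable": reachable,
        "response_time_ms": round(elapsed_ms, 2),
        "error": error,
    }


if __name__ == "__main__":
    # Hypothetical dependency endpoint; replace with a real host and port.
    print(probe_tcp("db.example.internal", 5432, timeout=1.0))
```

Running a probe like this on a schedule and emitting the result as a metric or log entry gives you a signal for when a dependency issue requires a response.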

Implement transaction traceability

Implement transaction traceability: Design your application and workload to emit information about the flow of transactions across system components, such as transaction stage, active component, and time to complete activity. Use this information to determine what is in progress, what is complete, and what the results of completed activities are. This helps you determine when a response is required. For example, longer than expected transaction response times within a component can indicate issues with that component.
AWS X-Ray
What is AWS X-Ray?
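
The idea of recording transaction stage, active component, and time to complete can be sketched with a small in-process tracer. This is an illustration only and is not the AWS X-Ray SDK API; the class Trace, its span and slow_spans methods, and the record fields are all hypothetical names.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects timed stages that share one transaction (trace) ID."""

    def __init__(self, trace_id=None):
        self.trace_id = trace_id or uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, component, stage):
        """Record status and duration for one stage of the transaction."""
        start = time.perf_counter()
        record = {
            "trace_id": self.trace_id,
            "component": component,
            "stage": stage,
            "status": "in_progress",
        }
        self.spans.append(record)
        try:
            yield record
            record["status"] = "complete"
        except Exception:
            record["status"] = "error"
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000.0

    def slow_spans(self, threshold_ms):
        """Return finished spans that took longer than the threshold."""
        return [s for s in self.spans if s.get("duration_ms", 0.0) > threshold_ms]


if __name__ == "__main__":
    trace = Trace()
    with trace.span("cart", "reserve-items"):
        pass
    with trace.span("payment", "charge-card"):
        time.sleep(0.05)  # stand-in for a slow downstream call
    print(trace.slow_spans(threshold_ms=20))
```

Because every record carries the same trace ID, the stages of one transaction can be correlated across components, and a component with a longer than expected duration stands out.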