OPS 1: How do you evaluate your Serverless application’s health?
Evaluating your metrics, distributed tracing, and logging gives you insight into business and operational events, and helps you understand which services should be optimized to improve your customer’s experience.
Resources
Amazon CloudWatch Metrics and Dimensions
AWS Personal Health Dashboard
Amazon CloudWatch Automated Dashboard
AWS Serverless Monitoring Partners
re:Invent 2019 - Production-grade full-stack apps with AWS Amplify
Best Practices:
- Understand, analyze, and alert on metrics provided out of the box: Each managed service emits metrics out of the box. Establish key metrics for each managed service as the basis for comparison, and for identifying under- and over-performing components. Examples of key metrics include function errors, queue depth, failed state machine executions, and response times.
- Use application, business, and operations metrics: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine workload success and operational health.
- Use distributed tracing and instrument code with additional context: Instrument your application code to emit information about its status, correlation identifiers, business outcomes, and information to determine transaction flows across your workload.
- Use structured and centralized logging: Standardize your application logging to emit operational information about transactions, correlation identifiers, request identifiers across components, and business outcomes. Use this information to answer arbitrary questions about the state of your workload.
Improvement Plan
Understand, analyze, and alert on metrics provided out of the box
- Use Amazon CloudWatch per-service and cross-service automatic dashboards to quickly visualize key metrics for each AWS service you use.
Amazon CloudWatch Automated Dashboard
Amazon CloudWatch Cross Service Dashboard
- Examples
  - For AWS Lambda, alert on Duration, Errors, Throttles, and ConcurrentExecutions. For stream-based invocations, alert on IteratorAge. For asynchronous invocations, alert on DeadLetterErrors.
  - For Amazon API Gateway, alert on IntegrationLatency, Latency, and 5XXError.
  - For Application Load Balancer, alert on HTTPCode_ELB_5XX_Count, RejectedConnectionCount, HTTPCode_Target_5XX_Count, UnHealthyHostCount, LambdaInternalError, and LambdaUserError.
  - For AWS AppSync, alert on 5XX and Latency.
  - For Amazon SQS, alert on ApproximateAgeOfOldestMessage.
  - For Amazon Kinesis Data Streams, alert on ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds, PutRecord.Success, PutRecords.Success (if using Kinesis Producer Library), and GetRecords.Success.
  - For Amazon SNS, alert on NumberOfNotificationsFailed and NumberOfNotificationsFilteredOut-InvalidAttributes.
  - For Amazon SES, alert on Rejects, Bounces, Complaints, and Rendering Failures.
  - For AWS Step Functions, alert on ExecutionThrottled, ExecutionsFailed, and ExecutionsTimedOut.
  - For Amazon EventBridge, alert on FailedInvocations and ThrottledRules.
  - For Amazon S3, alert on 5xxErrors and TotalRequestLatency.
  - For Amazon DynamoDB, alert on ReadThrottleEvents, WriteThrottleEvents, SystemErrors, ThrottledRequests, and UserErrors.
- For metrics that have a discernible pattern or trend and a minimal number of missing data points, and where one-time events are not key to alert on, alert based on Amazon CloudWatch anomaly detection expected values rather than static thresholds.
AWS Lambda CloudWatch Metrics
Amazon API Gateway CloudWatch Metrics
AWS Application Load Balancer CloudWatch Metrics
AWS AppSync CloudWatch Metrics
Amazon SQS CloudWatch Metrics
Amazon Kinesis Data Streams CloudWatch Metrics
Amazon SNS CloudWatch Metrics
Amazon SES CloudWatch Metrics
AWS Step Functions CloudWatch Metrics
Amazon EventBridge CloudWatch Metrics
Amazon S3 CloudWatch Metrics
Amazon DynamoDB CloudWatch Metrics
Creating a CloudWatch Alarm based on a static threshold
Creating a CloudWatch Alarm based on Metric Math expressions
Example alerts programmatically created via AWS CloudFormation
Creating a CloudWatch Alarm based on Anomaly detection
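As a hedged illustration of the static-threshold alarm resource above, the following is a minimal boto3 sketch that alarms on Lambda function Errors. The function name, SNS topic ARN, and threshold values are assumptions to replace with your own; the same pattern applies to the other key metrics listed above by swapping the Namespace, MetricName, and Dimensions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical function name and notification topic; replace with your own.
FUNCTION_NAME = "checkout-service"
ALARM_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

# Static-threshold alarm: trigger when the function reports any errors
# in three consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName=f"{FUNCTION_NAME}-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC_ARN],
)
```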
Use application, business, and operations metrics
- Decide which business, customer, and operations metrics to create programmatically and what to alert on. The result provides a more complete picture of your workload's health and its impact on the business (a minimal sketch of emitting such a metric follows the examples below).
Creating Custom Metrics Asynchronously with Amazon CloudWatch
- Business metric examples
  - Number of orders created, payment operations, number of reservations, etc.
- Application metric examples
  - Message queue length, integration latency, throttling, etc.
- Operations metric examples
  - CI/CD feedback time, mean-time-between-failure, mean-time-between-recovery, number of on-call pages and time to resolution, etc.
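As a minimal sketch of the asynchronous custom-metric approach referenced above, the Lambda handler below prints a CloudWatch Embedded Metric Format record so CloudWatch extracts the metric from the log without a synchronous PutMetricData call. The namespace, dimension, and metric names are illustrative assumptions.

```python
import json
import time


def emit_order_created_metric(service: str, count: int = 1) -> None:
    """Print a CloudWatch Embedded Metric Format (EMF) log line.

    CloudWatch extracts the metric from the log asynchronously, so no
    metric API call is made on the request path. Namespace, dimension,
    and metric names below are illustrative assumptions.
    """
    emf_record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "BookingService",
                    "Dimensions": [["service"]],
                    "Metrics": [{"Name": "OrdersCreated", "Unit": "Count"}],
                }
            ],
        },
        "service": service,
        "OrdersCreated": count,
    }
    print(json.dumps(emf_record))


def handler(event, context):
    # ... business logic that creates the order ...
    emit_order_created_metric(service="payments")
    return {"statusCode": 200}
```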
Use distributed tracing and instrument code with additional context
- Use AWS X-Ray annotations or labels from trusted third-party tracing providers to easily group or filter traces, for example by customer ID, payment ID, or state machine execution ID.
Adding annotations and metadata - AWS X-Ray Python SDK
Adding annotations and metadata - AWS X-Ray Node.js SDK
Adding annotations and metadata - AWS X-Ray Java SDK
Adding annotations and metadata - AWS X-Ray Go SDK
Adding annotations and metadata - AWS X-Ray Ruby SDK
Adding annotations and metadata - AWS X-Ray .NET SDK
- This will help determine latency distribution, response times, number of retries, response codes, and exceptions.
- Use labels or annotations to inject business context for requests made to your system and to third parties, so you can filter and compare traces across components, as shown in the sketch below.
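For example, a minimal sketch using the AWS X-Ray Python SDK, where booking_id and customer_id are hypothetical business identifiers: annotations are indexed so traces can be grouped and filtered, while metadata carries richer, non-indexed context.

```python
from aws_xray_sdk.core import xray_recorder


def confirm_booking(booking_id: str, customer_id: str, payload: dict) -> None:
    # Open a custom subsegment for this unit of work.
    subsegment = xray_recorder.begin_subsegment("confirm_booking")
    try:
        # Annotations are indexed, so traces can be grouped or filtered
        # by these business identifiers (hypothetical keys).
        subsegment.put_annotation("booking_id", booking_id)
        subsegment.put_annotation("customer_id", customer_id)

        # Metadata is not indexed but is visible on the trace for debugging.
        subsegment.put_metadata("request_payload", payload)

        # ... business logic ...
    finally:
        xray_recorder.end_subsegment()
```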
- With the AWS X-Ray SDK or a trusted third-party tracing provider, you can automatically instrument the AWS SDK as well as popular HTTP and database libraries, as shown in the sketch after the resource links below.
Instrumenting downstream calls - AWS X-Ray Python SDK
Instrumenting downstream calls - AWS X-Ray Node.js SDK
Instrumenting downstream calls - AWS X-Ray Java SDK
Instrumenting downstream calls - AWS X-Ray Go SDK
Instrumenting downstream calls - AWS X-Ray Ruby SDK
Instrumenting downstream calls - AWS X-Ray .NET SDK
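A minimal sketch with the AWS X-Ray Python SDK, assuming a hypothetical DynamoDB table and third-party endpoint: patching supported libraries once at module load records downstream AWS SDK and HTTP calls as subsegments automatically.

```python
import boto3
import requests
from aws_xray_sdk.core import patch_all

# Patch supported libraries (boto3, requests, and others) so that
# downstream calls are recorded as X-Ray subsegments automatically.
patch_all()

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("bookings")  # hypothetical table name


def handler(event, context):
    # Both calls below are traced as downstream subsegments.
    table.get_item(Key={"booking_id": event["booking_id"]})
    requests.get("https://example.com/payments/status", timeout=2)
    return {"statusCode": 200}
```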
Use structured and centralized logging
- Examples
  - Use correlation_id, request_id, customer_id, service_name, timestamp, function_arn, function_memory as JSON keys.
- Examples
  - logging.info({"operation": "cancel_booking", "details": result...}) (see the sketch below, which combines both examples)
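Putting these examples together, here is a minimal structured-logging sketch for a Lambda handler; the service name and the source of the correlation identifier are assumptions.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def log_event(level: int, message: str, **fields) -> None:
    """Emit a single JSON log line so every record is machine-queryable."""
    logger.log(level, json.dumps({"message": message, **fields}))


def handler(event, context):
    # Assumption: the correlation ID arrives in the event payload, falling
    # back to the Lambda request ID when it is absent.
    correlation_id = event.get("correlation_id", context.aws_request_id)

    result = {"status": "CANCELLED"}  # placeholder business outcome
    log_event(
        logging.INFO,
        "cancel_booking",
        correlation_id=correlation_id,
        request_id=context.aws_request_id,
        service_name="booking-service",
        function_arn=context.invoked_function_arn,
        details=result,
    )
    return {"statusCode": 200}
```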
- Sampling is an efficient mechanism that lets you set the log level to debug for a percentage of requests in your logging framework while maintaining the default log level of info, as in the sketch below.
Serverless Hero: Example of using structured logging with AWS Lambda
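A minimal sketch of this sampling approach, assuming a 10% debug sample rate controlled by a hypothetical environment variable: a small share of invocations is logged at DEBUG while the rest stay at INFO.

```python
import logging
import os
import random

# Hypothetical configuration: sample 10% of requests at DEBUG by default.
SAMPLE_DEBUG_RATE = float(os.environ.get("SAMPLE_DEBUG_RATE", "0.1"))

logger = logging.getLogger()


def handler(event, context):
    # Decide per invocation whether this request gets verbose logging.
    if random.random() < SAMPLE_DEBUG_RATE:
        logger.setLevel(logging.DEBUG)
    else:
        logger.setLevel(logging.INFO)

    logger.debug("Full event payload: %s", event)  # emitted only for sampled requests
    logger.info("Processing request %s", context.aws_request_id)
    # ... business logic ...
    return {"statusCode": 200}
```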