OPS 1: How do you evaluate your Serverless application’s health?
Evaluating your metrics, distributed tracing, and logging gives you insight into business and operational events, and helps you understand which services should be optimized to improve your customer’s experience.
Resources
Amazon CloudWatch Metrics and Dimensions
AWS Personal Health Dashboard
Amazon CloudWatch Automated Dashboard
AWS Serverless Monitoring Partners
re:Invent 2019 - Production-grade full-stack apps with AWS Amplify
Best Practices:
- Understand, analyze, and alert on metrics provided out of the box: Each managed service emits metrics out of the box. Establish key metrics for each managed service as the basis for comparison, and for identifying under- and over-performing components. Examples of key metrics include function errors, queue depth, failed state machine executions, and response times.
- Use application, business, and operations metrics: Identify key performance indicators (KPIs) based on desired business and customer outcomes. Evaluate KPIs to determine workload success and operational health.
- Use distributed tracing and instrument code with additional context: Instrument your application code to emit information about its status, correlation identifiers, business outcomes, and information to determine transaction flows across your workload.
- Use structured and centralized logging: Standardize your application logging to emit operational information about transactions, correlation identifiers, request identifiers across components, and business outcomes. Use this information to answer arbitrary questions about the state of your workload.
Improvement Plan
Understand, analyze, and alert on metrics provided out of the box
- Use Amazon CloudWatch per-service and cross-service automatic dashboards to quickly visualize key metrics for each AWS service you use.
Amazon CloudWatch Automated Dashboard
Amazon CloudWatch Cross Service Dashboard
- Examples
  - For AWS Lambda, alert on Duration, Errors, Throttles, and ConcurrentExecutions. For stream-based invocations, alert on IteratorAge. For asynchronous invocations, alert on DeadLetterErrors.
  - For Amazon API Gateway, alert on IntegrationLatency, Latency, and 5XXError.
  - For Application Load Balancer, alert on HTTPCode_ELB_5XX_Count, RejectedConnectionCount, HTTPCode_Target_5XX_Count, UnHealthyHostCount, LambdaInternalError, and LambdaUserError.
  - For AWS AppSync, alert on 5XX and Latency.
  - For Amazon SQS, alert on ApproximateAgeOfOldestMessage.
  - For Amazon Kinesis Data Streams, alert on ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds, PutRecord.Success, PutRecords.Success (if using Kinesis Producer Library), and GetRecords.Success.
  - For Amazon SNS, alert on NumberOfNotificationsFailed and NumberOfNotificationsFilteredOut-InvalidAttributes.
  - For Amazon SES, alert on Rejects, Bounces, Complaints, and Rendering Failures.
  - For AWS Step Functions, alert on ExecutionThrottled, ExecutionsFailed, and ExecutionsTimedOut.
  - For Amazon EventBridge, alert on FailedInvocations and ThrottledRules.
  - For Amazon S3, alert on 5xxErrors and TotalRequestLatency.
  - For Amazon DynamoDB, alert on ReadThrottleEvents, WriteThrottleEvents, SystemErrors, ThrottledRequests, and UserErrors.
- For metrics that have a discernible pattern or trend and a minimal number of missing data points, and where one-time events are not key to alert on, alert based on Amazon CloudWatch anomaly detection expected values rather than static thresholds.
AWS Lambda CloudWatch Metrics
Amazon API Gateway CloudWatch Metrics
AWS Application Load Balancer CloudWatch Metrics
AWS AppSync CloudWatch Metrics
Amazon SQS CloudWatch Metrics
Amazon Kinesis Data Streams CloudWatch Metrics
Amazon SNS CloudWatch Metrics
Amazon SES CloudWatch Metrics
AWS Step Functions CloudWatch Metrics
Amazon EventBridge CloudWatch Metrics
Amazon S3 CloudWatch Metrics
Amazon DynamoDB CloudWatch Metrics
Creating a CloudWatch Alarm based on a static threshold
Creating a CloudWatch Alarm based on Metric Math expressions
Example alerts programmatically created via AWS CloudFormation
Creating a CloudWatch Alarm based on Anomaly detection
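As a hedged illustration of the static-threshold alarm resource above, the following is a minimal boto3 sketch that alarms on Lambda function Errors. The function name, SNS topic ARN, and threshold values are assumptions to replace with your own; the same pattern applies to the other key metrics listed above by swapping the Namespace, MetricName, and Dimensions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical function name and notification topic; replace with your own.
FUNCTION_NAME = "checkout-service"
ALARM_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"

# Static-threshold alarm: trigger when the function reports any errors
# in three consecutive 1-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName=f"{FUNCTION_NAME}-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC_ARN],
)
```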
Use application, business, and operations metrics
- Decide which business, customer, and operations metrics to create programmatically and what to alert on. The result provides a more complete picture of your workload's health and its impact on the business (a minimal sketch of emitting such a metric follows the examples below).
Creating Custom Metrics Asynchronously with Amazon CloudWatch
- Business metric examples
  - Number of orders created, payment operations, number of reservations, etc.
- Application metric examples
  - Message queue length, integration latency, throttling, etc.
- Operations metric examples
  - CI/CD feedback time, mean-time-between-failure, mean-time-between-recovery, number of on-call pages and time to resolution, etc.
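As a minimal sketch of the asynchronous custom-metric approach referenced above, the Lambda handler below prints a CloudWatch Embedded Metric Format record so CloudWatch extracts the metric from the log without a synchronous PutMetricData call. The namespace, dimension, and metric names are illustrative assumptions.

```python
import json
import time


def emit_order_created_metric(service: str, count: int = 1) -> None:
    """Print a CloudWatch Embedded Metric Format (EMF) log line.

    CloudWatch extracts the metric from the log asynchronously, so no
    metric API call is made on the request path. Namespace, dimension,
    and metric names below are illustrative assumptions.
    """
    emf_record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "BookingService",
                    "Dimensions": [["service"]],
                    "Metrics": [{"Name": "OrdersCreated", "Unit": "Count"}],
                }
            ],
        },
        "service": service,
        "OrdersCreated": count,
    }
    print(json.dumps(emf_record))


def handler(event, context):
    # ... business logic that creates the order ...
    emit_order_created_metric(service="payments")
    return {"statusCode": 200}
```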
Use distributed tracing and instrument code with additional context
- Use AWS X-Ray annotations or labels from trusted third-party tracing providers to easily group or filter traces, for example by customer ID, payment ID, or state machine execution ID.
Adding annotations and metadata - AWS X-Ray Python SDK
Adding annotations and metadata - AWS X-Ray Node.js SDK
Adding annotations and metadata - AWS X-Ray Java SDK
Adding annotations and metadata - AWS X-Ray Go SDK
Adding annotations and metadata - AWS X-Ray Ruby SDK
Adding annotations and metadata - AWS X-Ray .NET SDK
- This will help determine latency distribution, response times, number of retries, response codes, and exceptions.
- Use labels or annotations to inject business context for requests made to your system and to third parties, so you can filter and compare traces across components, as shown in the sketch below.
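For example, a minimal sketch using the AWS X-Ray Python SDK, where booking_id and customer_id are hypothetical business identifiers: annotations are indexed so traces can be grouped and filtered, while metadata carries richer, non-indexed context.

```python
from aws_xray_sdk.core import xray_recorder


def confirm_booking(booking_id: str, customer_id: str, payload: dict) -> None:
    # Open a custom subsegment for this unit of work.
    subsegment = xray_recorder.begin_subsegment("confirm_booking")
    try:
        # Annotations are indexed, so traces can be grouped or filtered
        # by these business identifiers (hypothetical keys).
        subsegment.put_annotation("booking_id", booking_id)
        subsegment.put_annotation("customer_id", customer_id)

        # Metadata is not indexed but is visible on the trace for debugging.
        subsegment.put_metadata("request_payload", payload)

        # ... business logic ...
    finally:
        xray_recorder.end_subsegment()
```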
- With the AWS X-Ray SDK or a trusted third-party tracing provider, you can automatically instrument the AWS SDK as well as popular HTTP and database libraries, as shown in the sketch after the resource links below.
Instrumenting downstream calls - AWS X-Ray Python SDK
Instrumenting downstream calls - AWS X-Ray Node.js SDK
Instrumenting downstream calls - AWS X-Ray Java SDK
Instrumenting downstream calls - AWS X-Ray Go SDK
Instrumenting downstream calls - AWS X-Ray Ruby SDK
Instrumenting downstream calls - AWS X-Ray .NET SDK
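A minimal sketch with the AWS X-Ray Python SDK, assuming a hypothetical DynamoDB table and third-party endpoint: patching supported libraries once at module load records downstream AWS SDK and HTTP calls as subsegments automatically.

```python
import boto3
import requests
from aws_xray_sdk.core import patch_all

# Patch supported libraries (boto3, requests, and others) so that
# downstream calls are recorded as X-Ray subsegments automatically.
patch_all()

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("bookings")  # hypothetical table name


def handler(event, context):
    # Both calls below are traced as downstream subsegments.
    table.get_item(Key={"booking_id": event["booking_id"]})
    requests.get("https://example.com/payments/status", timeout=2)
    return {"statusCode": 200}
```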
Use structured and centralized logging
- Examples
  - Use correlation_id, request_id, customer_id, service_name, timestamp, function_arn, function_memory as JSON keys.
- Examples
  - logging.info({"operation": "cancel_booking", "details": result...}) (see the sketch below, which combines both examples)
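Putting these examples together, here is a minimal structured-logging sketch for a Lambda handler; the service name and the source of the correlation identifier are assumptions.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def log_event(level: int, message: str, **fields) -> None:
    """Emit a single JSON log line so every record is machine-queryable."""
    logger.log(level, json.dumps({"message": message, **fields}))


def handler(event, context):
    # Assumption: the correlation ID arrives in the event payload, falling
    # back to the Lambda request ID when it is absent.
    correlation_id = event.get("correlation_id", context.aws_request_id)

    result = {"status": "CANCELLED"}  # placeholder business outcome
    log_event(
        logging.INFO,
        "cancel_booking",
        correlation_id=correlation_id,
        request_id=context.aws_request_id,
        service_name="booking-service",
        function_arn=context.invoked_function_arn,
        details=result,
    )
    return {"statusCode": 200}
```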
- Sampling is an efficient mechanism that lets you set the log level to debug for a percentage of requests in your logging framework while maintaining the default log level of info, as in the sketch below.
Serverless Hero: Example of using structured logging with AWS Lambda
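A minimal sketch of this sampling approach, assuming a 10% debug sample rate controlled by a hypothetical environment variable: a small share of invocations is logged at DEBUG while the rest stay at INFO.

```python
import logging
import os
import random

# Hypothetical configuration: sample 10% of requests at DEBUG by default.
SAMPLE_DEBUG_RATE = float(os.environ.get("SAMPLE_DEBUG_RATE", "0.1"))

logger = logging.getLogger()


def handler(event, context):
    # Decide per invocation whether this request gets verbose logging.
    if random.random() < SAMPLE_DEBUG_RATE:
        logger.setLevel(logging.DEBUG)
    else:
        logger.setLevel(logging.INFO)

    logger.debug("Full event payload: %s", event)  # emitted only for sampled requests
    logger.info("Processing request %s", context.aws_request_id)
    # ... business logic ...
    return {"statusCode": 200}
```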