REL 2: How do you build resiliency into your Serverless application?

Evaluate scaling mechanisms for Serverless and non-Serverless resources to meet customer demand, and build resiliency to withstand partial and intermittent failures across dependencies.

Resources

The Amazon Builder's Library
Optimizing AWS SDK for AWS Lambda
AWS Lambda error and retry behavior
Serverless Hero: Production tips for working with Amazon Kinesis Data Streams

Best Practices:

Manage transaction, partial, and intermittent failures: Transaction failures might occur when components are under high load. Partial failures can occur during batch processing, while intermittent failures might occur due to network or other transient issues.
Manage duplicate and unwanted events: Duplicate events can occur when a request is retried, multiple consumers process the same message from a queue or stream, or when a request is sent twice at different time intervals with the same parameters. Design your applications to process multiple identical requests to have the same effect as making a single request. Events not adhering to your schema should be discarded.
Consider scaling patterns at burst rates: In addition to your baseline performance, evaluate how your workload handles initial burst rates, whether those peaks are expected or unexpected.
Orchestrate long-running transactions: Long-running transactions can be processed by one or multiple components. Favor state machines for long-running transactions instead of handling them within application code in a single component or across multiple synchronous dependency call chains.

Improvement Plan

Manage transaction, partial, and intermittent failures

Use exponential backoff with jitter.

When responding to callers in fail-fast scenarios and under performance degradation, inform the caller via headers or metadata when they can retry. AWS SDKs implement backoff and retry logic by default.
Implement similar logic in your dependency layer when calling your own dependent services.
For downstream calls, tune AWS and third-party SDK retries, backoffs, and TCP and HTTP timeouts against your component timeout to help you decide when to stop retrying.
Error Retries and Exponential Backoff in AWS
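
The retry guidance above can be sketched as follows. This is a minimal illustration of capped exponential backoff with "full jitter", not the logic of any particular SDK. The function names, default delays, and the set of retryable exceptions are assumptions chosen for the example.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one 'full jitter' delay (in seconds) for each allowed retry."""
    for attempt in range(max_retries):
        # Exponential ceiling, capped so delays do not grow without bound.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling) to spread out retries.
        yield rng() * ceiling


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      retryable=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Run operation(); retry retryable errors with jittered backoff."""
    delays = backoff_delays(max_attempts - 1, base, cap)
    while True:
        try:
            return operation()
        except retryable:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the error to the caller
            sleep(delay)
```

Keeping `max_attempts` and `cap` small relative to the component timeout helps the caller stop retrying before its own deadline expires.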

Use a dead-letter queue mechanism to retain, investigate, and retry failed transactions

AWS Lambda allows failed transactions to be sent to a dedicated Amazon SQS dead-letter queue on a per function basis.
Amazon Kinesis Data Streams and Amazon DynamoDB Streams retry the entire batch of items. Repeated errors block processing of the affected shard until the error is resolved or the items expire.
Within AWS Lambda, you can configure Maximum Retry Attempts, Maximum Record Age, and Destination on Failure to control retries while processing data records, and to effectively remove poison-pill messages from the batch by sending their metadata to an Amazon SQS dead-letter queue for further analysis.
When to use dead-letter queues
Example: Serverless Application Repository DLQ redriver
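
As a rough sketch, the stream settings described above map onto parameters of the Lambda event source mapping API. The helper below only builds the keyword arguments and calls nothing in AWS. The helper name, default values, and the choice to enable batch bisection (which the text above does not mention) are assumptions for illustration.

```python
def stream_mapping_settings(stream_arn, function_name, dlq_arn,
                            max_retries=3, max_age_seconds=3600):
    """Build keyword arguments for Lambda's create_event_source_mapping."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        # Maximum Retry Attempts: stop retrying a failing batch after this many tries.
        "MaximumRetryAttempts": max_retries,
        # Maximum Record Age: skip records older than this many seconds.
        "MaximumRecordAgeInSeconds": max_age_seconds,
        # Assumption: split a failing batch in two to isolate a poison-pill record.
        "BisectBatchOnFunctionError": True,
        # Destination on Failure: send metadata about discarded records to the queue.
        "DestinationConfig": {"OnFailure": {"Destination": dlq_arn}},
    }
```

With boto3 installed and credentials configured, these settings could be applied with `boto3.client("lambda").create_event_source_mapping(**settings)`.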

Manage duplicate and unwanted events

Generate unique attributes needed to manage duplicate events at the beginning of the transaction.

Depending on the final destination, duplicate events might write to the same record with the same content instead of generating a duplicate entry, and therefore may not require additional safeguards.

Use an external system, such as a database, to store unique attributes of a transaction that can be verified for duplicates.

These unique attributes, also known as idempotency tokens, can be business-specific, such as transaction ID, payment ID, booking ID, opaque random alphanumeric string, unique correlation identifiers, or the hash of the content.
Use Amazon DynamoDB as a control database to store transactions and idempotency tokens.
- Examples
  - Use a conditional write to fail a refund operation if a payment reference has already been refunded, thus signaling to the application that it is a duplicate transaction.
  - The application can then catch this exception and return the same result to the customer as if the refund was processed successfully.
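
The refund example can be sketched as follows. To keep the sketch runnable without AWS access, an in-memory class stands in for the DynamoDB control table. `ControlTable`, `put_if_absent`, and `refund` are hypothetical names, and the attribute name in the comment is an assumption.

```python
class DuplicateTransaction(Exception):
    """Raised when the idempotency token has already been recorded."""


class ControlTable:
    """In-memory stand-in for a DynamoDB control table (hypothetical)."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, token, result):
        # Mirrors a conditional write, for example a PutItem call with
        # ConditionExpression="attribute_not_exists(payment_reference)".
        if token in self._items:
            raise DuplicateTransaction(token)
        self._items[token] = result

    def get(self, token):
        return self._items[token]


def refund(table, payment_reference, amount):
    """Record a refund once; replay the stored result for duplicates."""
    result = {"payment_reference": payment_reference,
              "refunded": amount, "status": "REFUNDED"}
    try:
        table.put_if_absent(payment_reference, result)
    except DuplicateTransaction:
        # Duplicate request: return the original outcome, refund nothing new.
        return table.get(payment_reference)
    return result
```

Calling `refund` twice with the same payment reference returns the same response both times, so the caller cannot tell a retry from the first attempt.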

Validate events using a pre-defined and agreed upon schema.

AWS Lambda functions can use one or more event sources to trigger invocation. If events can be issued by external sources, such as your customers or machine-generated processes, use a schema to validate that each event conforms to what you expect to process, either within your application code or at the event source when applicable.
Community Hero: Your AWS Lambda function might execute twice
Stripe Idempotent tokens generated by consumers
Setting Time-to-live for Amazon DynamoDB records/idempotent tokens
Request Validation in Amazon API Gateway
Matching events patterns in EventBridge
Filtering messages with Amazon SNS
JSON Schema Implementations
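
A minimal sketch of validating events inside application code follows. It uses a hand-rolled field and type check in place of a JSON Schema library so that it runs with the standard library alone. The schema, field names, and response shape are assumptions for illustration.

```python
# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {
    "order_id": str,
    "quantity": int,
}


def validate_event(event, schema=ORDER_SCHEMA):
    """Return a list of problems; an empty list means the event is valid."""
    if not isinstance(event, dict):
        return ["event must be an object"]
    problems = []
    for field, expected in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif type(event[field]) is not expected:
            # Exact type match, so True is not accepted where an int is expected.
            problems.append(f"wrong type for field: {field}")
    for field in event:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def handler(event, context=None):
    """Lambda-style entry point that discards events failing validation."""
    problems = validate_event(event)
    if problems:
        return {"status": "discarded", "problems": problems}
    return {"status": "accepted", "order_id": event["order_id"]}
```

Where the event source supports it, the same contract can be enforced before the function is invoked, as the resources above describe for API Gateway, EventBridge, and Amazon SNS.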

Consider scaling patterns at burst rates

Perform load tests using a burst strategy with random intervals of idleness.

Load test using bursts of requests over a short period of time, and introduce delays between bursts to allow your components to recover from unexpected load. This helps you future-proof your workload for key events where you cannot be certain how much variance in requests to expect.
AWS Marketplace: Gatling FrontLine Load Testing
Amazon Partner: BlazeMeter Load Testing
Amazon Partner: Apica Load Testing
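
The burst strategy can be sketched as a schedule of request send times that a load-testing tool or custom driver would then replay. The function below is a hypothetical helper, not part of any of the tools listed above, and its parameters are assumptions.

```python
import random


def burst_schedule(bursts, requests_per_burst, burst_seconds,
                   min_idle, max_idle, seed=None):
    """Return sorted request send times, in seconds from the test start."""
    rng = random.Random(seed)
    times = []
    clock = 0.0
    for _ in range(bursts):
        # Spread this burst's requests randomly inside a short window.
        times.extend(clock + rng.uniform(0, burst_seconds)
                     for _ in range(requests_per_burst))
        # Idle for a random interval so components can recover before the next burst.
        clock += burst_seconds + rng.uniform(min_idle, max_idle)
    return sorted(times)
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing metrics before and after a configuration change.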

Review service account limits with combined utilization across resources.

Review Amazon API Gateway account level limits such as number of requests per second across all APIs.
Review AWS Lambda function concurrency reservations and ensure that there is enough pool capacity to allow other functions to scale.
Review Amazon CloudFront requests per second per distribution.
Review AWS Lambda@Edge requests per second and concurrency limit.
Review AWS IoT Message Broker concurrent requests per second.
Review EventBridge API requests and target invocations limit.
Review Amazon Cognito API limits.
Review Amazon DynamoDB throughput, indexes, and request rates limits.
AWS AppSync throttle rate limits
Amazon CloudFront requests rates limits
AWS Lambda@Edge request rates limits
AWS IoT Message Broker connections and requests limit
EventBridge request rates limits
Amazon Cognito request rates limits
Amazon DynamoDB request rates and resources limit
AWS Services General Limits

Evaluate key metrics to understand how your workload recovers from bursts.

For AWS Lambda, review Duration, Errors, Throttles, ConcurrentExecutions, and UnreservedConcurrentExecutions
For Amazon API Gateway, review Latency, IntegrationLatency, 5XXError, 4XXError
For Application Load Balancer, HTTPCode_ELB_5XX_Count, RejectedConnectionCount, HTTPCode_Target_5XX_Count, UnHealthyHostCount, LambdaInternalError, LambdaUserError
For AWS AppSync, 5XX and Latency
For Amazon SQS, ApproximateAgeOfOldestMessage
For Amazon Kinesis Data Streams, ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds, PutRecord.Success, PutRecords.Success (if using Kinesis Producer Library) and GetRecords.Success
For Amazon SNS, NumberOfNotificationsFailed, NumberOfNotificationsFilteredOut-InvalidAttributes
For Amazon SES, Rejects, Bounces, Complaints, Rendering Failures
For AWS Step Functions, ExecutionThrottled, ExecutionsFailed, ExecutionsTimedOut
For Amazon EventBridge, FailedInvocations, ThrottledRules
For Amazon S3, 5xxErrors, TotalRequestLatency
For Amazon DynamoDB, ReadThrottleEvents, WriteThrottleEvents, SystemErrors, ThrottledRequests, UserErrors
AWS Lambda CloudWatch Metrics
Amazon API Gateway CloudWatch Metrics
AWS Application Load Balancer CloudWatch Metrics
AWS AppSync CloudWatch Metrics
Amazon SQS CloudWatch Metrics
Amazon Kinesis Data Streams CloudWatch Metrics
Amazon SNS CloudWatch Metrics
Amazon SES CloudWatch Metrics
AWS Step Functions CloudWatch Metrics
Amazon EventBridge CloudWatch Metrics
Amazon S3 CloudWatch Metrics
Amazon DynamoDB CloudWatch Metrics

Orchestrate long-running transactions

Use a state machine to provide a visual representation of distributed transactions, and to separate business logic from orchestration logic.

AWS Step Functions lets you coordinate multiple AWS services into Serverless workflows via state machines.
Within Step Functions, you can set separate retries, backoff rates, max attempts, intervals, and timeouts for every step of your state machine using a declarative language.
State Machine Error handling example
Example AWS Step Functions State Machine via AWS SAM
- Examples
  - The Refund-Flight function is invoked only if the Allocate-Seat function still fails after three retry attempts made at 1-second intervals.
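
The Allocate-Seat and Refund-Flight example can be sketched as an Amazon States Language definition, built here as a Python dictionary so it can be serialized to JSON. The function ARNs, the timeout value, and the constant backoff rate are assumptions for illustration and are not taken from the linked example.

```python
import json


def booking_state_machine(allocate_seat_arn, refund_flight_arn):
    """Build an Amazon States Language definition as a Python dict."""
    return {
        "StartAt": "Allocate-Seat",
        "States": {
            "Allocate-Seat": {
                "Type": "Task",
                "Resource": allocate_seat_arn,
                "TimeoutSeconds": 10,  # assumed per-step timeout
                # Retry up to three times, one second apart (no growth in interval).
                "Retry": [{
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 1,
                    "MaxAttempts": 3,
                    "BackoffRate": 1.0,
                }],
                # Only after retries are exhausted, compensate with a refund.
                "Catch": [{
                    "ErrorEquals": ["States.ALL"],
                    "Next": "Refund-Flight",
                }],
                "End": True,
            },
            "Refund-Flight": {
                "Type": "Task",
                "Resource": refund_flight_arn,
                "End": True,
            },
        },
    }


def to_json(definition):
    """Serialize the definition for use in a template or API call."""
    return json.dumps(definition, indent=2)
```

Because retries and error routing live in the definition, the Lambda functions themselves contain only business logic.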

Use dead-letter queues in response to failed state machine executions.

For high durability within your state machines, use AWS Step Functions service integrations to send failed transactions to a dead-letter queue of your choice as the final step.
For low latency and no strict success rate requirements, you can use function composition with AWS Lambda functions calling other functions asynchronously.
Failed transactions are retried at least twice, depending on the event source, and then sent to each function’s dead-letter queue (for example, Amazon SQS or Amazon SNS).
Set alerts on the number of messages in the dead-letter queue, and either re-drive messages back to the workflow or disable parts of the workflow temporarily.
Sending failed transactions to Amazon SQS within Step Functions State Machine
Serverless Hero: Function composition using asynchronous invocations
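
Routing failed state machine steps to a dead-letter queue can be sketched as a transformation of an Amazon States Language definition. The helper below adds a final Amazon SQS `sendMessage` step and points Task states at it. The helper name and state name are hypothetical, and the sketch assumes each state's input is a JSON object so the error details can be attached under `$.error`.

```python
import copy


def add_dead_letter_queue(definition, queue_url, dlq_name="Send-To-DLQ"):
    """Return a copy of the definition with an SQS dead-letter step added."""
    result = copy.deepcopy(definition)
    for state in result["States"].values():
        # Leave states that already define their own error routing untouched.
        if state.get("Type") == "Task" and "Catch" not in state:
            state["Catch"] = [{
                "ErrorEquals": ["States.ALL"],
                # Keep the original input and attach the error details to it.
                "ResultPath": "$.error",
                "Next": dlq_name,
            }]
    # Final step: Step Functions service integration with Amazon SQS.
    result["States"][dlq_name] = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {
            "QueueUrl": queue_url,
            "MessageBody.$": "$",  # send the whole failed payload
        },
        "End": True,
    }
    return result
```

An alarm on the queue depth then signals when messages need to be re-driven to the workflow.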