REL 2: How do you build resiliency into your Serverless application?
Evaluate scaling mechanisms for Serverless and non-Serverless resources to meet customer demand, and build resiliency to withstand partial and intermittent failures across dependencies.
Resources
The Amazon Builder's Library
Optimizing AWS SDK for AWS Lambda
AWS Lambda error and retry behavior
Serverless Hero: Production tips for working with Amazon Kinesis Data Streams
Best Practices:
- Manage transaction, partial, and intermittent failures: Transaction failures might occur when components are under high load. Partial failures can occur during batch processing, while intermittent failures might occur due to network or other transient issues.
- Manage duplicate and unwanted events: Duplicate events can occur when a request is retried, when multiple consumers process the same message from a queue or stream, or when a request is sent twice at different times with the same parameters. Design your applications so that processing multiple identical requests has the same effect as processing a single request. Events that do not adhere to your schema should be discarded.
- Consider scaling patterns at burst rates: In addition to your baseline performance, evaluate how your workload handles initial burst rates, whether the peaks are expected or unexpected.
- Orchestrate long-running transactions: Long-running transactions can be processed by one or multiple components. Favor state machines for long-running transactions over handling them within application code in a single component or across multiple synchronous dependency call chains.
Improvement Plan
Manage transaction, partial, and intermittent failures
- When responding to callers in fail-fast scenarios and under performance degradation, tell the caller via headers or metadata when they can retry. AWS SDKs implement this by default.
- Implement similar logic in your dependency layer when calling your own dependent services.
- For downstream calls, align AWS and third-party SDK retries, backoffs, and TCP and HTTP timeouts with your component timeout to help you decide when to stop retrying.
Error Retries and Exponential Backoff in AWS
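The retry guidance above can be sketched in a few lines. This is a minimal illustration, not the behavior of any particular SDK: the function name `call_with_retries`, its default values, and the choice of full jitter are assumptions made for the example. The point it shows is that retries stop either when attempts run out or when the next wait would exceed the component's own timeout budget.

```python
import random
import time


def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      deadline_seconds=3.0, sleep=time.sleep, clock=time.monotonic):
    """Call `operation` until it succeeds, attempts run out, or the deadline would pass.

    All names and defaults here are hypothetical; tune them to your component timeout.
    """
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:  # in real code, catch only retryable errors
            last_error = error
        if attempt == max_attempts - 1:
            break  # no attempts left, so do not wait again
        # Exponential backoff with full jitter: wait a random time up to the cap.
        delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        if clock() - start + delay > deadline_seconds:
            break  # the next wait would exceed the component's time budget
        sleep(delay)
    raise last_error
```

Passing `sleep` and `clock` as parameters is only a convenience for testing; a caller would normally rely on the defaults.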
- AWS Lambda allows failed transactions to be sent to a dedicated Amazon SQS dead-letter queue on a per function basis.
- Amazon Kinesis Data Streams and Amazon DynamoDB Streams retry the entire batch of items. Repeated errors block processing of the affected shard until the error is resolved or the items expire.
- Within AWS Lambda, you can configure Maximum Retry Attempts, Maximum Record Age, and Destination on Failure to control retries while processing data records and to remove poison-pill messages from the batch by sending their metadata to an Amazon SQS dead-letter queue for further analysis.
When to use dead-letter queues
Example: Serverless Application Repository DLQ redriver
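As a rough sketch of the stream settings described above, the helper below builds the parameters for a Lambda event source mapping. The helper name, the default values, and the ARNs used in the usage note are hypothetical. The parameter names are those accepted by the Lambda `CreateEventSourceMapping` API, which boto3 exposes as `create_event_source_mapping`. The sketch only builds the dictionary and does not call AWS.

```python
def build_stream_mapping_params(stream_arn, function_name, failure_queue_arn,
                                max_retry_attempts=3, max_record_age_seconds=3600):
    """Build event source mapping settings that bound retries for a stream.

    Hypothetical helper: the defaults are illustrative, not recommendations.
    """
    if max_retry_attempts < 0:
        raise ValueError("max_retry_attempts must be zero or greater in this sketch")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        # Stop retrying a failing batch after this many attempts ...
        "MaximumRetryAttempts": max_retry_attempts,
        # ... or once its records are older than this many seconds.
        "MaximumRecordAgeInSeconds": max_record_age_seconds,
        # Metadata about discarded batches is sent here for later analysis.
        "DestinationConfig": {"OnFailure": {"Destination": failure_queue_arn}},
    }
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("lambda").create_event_source_mapping(**params)`.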
Manage duplicate and unwanted events
- Depending on the final destination, duplicate events might write to the same record with the same content instead of generating a duplicate entry, and therefore may not require additional safeguards.
- Unique attributes that identify a request, also known as idempotency tokens, can be business-specific, such as a transaction ID, payment ID, or booking ID, or generic, such as an opaque random alphanumeric string, a unique correlation identifier, or a hash of the content.
- Use Amazon DynamoDB as a control database to store transactions and idempotency tokens.
- Example: Use a conditional write to fail a refund operation if a payment reference has already been refunded, thus signaling to the application that it is a duplicate transaction. The application can then catch this exception and return the same result to the customer as if the refund was processed successfully.
- AWS Lambda functions can use one or more event sources to trigger invocation. If events can be issued by external sources, such as your customers or machine-generated traffic, use a schema to validate that each event conforms to what you expect to process, either within your application code or at the event source when applicable.
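The refund example in the list above can be sketched without any AWS dependency. The in-memory `InMemoryControlTable` below is a stand-in for a DynamoDB control table, and all class and function names are hypothetical. With DynamoDB, the equivalent of `put_if_absent` would be a put with an `attribute_not_exists` condition expression, which fails with `ConditionalCheckFailedException` when the item already exists. The call to the payment provider is omitted.

```python
class DuplicateTransaction(Exception):
    """Raised when an idempotency token has already been recorded."""


class InMemoryControlTable:
    """Stand-in for a control table that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, item):
        # Mimics a conditional write: fail if the key already exists.
        if key in self._items:
            raise DuplicateTransaction(key)
        self._items[key] = item

    def get(self, key):
        return self._items[key]


def refund(table, payment_reference, amount):
    """Record a refund once; duplicates get the originally stored result."""
    result = {"payment_reference": payment_reference,
              "amount": amount,
              "status": "REFUNDED"}
    try:
        table.put_if_absent(payment_reference, result)
    except DuplicateTransaction:
        # Duplicate request: answer as if the refund had just succeeded.
        return table.get(payment_reference)
    return result
```

Here the payment reference doubles as the idempotency token, so a retried request cannot produce a second refund record.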
Community Hero: Your AWS Lambda function might execute twice
Stripe Idempotent tokens generated by consumers
Setting Time-to-live for Amazon DynamoDB records/idempotent tokens
Request Validation in Amazon API Gateway
Matching events patterns in EventBridge
Filtering messages with Amazon SNS
JSON Schema Implementations
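Validating events inside application code might look like the sketch below. The field names in `EXPECTED_FIELDS` and the response shape are invented for illustration, and the check is hand-rolled to keep the example dependency-free. A JSON Schema library, or validation at the event source, would serve the same purpose.

```python
# Hypothetical event contract: field name -> accepted Python type(s).
EXPECTED_FIELDS = {"booking_id": str, "amount": (int, float), "currency": str}


def validate_event(event):
    """Return a list of problems; an empty list means the event matches."""
    if not isinstance(event, dict):
        return ["event is not an object"]
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(event[name], bool) or not isinstance(event[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return problems


def handler(event, context=None):
    """Discard events that do not match the contract before doing any work."""
    problems = validate_event(event)
    if problems:
        return {"accepted": False, "problems": problems}
    # Business logic would run here for well-formed events.
    return {"accepted": True}
```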
Consider scaling patterns at burst rates
- Load test using bursts of requests over a short period of time, and introduce delays between bursts to allow your components to recover from unexpected load. This helps you future-proof your workload for key events where you are uncertain how much variance in requests to expect.
AWS Marketplace: Gatling FrontLine Load Testing
Amazon Partner: BlazeMeter Load Testing
Amazon Partner: Apica Load Testing
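To make the burst pattern concrete, the sketch below computes when each request of a bursty load test would be sent. It is a planning aid under assumed parameters and not a replacement for the load-testing tools listed above. The function name and arguments are hypothetical.

```python
def burst_schedule(bursts, requests_per_burst, burst_seconds, pause_seconds):
    """Return send offsets, in seconds from test start, for every request.

    Requests are spread evenly within each burst, and each burst is
    followed by a pause that gives components time to recover.
    """
    if requests_per_burst <= 0:
        raise ValueError("requests_per_burst must be positive")
    spacing = burst_seconds / requests_per_burst
    offsets = []
    for burst in range(bursts):
        burst_start = burst * (burst_seconds + pause_seconds)
        for i in range(requests_per_burst):
            offsets.append(burst_start + i * spacing)
    return offsets
```

For example, `burst_schedule(2, 4, 2.0, 10.0)` plans two 2-second bursts of four requests each, separated by a 10-second recovery pause.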
- Review Amazon API Gateway account level limits such as number of requests per second across all APIs.
- Review AWS Lambda function concurrency reservations and ensure that there is enough pool capacity to allow other functions to scale.
- Review Amazon CloudFront requests per second per distribution.
- Review AWS Lambda@Edge requests per second and concurrency limit.
- Review AWS IoT Message Broker concurrent requests per second.
- Review EventBridge API requests and target invocations limit.
- Review Amazon Cognito API limits.
- Review Amazon DynamoDB throughput, indexes, and request rates limits.
AWS AppSync throttle rate limits
Amazon CloudFront requests rates limits
AWS Lambda@Edge request rates limits
AWS IoT Message Broker connections and requests limit
EventBridge request rates limits
Amazon Cognito request rates limits
Amazon DynamoDB request rates and resources limit
AWS Services General Limits
- For AWS Lambda, review Duration, Errors, Throttles, ConcurrentExecutions, and UnreservedConcurrentExecutions
- For Amazon API Gateway, review Latency, IntegrationLatency, 5xxError, and 4xxError
- For Application Load Balancer, HTTPCode_ELB_5XX_Count, RejectedConnectionCount, HTTPCode_Target_5XX_Count, UnHealthyHostCount, LambdaInternalError, LambdaUserError
- For AWS AppSync, 5XX and Latency
- For Amazon SQS, ApproximateAgeOfOldestMessage
- For Amazon Kinesis Data Streams, ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded, GetRecords.IteratorAgeMilliseconds, PutRecord.Success, PutRecords.Success (if using Kinesis Producer Library) and GetRecords.Success
- For Amazon SNS, NumberOfNotificationsFailed, NumberOfNotificationsFilteredOut-InvalidAttributes
- For Amazon SES, Rejects, Bounces, Complaints, Rendering Failures
- For AWS Step Functions, ExecutionThrottled, ExecutionsFailed, ExecutionsTimedOut
- For Amazon EventBridge, FailedInvocations, ThrottledRules
- For Amazon S3, 5xxErrors, TotalRequestLatency
- For Amazon DynamoDB, ReadThrottleEvents, WriteThrottleEvents, SystemErrors, ThrottledRequests, UserErrors
AWS Lambda CloudWatch Metrics
Amazon API Gateway CloudWatch Metrics
AWS Application Load Balancer CloudWatch Metrics
AWS AppSync CloudWatch Metrics
Amazon SQS CloudWatch Metrics
Amazon Kinesis Data Streams CloudWatch Metrics
Amazon SNS CloudWatch Metrics
Amazon SES CloudWatch Metrics
AWS Step Functions CloudWatch Metrics
Amazon EventBridge CloudWatch Metrics
Amazon S3 CloudWatch Metrics
Amazon DynamoDB CloudWatch Metrics
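One way to act on the metrics listed above is to alarm on them. The sketch below builds the parameters for a CloudWatch alarm on Lambda throttling, which CloudWatch publishes as the `Throttles` metric in the `AWS/Lambda` namespace. The helper name, alarm name, threshold, and period are assumptions. The keys are parameters of the CloudWatch `PutMetricAlarm` API, which boto3 exposes as `put_metric_alarm`. No AWS call is made here.

```python
def lambda_throttle_alarm_params(function_name, threshold=1, period_seconds=60):
    """Build alarm settings that fire when a function is throttled.

    Hypothetical helper: threshold and period are illustrative values.
    """
    return {
        "AlarmName": f"{function_name}-throttles",
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No datapoints means no throttling, so do not treat gaps as failures.
        "TreatMissingData": "notBreaching",
    }
```

With boto3 available, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. The same shape applies to the other metrics by changing the namespace, metric name, and dimensions.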
Orchestrate long-running transactions
- AWS Step Functions lets you coordinate multiple AWS services into Serverless workflows via state machines.
- Within Step Functions, you can set separate retries, backoff rates, max attempts, intervals, and timeouts for every step of your state machine using a declarative language.
State Machine Error handling example
Example AWS Step Functions State Machine via AWS SAM
- Example: The Refund-Flight function is invoked only if the Allocate-Seat function fails and after three retry attempts at a 1-second interval.
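The Allocate-Seat and Refund-Flight example can be expressed as an Amazon States Language definition. The sketch below builds it as a Python dictionary for readability. The function name, the 10-second timeout, and the ARNs in the usage note are assumptions. `Retry`, `Catch`, `IntervalSeconds`, `MaxAttempts`, `BackoffRate`, and `TimeoutSeconds` are standard Amazon States Language fields.

```python
import json


def booking_state_machine(allocate_seat_arn, refund_flight_arn):
    """Return a state machine that refunds only after seat allocation retries fail."""
    return {
        "StartAt": "Allocate-Seat",
        "States": {
            "Allocate-Seat": {
                "Type": "Task",
                "Resource": allocate_seat_arn,
                "TimeoutSeconds": 10,  # illustrative per-step timeout
                "Retry": [{
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 1,   # wait one second before a retry
                    "MaxAttempts": 3,       # three retries after the first try
                    "BackoffRate": 1.0,     # keep the interval constant
                }],
                # Reached only once the retries above are exhausted.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Refund-Flight"}],
                "End": True,
            },
            "Refund-Flight": {
                "Type": "Task",
                "Resource": refund_flight_arn,
                "End": True,
            },
        },
    }
```

Calling `json.dumps(booking_state_machine(allocate_arn, refund_arn))` yields the JSON document that a state machine definition expects.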
- For high durability within your state machines, use AWS Step Functions service integrations to send failed transactions to a dead letter queue of your choice as the final step.
- For low latency and no strict success rate requirements, you can use function composition with AWS Lambda functions calling other functions asynchronously.
- Failed transactions are retried at least twice, depending on the event source, and then sent to each function’s dead-letter queue (for example, Amazon SQS or Amazon SNS).
- Set alerts on the number of messages in the dead-letter queue, and either re-drive messages back to the workflow or disable parts of the workflow temporarily.
Sending failed transactions to Amazon SQS within Step Functions State Machine
Serverless Hero: Function composition using asynchronous invocations
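A final step that sends a failed transaction to a queue can use the Step Functions service integration for Amazon SQS. The sketch below builds such a state. The helper name and the queue URL in the usage note are hypothetical. The resource `arn:aws:states:::sqs:sendMessage` and the `QueueUrl` and `MessageBody` parameters belong to that integration.

```python
def send_to_dead_letter_queue_state(queue_url):
    """Return a terminal Task state that forwards its input to an SQS queue."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {
            "QueueUrl": queue_url,
            # The ".$" suffix tells Step Functions to read the value from the
            # state input; "$" selects the entire input document.
            "MessageBody.$": "$",
        },
        "End": True,
    }
```

A state like this would typically be the `Next` target of a `Catch` clause on the steps whose failures you want to keep for later analysis.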