This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 13: How do you plan for disaster recovery (DR)?

Having backups and redundant workload components in place is the start of your DR strategy. Recovery time objective (RTO) and recovery point objective (RPO) are your objectives for restoring availability: how long the workload can be down and how much data it can lose. Set these based on business needs. Implement a strategy to meet these objectives, considering the locations and function of workload resources and data.

Resources

AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
AWS re:Invent 2019: Backup-and-restore and disaster-recovery solutions with AWS (STG208)
What Is AWS Backup?
Remediating Noncompliant AWS Resources by AWS Config Rules
AWS Systems Manager Automation
AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
Amazon RDS: Cross-region backup copy
RDS: Replicating a Read Replica Across Regions
S3: Cross-Region Replication
Route 53: Configuring DNS Failover
CloudEndure Disaster Recovery
How do I implement an Infrastructure Configuration Management solution on AWS?
CloudEndure Disaster Recovery to AWS
AWS Marketplace: products that can be used for disaster recovery
APN Partner: partners that can help with disaster recovery

Best Practices:

Improvement Plan

Define recovery objectives for downtime and data loss

  • Establish categories of need for your workloads: Identify the primary business driver and enabler workloads. Identify which workloads are internal-only tools and which are externally visible. Identify the business impact of downtime for each workload. Create five or fewer categories, and refine the range of recovery time objective (RTO) and recovery point objective (RPO) requirements for each.
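A minimal sketch of how such categories might be recorded, assuming three hypothetical categories. The category names, the RTO and RPO values, and the `categorize` rule are illustrative assumptions, not values from this guidance.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryCategory:
    """A category of need with its recovery objectives."""
    name: str
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of data loss


# Hypothetical categories, ordered from most to least critical.
# The guidance above suggests keeping this list to five or fewer.
CATEGORIES = [
    RecoveryCategory("critical", rto=timedelta(minutes=15), rpo=timedelta(minutes=1)),
    RecoveryCategory("important", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    RecoveryCategory("standard", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
]


def categorize(business_driver: bool, externally_visible: bool) -> RecoveryCategory:
    """Assign a workload to a category using two of the questions above."""
    if business_driver:
        return CATEGORIES[0]
    if externally_visible:
        return CATEGORIES[1]
    return CATEGORIES[2]
```

In practice the business impact of downtime for each workload would drive the assignment; the two boolean inputs here only stand in for that assessment.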

Use defined recovery strategies to meet the recovery objectives

  • Establish strategies to achieve the recovery time objective (RTO) and recovery point objective (RPO) for each category: If a multi-Region strategy is necessary for your workload, choose one of the following strategies: backup and restore, pilot light, warm standby, or multi-Region active-active. They are listed in increasing order of complexity and decreasing order of RTO and RPO. Backup and restore to another AWS Region can add another layer of assurance that data will be available when needed. For the other strategies, weigh their potential complexity and cost against what you can achieve using multiple Availability Zones within a single AWS Region.
    AWS re:Invent 2019: Backup-and-restore and disaster-recovery solutions with AWS (STG208)
    Amazon RDS: Cross-region backup copy
    RDS: Replicating a Read Replica Across Regions
    S3: Cross-Region Replication
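The selection described above can be sketched as picking the least complex strategy whose achievable RTO and RPO fit the objectives for a category. The strategy names follow the four DR strategies commonly described in AWS guidance. The achievable RTO and RPO figures are hypothetical placeholders that you would replace with values measured for your own workload.

```python
from datetime import timedelta

# (strategy, assumed achievable RTO, assumed achievable RPO), ordered by
# increasing complexity. The timings are illustrative assumptions only.
STRATEGIES = [
    ("backup and restore", timedelta(hours=24), timedelta(hours=24)),
    ("pilot light", timedelta(hours=4), timedelta(minutes=15)),
    ("warm standby", timedelta(minutes=15), timedelta(minutes=1)),
    ("multi-Region active-active", timedelta(minutes=1), timedelta(seconds=5)),
]


def choose_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the least complex strategy that meets both objectives."""
    for name, achievable_rto, achievable_rpo in STRATEGIES:
        if achievable_rto <= rto and achievable_rpo <= rpo:
            return name
    raise ValueError("no listed strategy meets the requested RTO and RPO")
```

For example, under these assumed figures `choose_strategy(timedelta(minutes=30), timedelta(minutes=5))` selects warm standby, because the two simpler strategies cannot meet a 30-minute RTO.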

Test disaster recovery implementation to validate the implementation

  • Engineer your workloads for recovery. Regularly test your recovery paths: Recovery Oriented Computing (ROC) identifies the characteristics in systems that enhance recovery. These characteristics are: isolation and redundancy, system-wide ability to roll back changes, ability to monitor and determine health, ability to provide diagnostics, automated recovery, modular design, and ability to restart. Exercise the recovery path to ensure that you can accomplish the recovery in the specified time to the specified state. Use your runbooks during this recovery to document problems and find solutions for them before the next test.
    The Berkeley/Stanford recovery-oriented computing project
  • Use CloudEndure Disaster Recovery to implement and test your DR strategy
    Testing the Disaster Recovery Solution with CloudEndure
    CloudEndure Disaster Recovery
    CloudEndure Disaster Recovery to AWS
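To check that a recovery exercise finished "in the specified time to the specified state", one option is to record three timestamps during each test and compare them with the objectives. This is a sketch only; the `DrillResult` fields and the `evaluate_drill` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_at: datetime            # when the simulated failure was injected
    restored_at: datetime           # when the workload was serving again
    last_recovered_write: datetime  # newest data present after recovery


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare the recovery time and data loss achieved in a test with the objectives."""
    achieved_rto = result.restored_at - result.failure_at
    achieved_rpo = result.failure_at - result.last_recovered_write
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }
```

A result where either flag is false is a problem to record in the runbook and resolve before the next test.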

Manage configuration drift at the DR site or region

  • Ensure that your delivery pipelines deliver to both your primary and backup sites: Delivery pipelines that deploy applications into production must distribute to every location specified in your disaster recovery strategy, including dev and test environments.
  • Enable AWS Config to track potential drift locations: Use AWS Config rules to create systems that enforce your disaster recovery strategies and generate alerts when they detect drift.
    Remediating Noncompliant AWS Resources by AWS Config Rules
    AWS Systems Manager Automation
  • Use AWS CloudFormation to deploy your infrastructure: AWS CloudFormation can detect drift between what your CloudFormation templates specify and what is actually deployed.
    AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
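As a sketch of CloudFormation drift detection on a DR stack, the function below starts a drift detection run, waits for it to finish, and returns the logical IDs of resources that were modified or deleted. It assumes `cfn` is a boto3 CloudFormation client created for the DR Region. The function name, stack name, and polling parameters are hypothetical, and only the first page of drift results is read.

```python
import time


def detect_drifted_resources(cfn, stack_name, poll_seconds=5.0, max_polls=60):
    """Run drift detection on one stack and return the logical IDs that drifted.

    `cfn` is expected to be a boto3 CloudFormation client, for example
    boto3.client("cloudformation", region_name="us-west-2").
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Drift detection is asynchronous, so poll until it leaves the in-progress state.
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection for {stack_name} did not finish")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    if status.get("StackDriftStatus") != "DRIFTED":
        return []

    # Simplification: reads a single page; follow NextToken for large stacks.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    return [d["LogicalResourceId"] for d in drifts["StackResourceDrifts"]]
```

Running a check like this on a schedule against both the primary and DR stacks is one way to surface drift before a recovery event depends on the DR site.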

Automate recovery

  • Automate recovery paths: When recovery times must be short, as in high availability scenarios, you cannot rely on human judgment and action. The system should recover automatically in every situation.
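One common shape for an automated recovery path is a controller that triggers failover after several consecutive failed health checks, with no human in the loop. The sketch below is illustrative only: `FailoverController`, its threshold, and the two callables are hypothetical, and a real implementation would typically rely on managed health checks such as Route 53 DNS failover.

```python
from typing import Callable


class FailoverController:
    """Trigger failover after a run of consecutive failed health checks."""

    def __init__(
        self,
        check_primary: Callable[[], bool],
        fail_over: Callable[[], None],
        threshold: int = 3,
    ):
        self.check_primary = check_primary  # returns True when the primary is healthy
        self.fail_over = fail_over          # performs the switch to the DR site
        self.threshold = threshold
        self.consecutive_failures = 0
        self.failed_over = False

    def tick(self) -> bool:
        """Run one health check. Return True if this tick triggered failover."""
        if self.failed_over:
            return False
        if self.check_primary():
            # A healthy check resets the count so transient errors do not accumulate.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.fail_over()
            self.failed_over = True
            return True
        return False
```

Requiring several consecutive failures is a design choice that trades a slightly longer detection time for fewer spurious failovers; the threshold and check interval together must still fit within the RTO.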