This content is outdated. This version of the Well-Architected Framework is now found at: https://docs.aws.amazon.com/en_us/wellarchitected/2022-03-31/framework/reliability.html

REL 13: How do you plan for disaster recovery (DR)?

Having backups and redundant workload components in place is the start of your DR strategy. RTO and RPO are your objectives for restoration of availability. Set these based on business needs. Implement a strategy to meet these objectives, considering locations and function of workload resources and data.

Resources

AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
AWS re:Invent 2019: Backup-and-restore and disaster-recovery solutions with AWS (STG208)
What Is AWS Backup?
Remediating Noncompliant AWS Resources by AWS Config Rules
AWS Systems Manager Automation
AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
Amazon RDS: Cross-region backup copy
RDS: Replicating a Read Replica Across Regions
S3: Cross-Region Replication
Route 53: Configuring DNS Failover
CloudEndure Disaster Recovery
How do I implement an Infrastructure Configuration Management solution on AWS?
CloudEndure Disaster Recovery to AWS
AWS Marketplace: products that can be used for disaster recovery
APN Partner: partners that can help with disaster recovery

Best Practices:

Define recovery objectives for downtime and data loss: The workload has a recovery time objective (RTO) and recovery point objective (RPO).
Use defined recovery strategies to meet the recovery objectives: A disaster recovery (DR) strategy has been defined to meet objectives.
Test disaster recovery implementation to validate the implementation: Regularly test failover to DR to ensure that RTO and RPO are met.
Manage configuration drift at the DR site or region: Ensure that the infrastructure, data, and configuration are as needed at the DR site or region. For example, check that AMIs and service quotas are up to date.
Automate recovery: Use AWS or third-party tools to automate system recovery and route traffic to the DR site or region.

Improvement Plan

Define recovery objectives for downtime and data loss

Establish categories of need for your workloads: Identify the primary business driver and enabler workloads. Identify which workloads are internal-only tools and which are externally visible. Identify the business impact of downtime for each workload. Create five or fewer categories and refine the range of your recovery time objective (RTO) and recovery point objective (RPO) requirements for each.

Identify the business mission-critical workloads, typically the main revenue drivers and enablers
Identify the business-important workloads, typically reporting and runtime workload modification tools (like content management systems)
Identify the non-business driving workloads where data may be difficult to recreate (like test systems with cleansed data)
Identify the non-business driving workloads where data is less difficult or easy to recreate (like development environments)
Identify other categories as needed
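
The categories above can be recorded as data so that every workload resolves to an explicit RTO and RPO. The following is a minimal sketch, not a prescribed implementation: the category keys, the workload names, and every duration are hypothetical placeholders that should come from your own business impact analysis.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Recovery targets shared by all workloads in one category."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of data loss


# Hypothetical targets for the four categories listed above.
# Replace every value with numbers agreed with the business.
OBJECTIVES = {
    "mission-critical": RecoveryObjective(timedelta(minutes=15), timedelta(minutes=1)),
    "business-important": RecoveryObjective(timedelta(hours=4), timedelta(hours=1)),
    "hard-to-recreate": RecoveryObjective(timedelta(hours=24), timedelta(hours=12)),
    "easy-to-recreate": RecoveryObjective(timedelta(days=3), timedelta(days=1)),
}

# Hypothetical workload inventory: workload name -> category.
WORKLOADS = {
    "checkout-api": "mission-critical",
    "cms": "business-important",
    "test-cleansed-data": "hard-to-recreate",
    "dev-sandbox": "easy-to-recreate",
}


def objective_for(workload: str) -> RecoveryObjective:
    """Return the RTO and RPO that apply to a workload through its category."""
    return OBJECTIVES[WORKLOADS[workload]]
```

Keeping the targets per category rather than per workload keeps the number of distinct objectives at five or fewer.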

Use defined recovery strategies to meet the recovery objectives

Establish strategies to achieve the recovery time objective (RTO) and recovery point objective (RPO) for each category: If your workload needs a multi-region strategy, choose one of the following strategies. They are listed in increasing order of complexity and decreasing order of RTO and RPO. Backup and restore to another AWS Region adds a layer of assurance that data will be available when needed. For the other strategies, weigh their complexity and cost against what you can achieve using multiple Availability Zones within a single AWS Region.
AWS re:Invent 2019: Backup-and-restore and disaster-recovery solutions with AWS (STG208)
Amazon RDS: Cross-region backup copy
RDS: Replicating a Read Replica Across Regions
S3: Cross-Region Replication

Backup and restore (RPO in hours, RTO in 24 hours or less): Back up your data and applications into the DR Region. Restore this data when necessary to recover from a disaster.
Pilot light (RPO in minutes, RTO in hours): Maintain a minimal version of an environment always running the most critical core elements of your system in the DR Region. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
Warm standby (RPO in seconds, RTO in minutes): Maintain a scaled-down version of a fully functional environment always running in the DR Region. Business-critical systems are fully duplicated and are always on, but with a scaled-down fleet. When the time comes for recovery, the system is scaled up quickly to handle the production load.
Multi-region active-active (RPO is none or possibly seconds, RTO in seconds): Your workload is deployed to, and actively serving traffic from, multiple AWS Regions. This strategy requires you to synchronize users and data across the Regions that you are using. When the time comes for recovery, use services like Amazon Route 53 or AWS Global Accelerator to route your user traffic to where your workload is healthy.
AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
Build a serverless multi-region, active-active backend solution in an hour
Multi-region serverless backend — reloaded
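
Because the four strategies are ordered from least to most complex, choosing one can be framed as picking the first entry whose typical RPO and RTO fit within the required targets. The sketch below illustrates that selection. The numeric values are assumptions: they read "hours", "minutes", and "seconds" from the list above as one unit each, and the function name is hypothetical. Substitute figures measured for your own workload.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rpo: timedelta  # assumed typical achievable RPO
    rto: timedelta  # assumed typical achievable RTO


# Ordered by increasing complexity, as in the list above.
# Durations are illustrative readings of the stated orders of magnitude.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=1), timedelta(hours=24)),
    Strategy("pilot light", timedelta(minutes=1), timedelta(hours=1)),
    Strategy("warm standby", timedelta(seconds=1), timedelta(minutes=1)),
    Strategy("multi-region active-active", timedelta(0), timedelta(seconds=1)),
]


def simplest_strategy(required_rpo: timedelta, required_rto: timedelta) -> Optional[str]:
    """Return the least complex strategy that meets both targets, or None."""
    for strategy in STRATEGIES:
        if strategy.rpo <= required_rpo and strategy.rto <= required_rto:
            return strategy.name
    return None
```

For example, a required RPO of 5 minutes and RTO of 2 hours selects "pilot light" under these assumed values, because backup and restore cannot meet the RPO.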

Test disaster recovery implementation to validate the implementation

Engineer your workloads for recovery. Regularly test your recovery paths: Recovery Oriented Computing (ROC) identifies the characteristics in systems that enhance recovery. These characteristics are: isolation and redundancy, system-wide ability to roll back changes, ability to monitor and determine health, ability to provide diagnostics, automated recovery, modular design, and ability to restart. Exercise the recovery path to ensure that you can accomplish the recovery in the specified time to the specified state. Use your runbooks during this recovery to document problems and find solutions for them before the next test.
The Berkeley/Stanford recovery-oriented computing project
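
One way to check that a recovery test reached "the specified time" and "the specified state" is to compute the achieved RTO and RPO from timestamps recorded during the exercise. This is a minimal sketch under stated assumptions: the three timestamps and the names DrillResult and evaluate_drill are hypothetical, and how you capture the timestamps depends on your runbooks and monitoring.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps captured during one recovery exercise (hypothetical fields)."""
    failure_declared_at: datetime      # when the simulated disaster began
    service_restored_at: datetime      # when the workload served traffic again
    last_recovered_write_at: datetime  # newest data present after recovery


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict:
    """Compare the downtime and data loss of a drill against the objectives."""
    achieved_rto = result.service_restored_at - result.failure_declared_at
    achieved_rpo = result.failure_declared_at - result.last_recovered_write_at
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }
```

Recording the result of each test alongside the runbook makes it easier to see whether recovery times are trending toward or away from the objectives.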

Use CloudEndure Disaster Recovery to implement and test your DR strategy
Testing the Disaster Recovery Solution with CloudEndure
CloudEndure Disaster Recovery
CloudEndure Disaster Recovery to AWS

Manage configuration drift at the DR site or region

Ensure that your delivery pipelines deliver to both your primary and backup sites: Delivery pipelines for deploying applications into production must distribute to all the specified disaster recovery strategy locations, including dev and test environments.

Enable AWS Config to track potential drift locations: Use AWS Config rules to create systems that enforce your disaster recovery strategies and generate alerts when they detect drift.
Remediating Noncompliant AWS Resources by AWS Config Rules
AWS Systems Manager Automation
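
As an illustration of generating alerts from AWS Config, the sketch below lists the rules that currently report NON_COMPLIANT by calling the DescribeComplianceByConfigRule API through a boto3-style client. It assumes that Config rules covering your DR resources already exist. The function name is hypothetical, and the client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
from typing import Iterable, List, Optional


def noncompliant_rules(config, rule_names: Optional[Iterable[str]] = None) -> List[str]:
    """Return names of AWS Config rules whose compliance is NON_COMPLIANT.

    `config` is expected to behave like boto3.client("config").
    """
    kwargs = {"ComplianceTypes": ["NON_COMPLIANT"]}
    if rule_names:
        kwargs["ConfigRuleNames"] = list(rule_names)

    names: List[str] = []
    while True:
        page = config.describe_compliance_by_config_rule(**kwargs)
        names.extend(
            rule["ConfigRuleName"]
            for rule in page.get("ComplianceByConfigRules", [])
        )
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # fetch the next page of results
```

A possible use is to run this against the DR Region on a schedule, for example with boto3.client("config", region_name=...), and raise an alert whenever the returned list is not empty.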

Use AWS CloudFormation to deploy your infrastructure: AWS CloudFormation can detect drift between what your CloudFormation templates specify and what is actually deployed.
AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
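
Drift detection on a stack can also be started and polled programmatically. The following sketch uses the CloudFormation DetectStackDrift, DescribeStackDriftDetectionStatus, and DescribeStackResourceDrifts operations through a boto3-style client passed in by the caller. The function name, polling interval, and poll limit are assumptions, and for brevity only the first page of drift results is read.

```python
import time
from typing import List


def detect_drifted_resources(cfn, stack_name: str,
                             poll_seconds: float = 5.0,
                             max_polls: int = 60) -> List[str]:
    """Run drift detection on a stack and return drifted logical resource IDs.

    `cfn` is expected to behave like boto3.client("cloudformation").
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Drift detection is asynchronous, so poll until it leaves IN_PROGRESS.
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"drift detection on {stack_name} did not finish")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    # Only the first page of results is read in this sketch.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    return [drift["LogicalResourceId"] for drift in drifts]
```

Running a check like this against the stacks in the DR Region before each recovery test is one way to catch changes made outside of the templates.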

Automate recovery

Automate recovery paths: When recovery times must be short, as in high availability scenarios, you cannot depend on human judgment and manual action. The system should recover automatically in every failure situation.
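
One common way to route traffic to the DR Region without human action is Amazon Route 53 failover routing, where a health check on the primary endpoint decides which record is served. The sketch below only builds the ChangeBatch structure for a PRIMARY and SECONDARY record pair. The function name, set identifiers, record type, TTL, and all argument values are assumptions for illustration, not a prescribed configuration.

```python
from typing import Optional


def failover_change_batch(record_name: str,
                          primary_ip: str,
                          secondary_ip: str,
                          primary_health_check_id: str,
                          ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY failover A records."""

    def record(role: str, ip: str, health_check_id: Optional[str] = None) -> dict:
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-region",  # hypothetical identifier
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            # Route 53 answers with the secondary record when this check fails.
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing between primary and DR Region",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }
```

The returned dictionary can be passed as the ChangeBatch argument of change_resource_record_sets on a boto3 Route 53 client, together with your hosted zone ID. A short TTL limits how long resolvers keep returning the failed primary address.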

Use CloudEndure Disaster Recovery for automated Failover and Failback: CloudEndure Disaster Recovery continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region. In the case of a disaster, you can instruct CloudEndure Disaster Recovery to automatically launch thousands of your machines in their fully provisioned state in minutes.
Performing a Disaster Recovery Failover and Failback
CloudEndure Disaster Recovery