REL 13: How do you plan for disaster recovery (DR)?
Having backups and redundant workload components in place is the start of your DR strategy. Recovery time objective (RTO) and recovery point objective (RPO) are your objectives for restoring the workload: RTO bounds how long the workload may be unavailable, and RPO bounds how much data may be lost. Set these based on business needs. Implement a strategy to meet these objectives, considering the locations and functions of workload resources and data.
Resources
AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
AWS re:Invent 2019: Backup-and-restore and disaster-recovery solutions with AWS (STG208)
What Is AWS Backup?
Remediating Noncompliant AWS Resources by AWS Config Rules
AWS Systems Manager Automation
AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
Amazon RDS: Cross-region backup copy
RDS: Replicating a Read Replica Across Regions
S3: Cross-Region Replication
Route 53: Configuring DNS Failover
CloudEndure Disaster Recovery
How do I implement an Infrastructure Configuration Management solution on AWS?
CloudEndure Disaster Recovery to AWS
AWS Marketplace: products that can be used for disaster recovery
APN Partner: partners that can help with disaster recovery
Best Practices:
- Define recovery objectives for downtime and data loss: The workload has a recovery time objective (RTO) and recovery point objective (RPO).
- Use defined recovery strategies to meet the recovery objectives: A disaster recovery (DR) strategy has been defined to meet objectives.
- Test disaster recovery implementation to validate the implementation: Regularly test failover to DR to ensure that RTO and RPO are met.
- Manage configuration drift at the DR site or region: Ensure that the infrastructure, data, and configuration are as needed at the DR site or region. For example, check that AMIs and service quotas are up to date.
- Automate recovery: Use AWS or third-party tools to automate system recovery and route traffic to the DR site or region.
Improvement Plan
Define recovery objectives for downtime and data loss
- Identify the business mission-critical workloads, typically the main revenue drivers and enablers
- Identify the business-important workloads, typically reporting and runtime workload modification tools (like content management systems)
- Identify the non-business-driving workloads where data may be difficult to recreate (like test systems with cleansed data)
- Identify the non-business-driving workloads where data is easier to recreate (like development environments)
- Identify other categories as needed
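The categories above can be recorded alongside their objectives so that every workload has an explicit RTO and RPO. The following is a minimal sketch only: the tier names mirror the list above, but the workload names and all durations are hypothetical placeholders, not recommended values. Set real values from business needs.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Tier(Enum):
    MISSION_CRITICAL = "business mission-critical"
    BUSINESS_IMPORTANT = "business-important"
    HARD_TO_RECREATE = "non-business-driving, data difficult to recreate"
    EASY_TO_RECREATE = "non-business-driving, data easier to recreate"


@dataclass(frozen=True)
class RecoveryObjective:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of data loss


# Placeholder durations for illustration only (assumptions, not guidance).
OBJECTIVES = {
    Tier.MISSION_CRITICAL: RecoveryObjective(timedelta(minutes=15), timedelta(minutes=1)),
    Tier.BUSINESS_IMPORTANT: RecoveryObjective(timedelta(hours=4), timedelta(minutes=30)),
    Tier.HARD_TO_RECREATE: RecoveryObjective(timedelta(hours=24), timedelta(hours=4)),
    Tier.EASY_TO_RECREATE: RecoveryObjective(timedelta(hours=72), timedelta(hours=24)),
}

# Hypothetical workload inventory mapping each workload to a tier.
WORKLOADS = {
    "checkout": Tier.MISSION_CRITICAL,
    "cms": Tier.BUSINESS_IMPORTANT,
    "dev-sandbox": Tier.EASY_TO_RECREATE,
}


def objective_for(workload: str) -> RecoveryObjective:
    """Look up the recovery objectives assigned to a workload via its tier."""
    return OBJECTIVES[WORKLOADS[workload]]
```

Keeping the objectives attached to tiers rather than to individual workloads means a new workload only needs a tier assignment to inherit an RTO and RPO.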
Use defined recovery strategies to meet the recovery objectives
AWS re:Invent 2019: Backup-and-restore and disaster-recovery solutions with AWS (STG208)
Amazon RDS: Cross-region backup copy
RDS: Replicating a Read Replica Across Regions
S3: Cross-Region Replication
- Backup and restore (RPO in hours, RTO in 24 hours or less): Back up your data and applications into the DR Region. Restore this data when necessary to recover from a disaster.
- Pilot light (RPO in minutes, RTO in hours): Maintain a minimal version of an environment always running the most critical core elements of your system in the DR Region. When the time comes for recovery, you can rapidly provision a full-scale production environment around the critical core.
- Warm standby (RPO in seconds, RTO in minutes): Maintain a scaled-down version of a fully functional environment always running in the DR Region. Business-critical systems are fully duplicated and are always on, but with a scaled-down fleet. When the time comes for recovery, the system is scaled up quickly to handle the production load.
- Multi-region active-active (RPO is none or possibly seconds, RTO in seconds): Your workload is deployed to, and actively serving traffic from, multiple AWS Regions. This strategy requires you to synchronize users and data across the Regions that you are using. When the time comes for recovery, use services like Amazon Route 53 or AWS Global Accelerator to route your user traffic to where your workload is healthy.
AWS re:Invent 2018: Architecture Patterns for Multi-Region Active-Active Applications (ARC209-R2)
Build a serverless multi-region, active-active backend solution in an hour
Multi-region serverless backend — reloaded
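The four strategies listed in this section can be compared against a workload's objectives with a small helper. This is a rough sketch, not an official selection rule: it assumes that "in hours", "in minutes", and "in seconds" mean roughly one unit of that magnitude at best, and the `Strategy` type and `select_strategy` function are hypothetical names introduced for illustration. Strategies are ordered from least to most demanding to operate, and the first one whose typical RTO and RPO fit the requirement is returned.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta  # assumed best-case order of magnitude
    typical_rpo: timedelta  # assumed best-case order of magnitude


# Ordered from least to most demanding. The floors are assumptions
# derived from the rough "hours / minutes / seconds" bands in the list.
STRATEGIES = [
    Strategy("Backup and restore", timedelta(hours=24), timedelta(hours=1)),
    Strategy("Pilot light", timedelta(hours=1), timedelta(minutes=1)),
    Strategy("Warm standby", timedelta(minutes=1), timedelta(seconds=1)),
    Strategy("Multi-region active-active", timedelta(seconds=1), timedelta(0)),
]


def select_strategy(required_rto: timedelta, required_rpo: timedelta) -> str:
    """Return the least demanding strategy that fits both objectives."""
    for strategy in STRATEGIES:
        if strategy.typical_rto <= required_rto and strategy.typical_rpo <= required_rpo:
            return strategy.name
    raise ValueError("no listed strategy meets the required RTO and RPO")


if __name__ == "__main__":
    print(select_strategy(timedelta(hours=4), timedelta(minutes=10)))
```

The result is a starting point for discussion only; cost, data synchronization complexity, and the outcome of recovery tests should drive the final choice.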
Test disaster recovery implementation to validate the implementation
The Berkeley/Stanford recovery-oriented computing project
Testing the Disaster Recovery Solution with CloudEndure
CloudEndure Disaster Recovery
CloudEndure Disaster Recovery to AWS
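A failover test is only useful if its outcome is compared with the objectives. The sketch below shows one way to do that from three timestamps recorded during a drill. The `DrillResult` type, its field names, and `evaluate_drill` are hypothetical; how the timestamps are captured depends on your tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class DrillResult:
    failure_at: datetime                # when the simulated disaster began
    last_recoverable_data_at: datetime  # newest data point present after recovery
    service_restored_at: datetime       # when the DR site served traffic again

    @property
    def achieved_rto(self) -> timedelta:
        return self.service_restored_at - self.failure_at

    @property
    def achieved_rpo(self) -> timedelta:
        return self.failure_at - self.last_recoverable_data_at


def evaluate_drill(result: DrillResult, rto: timedelta, rpo: timedelta) -> List[str]:
    """Return a list of missed objectives; an empty list means the drill passed."""
    misses = []
    if result.achieved_rto > rto:
        misses.append(f"RTO missed: took {result.achieved_rto}, objective {rto}")
    if result.achieved_rpo > rpo:
        misses.append(f"RPO missed: lost {result.achieved_rpo}, objective {rpo}")
    return misses
```

Recording these results for each drill makes it visible when a change to the workload has pushed recovery outside its objectives.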
Manage configuration drift at the DR site or region
Remediating Noncompliant AWS Resources by AWS Config Rules
AWS Systems Manager Automation
AWS CloudFormation: Detect Drift on an Entire CloudFormation Stack
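If the DR Region is deployed with AWS CloudFormation, drift detection can be scripted and run on a schedule. The following is a minimal sketch, assuming a boto3 CloudFormation client is passed in (for example `boto3.client("cloudformation", region_name=...)` pointed at the DR Region). The function name `find_drifted_resources`, the polling interval, and the timeout are illustrative choices, and only the first page of results is read.

```python
import time
from typing import List


def find_drifted_resources(cfn, stack_name: str,
                           poll_seconds: float = 5.0,
                           timeout_seconds: float = 300.0) -> List[str]:
    """Run drift detection on one stack and return drifted logical resource IDs.

    `cfn` is expected to be a boto3 CloudFormation client.
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Drift detection is asynchronous, so poll until it leaves the in-progress state.
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"drift detection timed out for {stack_name}")
        time.sleep(poll_seconds)

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    if status.get("StackDriftStatus") != "DRIFTED":
        return []

    # Sketch limitation: reads the first page only; follow NextToken for large stacks.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    return [d["LogicalResourceId"] for d in drifts["StackResourceDrifts"]]
```

Drift detection covers only resources managed by the stack. Items such as AMI currency and service quotas still need their own checks.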
Automate recovery
- Use CloudEndure Disaster Recovery for automated Failover and Failback: CloudEndure Disaster Recovery continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region. In the case of a disaster, you can instruct CloudEndure Disaster Recovery to automatically launch thousands of your machines in their fully provisioned state in minutes.
Performing a Disaster Recovery Failover and Failback
CloudEndure Disaster Recovery