Resiliency
The ability of a system to recover from a failure induced by load, attacks, or component failures.
A resilient workload can recover when stressed by load (more requests for service), attacks (either accidental, through a bug, or deliberate), and the failure of any of its components.
A resilient workload not only recovers, but recovers within a desired amount of time. This target is often called the recovery time objective (RTO). Within a workload, it is often desirable not to degrade at all, but to keep servicing the workload's requests while a component recovers. The study and practice of this approach is known as Recovery-Oriented Computing.
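As a minimal illustration of checking a recovery against an RTO, the sketch below compares the time between a failure and its recovery with the objective. The function name `met_rto`, the timestamps, and the five-minute objective are assumptions made for this example only.

```python
from datetime import datetime, timedelta


def met_rto(failed_at: datetime, recovered_at: datetime, rto: timedelta) -> bool:
    """Return True if the recovery took no longer than the RTO."""
    return (recovered_at - failed_at) <= rto


# Hypothetical incident: the component was down for 4 minutes 30 seconds.
failed_at = datetime(2024, 1, 1, 12, 0, 0)
recovered_at = datetime(2024, 1, 1, 12, 4, 30)

print(met_rto(failed_at, recovered_at, timedelta(minutes=5)))  # True
```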
- Define the allowable time of recovery.
- Identify where your workload can use redundant components in parallel with no knowledge of past interactions ("state").
- Identify where your workload can fail over to a backup component that will have minimal data loss with respect to previous requests.
- Identify where your workload must restart to recover functionality.
- Implement automation to replace redundant components automatically when they fail.
- Implement automation to fail over to backup components when the primary component fails.
- Implement automation to restart components that cannot be made redundant or fail over.
- Measure the time of recovery for all failure modes.
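The steps above can be sketched in code. The example below is only an illustration under stated assumptions: the component names, the three recovery functions, and the 30-second RTO are hypothetical, and the recovery actions are simulated with short sleeps where a real workload would call its orchestration tooling. It maps each component to the recovery strategy identified for it, runs the automated recovery, and measures the recovery time for each failure mode against the RTO.

```python
import time

RTO_SECONDS = 30.0  # assumed allowable time of recovery


# Hypothetical recovery actions, simulated here with short sleeps.
def replace_instance() -> None:
    time.sleep(0.01)  # stand-in for launching a fresh stateless replica


def fail_over_to_backup() -> None:
    time.sleep(0.02)  # stand-in for promoting a backup component


def restart_component() -> None:
    time.sleep(0.03)  # stand-in for restarting a single instance


# One recovery strategy per component, as identified in the list above.
RECOVERY_PLAN = {
    "web-tier": replace_instance,     # redundant and stateless
    "database": fail_over_to_backup,  # stateful, with a backup
    "scheduler": restart_component,   # cannot be made redundant
}


def recover(component: str) -> float:
    """Run the component's recovery action and return the elapsed seconds."""
    start = time.monotonic()
    RECOVERY_PLAN[component]()
    return time.monotonic() - start


def measure_all() -> dict:
    """Measure the recovery time for every failure mode in the plan."""
    return {name: recover(name) for name in RECOVERY_PLAN}


results = measure_all()
for name, seconds in results.items():
    status = "within RTO" if seconds <= RTO_SECONDS else "exceeded RTO"
    print(f"{name}: {seconds:.3f}s ({status})")
```

Keeping the plan as a plain mapping makes it easy to see which components still lack an automated recovery path, and running the measurement regularly shows whether each failure mode still recovers within the objective.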