Resiliency

The ability of a system to recover from failures induced by load, attacks, and component faults.

A resilient workload can recover when stressed by load (more requests for service), attacks (whether accidental, through a bug, or deliberate), and the failure of any of its components.

A resilient workload not only recovers, but recovers within a desired amount of time, often called the recovery time objective (RTO). Within a workload, the goal is often to avoid degradation and to keep servicing requests while a component recovers. The study and practice of this approach is known as Recovery-Oriented Computing.

Define the allowable time of recovery.

Identify where your workload can use redundant components in parallel with no knowledge of past interactions ("state").
Identify where your workload can fail over to a backup component that will have minimal data loss with respect to previous requests.
Identify where your workload must restart to recover functionality.
Implement automation to replace redundant components automatically when they fail.
Implement automation to fail over to backup components when the primary component fails.
Implement automation to restart components that cannot be made redundant or fail over.
Measure the time of recovery for all failure modes.
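
The failover and measurement steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the names Component, call_with_failover, and RTO_SECONDS are hypothetical, the 5-second objective is an arbitrary example value, and the failure is simulated with a flag rather than a real outage.

```python
import time
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for a service endpoint."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        # Simulate an outage by raising when the component is unhealthy.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled {request}"


def call_with_failover(primary, backup, request, clock=time.monotonic):
    """Try the primary; if it fails, fail over to the backup.

    Returns (response, recovery_seconds). recovery_seconds is 0.0 when
    the primary served the request, otherwise the time from the start
    of the call until the backup answered.
    """
    started = clock()
    try:
        return primary.handle(request), 0.0
    except ConnectionError:
        response = backup.handle(request)
        return response, clock() - started


# Assumed recovery time objective for this failure mode (example value).
RTO_SECONDS = 5.0

primary = Component("primary", healthy=False)  # simulated failed primary
backup = Component("backup")

response, recovery = call_with_failover(primary, backup, "req-1")
within_rto = recovery <= RTO_SECONDS
print(response, within_rto)
```

Recording the measured recovery time for each failure mode and comparing it against the defined objective shows whether the automation actually meets the RTO, rather than assuming that it does.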