Game day
A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to have the team perform the actions it would take if an exceptional event actually happened. Game days should be conducted regularly so that your team builds "muscle memory" on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.
In AWS, your game days can be carried out with replicas of your production environment using AWS CloudFormation. This enables you to test in a safe environment that resembles your production environment closely.
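One way to build such a replica is a parent template that nests the production templates unchanged. The following is a minimal sketch, not a definitive implementation: the S3 URLs, the logical IDs, and the `Criticality` tag key and value are all hypothetical placeholders. It only generates the template document and does not call AWS.

```python
import json

# Hypothetical S3 URLs of the templates that already build production.
PRODUCTION_TEMPLATE_URLS = {
    "Network": "https://example-bucket.s3.amazonaws.com/network.yaml",
    "Application": "https://example-bucket.s3.amazonaws.com/application.yaml",
}


def build_game_day_template(template_urls):
    """Return a parent template that nests each production template unchanged."""
    resources = {}
    for logical_id, url in template_urls.items():
        resources[logical_id + "Stack"] = {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                "TemplateURL": url,
                # Assumed tagging convention marking the stack as game day.
                "Tags": [{"Key": "Criticality", "Value": "game-day"}],
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Game day environment built from the production templates",
        "Resources": resources,
    }


if __name__ == "__main__":
    print(json.dumps(build_game_day_template(PRODUCTION_TEMPLATE_URLS), indent=2))
```

Because the parent template only references the production templates, a change to production is picked up by the next game day environment without copying anything.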
Game day process
A game day should involve all personnel who normally operate the workload. This includes all aspects of your business: operations, test, development, security, business operations, and business leaders.
Define the scenario you want to practice.
- Select the workload that you will be conducting the exercise upon.
- Select the personnel who interact with the workload in any capacity: operations, testing, development, security, business operations, and business leaders.
- Define the scenario by selecting the simulated events to practice. Good sources for simulated events include previous failures, known process or team weaknesses, and seasonal spikes in demand.
- For a comprehensive game day, prepare a series of simulated events to run over the course of the day.
- Select an individual from an operational role who can implement and run the simulation.
- Select an individual to observe the game day. They should be familiar with the communication tools, trouble ticket systems, "war rooms", and key individuals that need to be observed. For distributed teams you may need multiple observers.
- Select a process for making game day announcements.
- Prepare the environment for the game day
- Identify the AWS CloudFormation templates that are required to set up the game day environment. These should include the templates that set up your production environment. By building the game day environment through an "include" approach, you ensure your game day environment always looks like production.
- Add safeguards, such as permission requirements, criticality indicators, and execution verifiers to ensure you can only run game days in a game day environment.
- Permission requirements should authenticate the user, and ensure the user has the authority to carry out the action for the particular environment. This can be implemented using AWS Identity and Access Management (IAM).
- Criticality indicators should indicate the environment is a game day environment. This can be implemented by tagging.
- Execution verifiers should be part of any runbook and automation, checking for a specific change to verify the desired outcome has been achieved.
- You should ensure that a game day criticality indicator cannot be applied to your production environments, for example, by adding checks to your production continuous integration pipeline.
- Author a runbook for executing the simulation, using automation where possible
- Simulation runbooks should be clearly identified as causing failure or issues, and require an approval to run. The approval process should be tied into your identity system and require considered approval (not a simple y/n).
- The runbook and automation must check the criticality indicator to confirm that the test is being run against a game day environment. If the game day criticality indicator is not present in the environment, the runbook and automation should not be run.
- Schedule and notify all personnel who will be involved in the game day or affected by the simulation.
- Personnel should be aware of the scheduled day for the game day, and the environment. They should not know what events will be simulated, or their timing or order.
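The safeguards described above (criticality indicator, considered approval, and execution verifier) can be sketched as a guard around each simulated event. This is an illustrative sketch under stated assumptions: the tag key and value, the rule that the approver must type the environment name, and the `inject_failure` and `verify` callables are all hypothetical. Authentication and authorization are assumed to be enforced separately through IAM and are not shown.

```python
GAME_DAY_TAG_KEY = "Criticality"   # hypothetical tag key
GAME_DAY_TAG_VALUE = "game-day"    # hypothetical tag value


class SafeguardError(Exception):
    """Raised when a safeguard refuses to let the simulation run."""


def is_game_day_environment(tags):
    """Criticality indicator: True only if the environment is tagged as game day."""
    return tags.get(GAME_DAY_TAG_KEY) == GAME_DAY_TAG_VALUE


def approval_is_considered(typed_confirmation, environment_name):
    """Considered approval: the approver types the environment name, not y/n."""
    return typed_confirmation.strip() == environment_name


def run_simulation(environment_name, tags, typed_confirmation, inject_failure, verify):
    """Run one simulated event only if every safeguard passes.

    inject_failure() causes the fault; verify() returns True once the expected
    change is observable (the execution verifier). Both are caller-supplied.
    """
    if not is_game_day_environment(tags):
        raise SafeguardError(f"{environment_name} is not tagged as a game day environment")
    if not approval_is_considered(typed_confirmation, environment_name):
        raise SafeguardError("approval text does not match the environment name")
    inject_failure()
    if not verify():
        raise SafeguardError("execution verifier did not observe the expected change")
    return True
```

Both checks run before `inject_failure()` is called, so an untagged environment or a careless approval stops the runbook before any fault is introduced.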
Execute your simulation
- Use a separate room or location to run the simulated events.
- Announce the start of the game day.
- Run the simulated events over the course of the day. Be aware of real production system status, and end the game day if a real event happens.
- Each simulated event should be run using its runbook, with confirmation of execution verifiers. You should not announce the execution of the simulated event, as prompt detection is part of what you are testing for.
- For any physical simulations, such as unplugging equipment, ensure your game day team is not observed, because prompt diagnosis is part of what you are testing.
- Through your observers, identify when the issue has been addressed correctly.
- Use feedback from observers to judge if you should delay any simulated events. For teams that have higher maturity, consider executing multiple events simultaneously.
- Announce the end of the game day.
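The execution steps above can be sketched as a small loop that runs events in order, skips an event when observers advise a delay, and ends the game day as soon as a real production event is detected. This is only a sketch: `production_incident_active` and `observers_ready` are hypothetical callables standing in for your monitoring and observer feedback, and a delayed event is simply skipped rather than rescheduled.

```python
def run_game_day(events, production_incident_active, observers_ready):
    """Run simulated events in order and record what happened to each.

    events: list of (name, run) pairs, where run() executes the event's runbook.
    production_incident_active(): True if a real production event is under way.
    observers_ready(name): False if observers advise delaying this event.
    All three are hypothetical, caller-supplied inputs.
    """
    log = []
    for name, run in events:
        if production_incident_active():
            log.append((name, "aborted: real production event"))
            break  # end the game day; later events are not run
        if not observers_ready(name):
            log.append((name, "delayed"))
            continue
        run()
        log.append((name, "executed"))
    return log
```

The returned log gives the debrief a simple record of which events ran, which were delayed, and whether the day was cut short.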
Analyze the game day
- Using a root cause analysis (RCA) process, document where your tooling, processes, procedures, and personnel do not meet your needs and expectations.
- Debrief the game day upon completion to determine if you need to provide education, training, or additional tooling.
- Document opportunities for additional areas to test in subsequent game days.
- Examine the execution of the game day itself, and determine whether it can be improved.
Correction of Error (COE)
- Use a Correction of Error process to address issues.