The Dangers of Blind Trust
Do not blindly trust that your instrumentation is giving you the complete picture of your delivery ecosystem.
How many times have you found yourself in a situation where everything appeared to be fine when suddenly chaos struck? It might have been a production service crashing catastrophically at work, or a car or important device that suddenly malfunctioned. While the failure itself might have been terrible, it is the sting of the surprise that usually hurts far more.
These sorts of problems happen to all of us. The bigger question is why they happen.
One of the biggest dangers to our decision-making efficacy starts when we begin to blindly trust that everything will always behave as expected. When this occurs, our situational awareness narrows. We grab onto any indication that everything is still normal, quickly dismissing or failing to notice any signs that indicate otherwise.
A huge contributor to this is the way we instrument our ecosystems, and the trust we put in the information coming from that instrumentation to cut through all the complexity and deliver the infallible truth of what is actually happening. Describing all the flaws of instrumentation, and the wider problem of how we build and maintain personal and team situational awareness, is a big subject. It requires far more than a single blog article to explain. I wrote hundreds of pages on the subject that I then had to trim way down to fit into the Lean DevOps book. I would also recommend Steven Spear’s Chasing the Rabbit, which I feel does a good job of explaining a number of these dangers.
In any case, a good place to start understanding these dangers is a well known instrumentation incident in another industry: Three Mile Island.
The Three Mile Island Incident
A lot of noise has been made about the Three Mile Island nuclear incident. Because of how public communication around the event was handled, and how it turned public opinion against civilian nuclear power plants, most believe it represents an example of the dangers of nuclear power.
However you feel about nuclear power, the real lesson is the danger of blindly trusting instrumentation in a complex ecosystem.
The incident started when a problem with a regular maintenance activity, one used to reduce cooling pipe corrosion, caused the secondary non-nuclear cooling system of the second reactor to shut down. This halted the steam turbines used to generate power and triggered the nuclear reactor itself to SCRAM, immediately dropping in its control rods to halt the nuclear reaction.
So far, so good.
Normally such action would trigger three auxiliary coolant pumps to start so that water could then cool the reactor. However, the valves for that system had been closed for the initial maintenance, preventing any water from circulating.
Blindly Trusting an Indicator Light
Without any way to cool the reactor, steam continued to build inside it, increasing the pressure. This caused a pressure relief valve to open to release the excess pressure and prevent the reactor from exploding. Once the pressure was reduced, the valve should have automatically closed.
The valve did not close. Yet, due to a faulty solenoid, the instrument indicating the state of the valve gave the operators the impression that it had.
Despite other monitoring devices indicating that something was not normal, the operators remained blissfully ignorant of the crisis unfolding before them.
This obliviousness continued for 165 minutes after the incident began. The operators only became aware when radiation alarms began to ring, well after it was too late.
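This failure of the indicator light is worth lingering on, because the same pattern shows up constantly in software delivery: a dashboard or health check that reports what the system was told to do rather than what it is actually doing. The following is only an illustrative sketch, with hypothetical names and logic of my own rather than anything from the incident or any particular monitoring tool, but it shows how easy it is to wire a "green light" to the command that was sent instead of to an independent measurement of the result.

```python
from dataclasses import dataclass

# Hypothetical relief valve model, purely for illustration.
@dataclass
class ReliefValve:
    commanded_closed: bool = False   # what the control system asked the valve to do
    measured_position: str = "open"  # what an independent position sensor reports

def status_light_from_command(valve: ReliefValve) -> str:
    # Anti-pattern: the "indicator" only reflects the command that was sent,
    # much like a panel light wired to the solenoid signal.
    return "CLOSED" if valve.commanded_closed else "OPEN"

def status_light_from_measurement(valve: ReliefValve) -> str:
    # Safer: report the independently measured state, and surface any
    # disagreement between command and observation instead of hiding it.
    if valve.commanded_closed and valve.measured_position != "closed":
        return "FAULT: commanded closed but measured open"
    return valve.measured_position.upper()

valve = ReliefValve(commanded_closed=True, measured_position="open")
print(status_light_from_command(valve))      # "CLOSED" - reassuring, and wrong
print(status_light_from_measurement(valve))  # "FAULT: commanded closed but measured open"
```

The design choice matters: an indicator derived from an independent observation can still fail, but at least it can disagree with the command and draw attention to the mismatch, which is exactly what the operators at Three Mile Island never got.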
Stumbling Blindly Into History
Coolant water continuing to stream out through the open relief valve created two problems. The first was that the relief valve tank began to overflow, allowing radioactive coolant to flow out of the containment building.
The second was that the remaining coolant became superheated, causing the steam to cavitate and damage the reactor. The reduction of coolant also exposed the top of the reactor to the air, causing it to go into meltdown. Hydrogen gas also started to build up, later causing a minor explosion.
The radiation alarms were triggered once radioactivity from this coolant had breached the containment building and was released into the environment.
The Aftermath
While the health effects of this release were minor, the situational awareness failures continued to cascade. As the operator tried to unwind what had happened, its public announcements were slow.
This chaos deeply impacted public trust. Some believed that the accident was far worse than what was reported. Others misread the operator blindness as a larger problem with nuclear power itself. Increasingly, the public viewed nuclear power as far too complex and dangerous for civilian use. Some demanded that other plants be decommissioned, while expensive and often unnecessary additional safety features were required of newer designs.