“Fail open”

January 11th, 2007 § 1 comment

Story time. Several years ago, I was taking a deep-diving class in February in a glacier lake in Washington called Lake Crescent. Being February, the water was bitterly cold. When we finally reached 129 feet, the unthinkable happened – my regulator failed. It was a simple sport-diver model, not designed for extreme conditions. The failure caused my air supply to rapidly bubble to the surface. I was able to assess the situation enough to know that the regulator was done for the dive, and that the only option was a surface ascent with my buddy.

The point of the story is this – my regulator “failed open”. Regulators are designed to do this: in the event of a failure, the armatures lock in the “air open” position, allowing enough time to assess the situation, locate your buddy or redundant systems, and ultimately end the dive alive.

The “Fail open” mentality can be applied to many other situations.

Some software examples:

  • When parsing a filename, look for characters to include, not exclude – an allowlist, not a blocklist (first sketch below)
  • When looking for an executable in a path, work backwards, not forwards, preventing the infamous c:\Program.exe problem on Windows (second sketch below)
  • When looking for a feature on a version of the OS, assume it isn’t there and fail false (third sketch below)
  • When you have asserts in code, put the retail check in too – far too often I’ve seen crashes in an application where there was an assert checking the very condition that crashed (fourth sketch below)
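
For the filename case, a minimal sketch of the include-list idea. The character set here is an assumption, not a rule – pick whatever your file system and application actually allow:

    #include <cctype>
    #include <string>

    // Accept only characters we explicitly trust; anything unexpected rejects
    // the whole name instead of slipping past a blocklist.
    bool IsSafeFileName(const std::string& name)
    {
        if (name.empty() || name.size() > 255)
            return false;
        for (unsigned char c : name)
        {
            if (!(std::isalnum(c) || c == '.' || c == '-' || c == '_'))
                return false;
        }
        return true;
    }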
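
For the executable-in-a-path case, a rough sketch of working backwards: test the longest candidate first, so a stray c:\Program.exe can never win over the intended path under c:\Program Files. FindExecutablePath is a hypothetical helper, and real code would also try ".exe" extensions and quote the result:

    #include <filesystem>
    #include <optional>
    #include <string>

    // Given an unquoted command line such as
    //   C:\Program Files\My App\app.exe /flag
    // test candidates from longest to shortest rather than splitting at the
    // first space.
    std::optional<std::string> FindExecutablePath(const std::string& cmdLine)
    {
        std::string candidate = cmdLine;
        for (;;)
        {
            if (std::filesystem::exists(candidate))
                return candidate;
            const auto lastSpace = candidate.find_last_of(' ');
            if (lastSpace == std::string::npos)
                return std::nullopt;        // nothing matched – give up safely
            candidate.erase(lastSpace);     // drop the trailing token and retry
        }
    }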
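
For OS feature detection, assume the feature is absent until the OS proves otherwise – probe for the export itself rather than trusting version numbers. A minimal Win32 sketch, with GetTickCount64 standing in for whatever newer export you care about:

    #include <windows.h>

    // Returns true only if we can positively prove the feature exists;
    // any failure along the way reports "not there".
    bool HasTickCount64()
    {
        HMODULE kernel32 = GetModuleHandleW(L"kernel32.dll");
        if (kernel32 == nullptr)
            return false;                   // can't prove it – fail false
        return GetProcAddress(kernel32, "GetTickCount64") != nullptr;
    }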
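
For the assert case, the retail build keeps the same check the assert documents, so the bad input is handled instead of crashing later. A small sketch with a made-up function:

    #include <cassert>
    #include <cstdio>

    bool WriteRecord(std::FILE* file, const char* record)
    {
        assert(file != nullptr && record != nullptr);   // catches it in debug builds
        if (file == nullptr || record == nullptr)       // retail check, same condition
            return false;                               // fail gracefully instead of crashing
        return std::fputs(record, file) >= 0;
    }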

§ One Response to “Fail open”

  • Jeff Payne says:

    Welcome to my world – this is what I do for a living. I oversee the safety-related design of embedded products for industrial automation systems – chemical plants, power plants, refineries.

    The interesting thing is that the standard I work to (IEC 61508) is becoming recognized as a measure of quality as much as of functional safety. Our customers want compliant products because of their increased reliability, even in non-safety applications. I’d like to see this become true for the non-embedded space as well, but in today’s world ship dates are still more important than quality. Maybe the growing interest in security – which is closely related to functional safety – will change this.

    I have seen firsthand the value of worst-case analysis and defensive programming. Far too often, things that “just can’t happen” do.

    How often do you write code that has to assume a bad pointer is going to come and tromp on your code/data? How about when an undetected transient memory error flips a bit on a piece of critical data? How likely is it that the data coming across your communications interface is corrupted?

    These issues may seem impossible to resolve, but there are fairly simple ways to detect these types of failures. Yes, there is a performance hit, but isn’t it worth it to ensure that something will just work?
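
    One simple shape such a check can take – a sketch with hypothetical names – is to pair each critical value with its bitwise complement and verify it before every use, so a stray pointer write or a flipped bit is detected instead of silently trusted:

        #include <cstdint>

        struct CriticalValue
        {
            std::uint32_t value;
            std::uint32_t complement;   // kept equal to ~value while the data is intact
        };

        void Store(CriticalValue& cv, std::uint32_t v)
        {
            cv.value = v;
            cv.complement = ~v;
        }

        bool Load(const CriticalValue& cv, std::uint32_t& out)
        {
            if (cv.complement != static_cast<std::uint32_t>(~cv.value))
                return false;           // corruption detected – refuse the data
            out = cv.value;
            return true;
        }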

    In my business the answer is obviously yes. There was an explosion in a BP refinery that killed 15 and injured 150. They were fined $21 million by OSHA and have budgeted over $700 million for settlements to the families of the victims. This could have been prevented if a simple measurement device with less than 500k of memory had worked properly – or had at least informed its operator that it wasn’t working properly.
