Software Niagara

presents

46th Monthly Meet-up on Monday, March 13, 2017

Fail-Safes

in Software and Systems Design

by Nicholas Bering

https://nicholasbering.ca

fail-safe

adjective

  1. causing a piece of machinery or other mechanism to revert to a safe condition in the event of a breakdown or malfunction.

noun

  1. a system or plan that comes into operation in the event of something going wrong or that is there to prevent such an occurrence.

Recognize the Possibility

Fail-safe systems are not necessarily designed to prevent a failure. They are designed to prevent the consequences of the failure from becoming dangerous or causing collateral damage.

Failure Mode and Effects Analysis (FMEA)

FMEA is a formal analysis framework, often used by engineers of critical systems such as flight systems. It considers:

  • Failure Mode
  • Potential Causes
  • Effects of Failure
    • Locally
    • On Related Components
    • On the Entire System

After analyzing the effects of a possible failure, engineers determine:

  • Probability of Failure
  • Severity of Failure
  • Ease of Detection
    • Including the period in which it is likely to go undetected.

This information is used to calculate the risk level as:

(Probability × Severity) + Detection

Assessing risk in this way allows an engineering team to objectively score items to determine which potential failures receive priority.
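
As a rough illustration of the scoring above, the sketch below ranks a few failure modes by Probability × Severity + Detection. The 1 to 5 scales, the convention that a higher detection score means harder to detect, and the three example failure modes are all assumptions made for this sketch, not values from a real analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    probability: int  # assumed scale: 1 (rare) to 5 (frequent)
    severity: int     # assumed scale: 1 (negligible) to 5 (catastrophic)
    detection: int    # assumed scale: 1 (obvious) to 5 (likely to go unnoticed)

    @property
    def risk(self) -> int:
        # Formula from the slide: Probability x Severity + Detection.
        return self.probability * self.severity + self.detection


# Hypothetical failure modes with made-up scores.
modes = [
    FailureMode("Database disk full", probability=2, severity=5, detection=2),
    FailureMode("Expired TLS certificate", probability=3, severity=3, detection=1),
    FailureMode("Silent backup failure", probability=2, severity=4, detection=5),
]

# Highest risk first, so the top of the list gets attention first.
ranked = sorted(modes, key=lambda m: m.risk, reverse=True)
for mode in ranked:
    print(f"{mode.risk:>3}  {mode.name}")
```

With these made-up scores the silent backup failure ranks first (13), ahead of the full disk (12) and the expired certificate (10), because it is the hardest to detect. Many FMEA worksheets instead use a Risk Priority Number that multiplies severity, occurrence, and detection; swapping the `risk` property is the only change needed to try that variant.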

Risks that have been identified and scored will usually also have a documented plan of action and a procedure to help mitigate the risk. This might be something as simple as documenting the risk.

Passive Fail-Safe

Physical Example

Braking systems on trains work by using air pressure to release the brake while the train is in motion. If air pressure is lost, the brakes will automatically engage, bringing the train car to a stop.

Passive Fail-Safe

Web Design Example
  • A <noscript> tag can be used to provide alternate content or instructions if the user's browser does not support JavaScript.
  • Use a Progressive Enhancement design methodology, such that a web page will work correctly with limited browser features, but additional features may be used if they are detected.

Active Fail-Safe

Physical Example
  • The Emergency Brake on a car uses a cable instead of hydraulics to engage the brake. It requires the driver to actively engage the safety mechanism. This is also a type of redundancy.
  • Air bags receive a signal from a sensor in the engine compartment to deploy during a collision. If the car's electrical system fails before the collision, they may not deploy.

Active Fail-Safe

Web Design Example
  • Browser feature shims (polyfills) that patch gaps in a browser's feature set, so the application does not break when a feature is missing.
  • A script that checks whether all JavaScript dependencies loaded correctly before starting application code. If something is missing, it dynamically adds an additional script tag pointing to an alternate source for the same library.
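
The dependency check in the second bullet can be sketched as "try each source in order, use the first one that works". The version below is written in Python rather than browser JavaScript, so plain functions stand in for script URLs; `load_with_fallback`, `load_from_cdn`, and `load_local_copy` are hypothetical names invented for this sketch.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def load_with_fallback(sources: Sequence[Callable[[], T]]) -> T:
    """Try each source in order and return the first result that loads."""
    errors: List[Exception] = []
    for source in sources:
        try:
            return source()
        except Exception as exc:
            # A failed source must not stop the check; remember why and move on.
            errors.append(exc)
    raise RuntimeError(f"all {len(sources)} sources failed: {errors!r}")


def load_from_cdn() -> dict:
    # Hypothetical primary source, simulated here as unavailable.
    raise ConnectionError("CDN unreachable")


def load_local_copy() -> dict:
    # Hypothetical alternate source for the same library.
    return {"name": "example-lib", "origin": "local"}


library = load_with_fallback([load_from_cdn, load_local_copy])
```

The application code would start only after `load_with_fallback` returns, which is what makes this an active fail-safe: something has to run the check and act on the result.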

Active Fail-Safe

Web Infrastructure Example
  • An Amazon EC2 Auto-Scaling pool with one server. If the server fails health checks or becomes unreachable, the auto-scaling logic starts a replacement server and fails over to it.
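
The replace-on-failure logic can be illustrated with a small in-memory simulation. This is not the AWS API: `Pool`, `launch`, and `is_healthy` are hypothetical names, and the "servers" are just strings, so the sketch only shows the shape of the control loop.

```python
import itertools
from typing import Callable


class Pool:
    """Keeps one active server and replaces it when a health check fails."""

    def __init__(self, launch: Callable[[], str],
                 is_healthy: Callable[[str], bool]) -> None:
        self._launch = launch
        self._is_healthy = is_healthy
        self.active = launch()

    def check(self) -> str:
        """Run one health check; start a replacement if the server failed."""
        if not self._is_healthy(self.active):
            self.active = self._launch()
        return self.active


# Simulated infrastructure: server ids and a set of failed servers.
_ids = itertools.count(1)
failed = set()


def launch() -> str:
    return f"server-{next(_ids)}"


def is_healthy(server: str) -> bool:
    return server not in failed


pool = Pool(launch, is_healthy)   # starts server-1
failed.add("server-1")            # simulate the server becoming unreachable
replacement = pool.check()        # health check fails, server-2 takes over
```

In a real deployment the check would run on a schedule and traffic would be re-routed to the replacement; both are left out here to keep the sketch short.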

General Strategies

  • Redundancy
    • Clustered Servers
    • RAID Arrays
    • Geo-Redundancy
    • Status Page on Separate Site

General Strategies

  • Circuit Breakers
    • Rate Limited APIs
    • System Outage Page Served by Load Balancer
    • Software that Restarts Gracefully on an Unhandled Exception
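
One common software reading of a circuit breaker is a wrapper that stops calling a failing dependency for a while instead of letting every request fail slowly. The sketch below is a minimal version of that pattern; the class name, the default thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # trip the breaker
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_profile(user_id))` with a hypothetical `fetch_profile`, and serve a fallback such as an outage page when `CircuitOpenError` is raised.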

General Strategies

  • Procedural Fail-Safes
    • Backups Before Upgrades
    • Blue-Green Deployments
    • Active Fail-Over Scenarios
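
A procedural fail-safe such as "backups before upgrades" can also be enforced in a script so the step cannot be skipped. The sketch below copies a file aside, runs an upgrade step, and restores the copy if the step raises; `upgrade_with_backup`, the `.bak` suffix, and the simulated `broken_upgrade` are hypothetical choices for this example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def upgrade_with_backup(target: Path, upgrade: Callable[[Path], None]) -> bool:
    """Back up target, run upgrade, and restore the backup if it raises."""
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)
    try:
        upgrade(target)
    except Exception:
        shutil.copy2(backup, target)  # roll back to the pre-upgrade state
        return False
    return True


def broken_upgrade(path: Path) -> None:
    # Simulated upgrade that corrupts the file and then fails midway.
    path.write_text("version=2 (partial")
    raise RuntimeError("upgrade failed midway")


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "app.conf"
    config.write_text("version=1\n")
    ok = upgrade_with_backup(config, broken_upgrade)
    restored = config.read_text()
```

A real upgrade would usually cover a database or a whole release rather than one file, but the ordering is the same: take the backup first, and make the rollback path part of the procedure.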

Detection and Logging

  • It's difficult to fix something if you can't detect that it's broken.
  • Logging and error reporting systems help to detect a problem early.
  • Undetected failures are more likely to cause long-term collateral damage.

Thank You