Software Niagara
presents
46th Monthly Meet-up on Monday, March 13, 2017
fail-safe
adjective
- causing a piece of machinery or other mechanism to revert to a safe condition in the event of a breakdown or malfunction.
noun
- a system or plan that comes into operation in the event of something going wrong or that is there to prevent such an occurrence.
Recognize the Possibility
Fail-safe systems are not necessarily designed to prevent a failure.
They are designed to prevent the consequences of the failure from becoming
dangerous or causing collateral damage.
Failure Mode and Effects Analysis (FMEA)
FMEA is a formal analysis framework, often used by engineers of critical systems
such as flight systems. It considers:
- Failure Mode
- Potential Causes
- Effects of Failure
  - Locally
  - On Related Components
  - On the Entire System
After analyzing the effects of a possible failure, engineers determine:
- Probability of Failure
- Severity of Failure
- Ease of Detection
  - Including the period in which it is likely to go undetected.
This information is used to calculate the risk level as:
Probability × Severity × Detection
Assessing risk in this way allows an engineering team to objectively
score items to determine which potential failures receive priority.
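The scoring step can be sketched in a few lines. This is a minimal illustration, assuming the conventional FMEA risk priority number (the product of the three ratings) and a 1 to 10 scale for each rating. The `FailureMode` class, the `prioritize` helper, and the sample ratings are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet; ratings run from 1 to 10."""
    name: str
    probability: int  # 1 = very unlikely, 10 = almost certain
    severity: int     # 1 = negligible, 10 = catastrophic
    detection: int    # 1 = caught immediately, 10 = likely to go unnoticed

    def risk_priority(self) -> int:
        # Assumed scoring rule: the product of the three ratings.
        return self.probability * self.severity * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest risk."""
    return sorted(modes, key=lambda m: m.risk_priority(), reverse=True)


# Illustrative ratings only; these are not measured values.
modes = [
    FailureMode("database disk full", probability=4, severity=8, detection=3),
    FailureMode("expired TLS certificate", probability=3, severity=7, detection=2),
    FailureMode("silent backup failure", probability=2, severity=9, detection=9),
]

for mode in prioritize(modes):
    print(f"{mode.risk_priority():4d}  {mode.name}")
```

In this made-up data the failure that is hardest to detect ranks first even though it is the least likely, which is the reason detection is scored at all.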
Risks that have been identified and scored usually also have a documented
plan of action and a procedure to help mitigate them. This might be
as simple as documenting the risk.
Passive Fail-Safe
Physical Example
Braking systems on trains work by using air pressure to release the
brake while the train is in motion. If air pressure is lost, the
brakes will automatically engage, bringing the train car to a stop.
Passive Fail-Safe
Web Design Example
- A <noscript> tag can be used to provide alternate content
or instructions if the user's browser does not support JavaScript or has it disabled.
- Use a progressive enhancement design methodology, so that a
web page works correctly with limited browser features
and uses additional features only when they are detected.
Active Fail-Safe
Physical Example
- The emergency brake on a car uses a cable instead of hydraulics to
engage the brake. It requires the driver to actively engage the
safety mechanism. This is also a type of redundancy.
- Air bags receive a signal from a sensor in the engine compartment
to deploy during a collision. If the car's electrical system fails
before the collision, they may not deploy.
Active Fail-Safe
Web Design Example
- Browser feature shims that patch deficiencies in a browser's feature
set to keep the application from breaking because of missing features.
- A script that checks whether all JavaScript dependencies loaded correctly
before starting application code. If something is missing, it dynamically adds
an additional <script> tag pointing to an alternate source for the same library.
Active Fail-Safe
Web Infrastructure Example
- An Amazon EC2 Auto-Scaling pool with one server. If the server fails health checks
or becomes unreachable, the auto-scaling logic starts a replacement server and fails over
to it.
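The replace-on-failure behaviour can be modelled without any cloud service. The sketch below is not the AWS API: `Server`, `SingleServerPool`, and `launch_server` are hypothetical stand-ins, and the health flag is flipped by hand instead of being probed over the network.

```python
import itertools


class Server:
    """Stand-in for a real instance; `healthy` is set by hand here."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class SingleServerPool:
    """Keeps exactly one healthy server in service."""

    def __init__(self, launch):
        self._launch = launch        # callable that returns a new Server
        self.active = launch()

    def check(self):
        """Run one health check; replace the active server if it fails."""
        if not self.active.healthy:
            self.active = self._launch()   # fail over to a fresh server
        return self.active


_ids = itertools.count(1)


def launch_server():
    return Server(f"server-{next(_ids)}")


pool = SingleServerPool(launch_server)
first = pool.active
first.healthy = False            # simulate a failed health check
replacement = pool.check()
print(first.name, "->", replacement.name)
```

A real pool would run the check on a timer and wait for the replacement to pass its own health check before routing traffic to it.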
General Strategies
- Redundancy
  - Clustered Servers
  - RAID Arrays
  - Geo-Redundancy
  - Status Page on Separate Site
General Strategies
- Circuit Breakers
  - Rate-Limited APIs
  - System Outage Page Served by Load Balancer
  - Software that Restarts Gracefully on an Unhandled Exception
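A software circuit breaker can be sketched as a small wrapper around a risky call. This is one common shape of the pattern, not a specific library's implementation: the `CircuitBreaker` class, its parameters, and the `flaky` function are hypothetical names chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky():
    raise ConnectionError("backend unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")
print(outcomes)
```

Once the breaker opens, callers fail fast instead of piling more requests onto a backend that is already struggling, which limits collateral damage to the rest of the system.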
General Strategies
- Procedural Fail-Safes
  - Backups Before Upgrades
  - Blue-Green Deployments
  - Active Fail-Over Scenarios
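"Backups before upgrades" can be enforced in code as well as in a checklist. The following is a minimal sketch for a single file, assuming the upgrade is a function that modifies the file in place. `upgrade_with_backup`, `broken_upgrade`, and the file name are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def upgrade_with_backup(target: Path, upgrade) -> bool:
    """Copy `target` aside, run `upgrade(target)`, and restore on failure."""
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)          # take the backup first
    try:
        upgrade(target)
    except Exception:
        shutil.copy2(backup, target)      # roll back to the saved copy
        return False                      # a real tool would also log the error
    return True


def broken_upgrade(path: Path) -> None:
    # Simulates an upgrade that corrupts the file and then fails.
    path.write_text("half-written")
    raise RuntimeError("migration failed")


with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "settings.conf"
    config.write_text("version=1")
    ok = upgrade_with_backup(config, broken_upgrade)
    restored = config.read_text()

print(ok, restored)
```

The point of the ordering is that the backup exists before anything is changed, so the rollback path never depends on the upgrade having partially succeeded.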
Detection and Logging
- It's difficult to fix something if you can't detect that it's broken.
- Logging and error reporting systems help to detect a problem early.
- Undetected failures are more likely to cause long-term collateral damage.
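One low-effort way to make failures visible is to log every exception at the boundary of a unit of work before letting it propagate. The sketch below uses Python's standard `logging` module. The `logged` decorator, the `ListHandler` class (a stand-in for a real error reporting service), and the logger name are hypothetical.

```python
import functools
import logging

logger = logging.getLogger("app.failsafe")   # hypothetical logger name


def logged(func):
    """Log any exception raised by `func`, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the message plus the traceback.
            logger.exception("unhandled error in %s", func.__name__)
            raise
    return wrapper


class ListHandler(logging.Handler):
    """Collects records in memory; stands in for an error reporting service."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


handler = ListHandler()
logger.addHandler(handler)


@logged
def parse_port(value):
    return int(value)


try:
    parse_port("eighty")
except ValueError:
    pass   # the caller still decides how to recover

print(len(handler.records), handler.records[0].getMessage())
```

Re-raising after logging keeps detection separate from recovery: the failure is recorded even if a caller further up decides to swallow it.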