What is fault tolerance?

If a system has a single point of failure, when one thing goes wrong, the whole process falls offline.

It’s possible to design systems with better fault tolerance so they can continue even after components fail, but is this always the best decision?

Before looking at where to try to build out fault tolerance, it’s important to start with a solid definition.

What is fault tolerance? 

Just like the name suggests, fault tolerance is a system’s ability to tolerate faults. So, it’s a system’s ability to keep going even after one or more of its components have failed. 

On the most basic level, it’s just like in those old ads for Timex, where they claim their watch “takes a licking and keeps on ticking.” 

But it’s a sliding scale, with some systems more fault tolerant that others. 

What are the types and some examples of fault tolerance? 

Although it’s possible for a system to 100% fault intolerant due to single points of failure (SPOF), where one failure pulls down the entire process, other systems can have varying degrees of tolerance. 

High fault tolerant 

Here, even when there’s one or more faults, the system continues to perform at the same level. For example, a facility might have backup generators that, in the case of a blackout, deliver the same level of electricity as the power grid. If you were inside the facility, you wouldn’t notice any changes at all, because all the lights would still be on, and the elevators would still be running. 

Low fault tolerant and fail safe 

But here, faults affect the level of performance, usually directly in proportion to the size and number of faults. So, there might be backup generators at your facility, but when there’s a blackout, you get power to emergency lighting and elevators only. The idea is that the system fails in ways that prevent injury to people and damage to property. 

It’s possible, and often desirable, to design systems that are the opposite of fail safe, where they are fail deadly. Usually related to military applications, the system “fires” even after some parts no longer work. 

What are the Three Cs, the fault tolerance criteria? 

Fault tolerance is appealing because of the promise of continued operation even after failure. But there are other considerations to weigh before trying to add fault tolerance to every system or component. 

For example, being able to keep driving even after a tire blowout is a huge advantage. But does that advantage outweigh the drawbacks, including increased costs and faster wear? Run-flat tires tend to cost about 30% more than their conventional cousins, and they also tend to wear out roughly 6000 miles sooner. 

Depending on the system and components, you might face increases in: 

  • Weight 
  • Size 
  • Consumption 
  • Cost  

Those are the ongoing penalties. There are also likely one-time ones tied to additional initial planning and testing. 

So, it’s important to ask the right questions and ensure components meet specific criteria before trying to make them fault tolerant. 

An easy way to remember is using the Three Cs of fault tolerance: criticality, chances of failure, and costs. 

Criticality 

First, you need to look at overall criticality. How critical is the system or component? If you could conceivably get by for a while without it, you don’t need to invest in its tolerance. Remember, criticality is relative, so something might be critical in one system but not another. 

A quick example is light bulbs. The ones in the headlights of your car are critical to driving safely at night. But the ones that light up when you open one of the doors? Much less so. 

Chances of failure 

Second, look at the lifelong likelihood of failure. For example, you likely don’t need to make the foundation of your facility fault tolerant because there is little chance of it failing.   

Cost

Third, look at cost. Even in cases where something is critical and could fail, it might not make economic sense to build out additional tolerance. Back to the example of light bulbs. There are ways to make light bulbs with a lot of fault tolerance, but no one does. Light bulbs are cheap to buy and hold in inventory. And when one burns out, a maintenance tech can replace it quickly, easily.

Depending on the asset or equipment, it can make more sense to focus on a different maintenance strategy, for example preventive maintenance.

How does an EAM help with fault tolerance? 

Remember, making good decisions about fault tolerance starts with a clear understanding of criticality, chances of failure, and costs. Basically, the more you know about your operations and the systems inside it, the easier it is for you to make the right decisions about fault tolerance, including where you need it and where you don’t. That can also include helping you decide which systems to back up and, moving forward, which assets to purchase. 

Modern enterprise asset management (EAM) platforms help you capture clean data and then keep is safe, secure, and accessible. And once it’s inside the software, you can leverage your data into maintenance metrics and key performance indicators, actionable business intelligence. 

Next steps

To find out how EAM software can support your goals, schedule a demo of ManagerPlus Lightning today. 

Summary 

Fault tolerance is a system or component’s ability to continue functioning after a partial failure. It’s the resistance to failure. Broadly speaking, there are three levels of fault tolerance: zero, high, and low. If a system has zero, it means there is one or more single points of failure. If one of them fails, the entire system falls offline. For example, if you lose the valve stem core on a tire, it can’t hold pressure, and you can’t use it. High fault tolerance is when the system continues even after a failure. So, facilities with backup generators still have all their lights and electricity during a blackout. But, if it is low tolerance, there might only be power to emergency lighting and the elevators. Here, the goal is to fail safe, which means designing the fault tolerance is minimize risk to people and damage to property. Although it’s tempting to try to increase fault tolerance, so systems never go offline, there are always tradeoffs that you need to weight carefully. Using the Three Cs of fault tolerance, criticality, chances of failure, and cost, you can determine case by case which systems to make more tolerant. The right EAM software makes the process easier because it helps you collect the data you need to make the right decisions. 

Scroll to Top