The Last 1% – How to Troubleshoot Complex Problems

Jul 16, 2024 | Water Treatment

Lessons from 10 years of operations and engineering experience in the water industry

I have spent 10 years in operations, commissioning and engineering within the water industry. In that time, I have had hands-on experience solving problems with pumps, pipes, processes, instruments, software, people and more. The main thing I have learned is that you can be 99% right, but the last 1% can bring the entire system down.

This article will go through lessons I have learned, common pitfalls, a step-by-step process to follow and tips on how to tackle complex problems.

Lessons Learnt

In my first year out of university I was working as an operator at a Water Treatment Plant when our pre-lime dose pumps stopped producing flow. A colleague and I believed it was due to a blockage. We flushed all the pipes but were having no luck getting flow. We were feeling a bit stumped when our supervisor, a great troubleshooter, came in to help.

He got stuck straight in, systematically pulling pipework apart. I distinctly recall telling him that we had already flushed the section of pipe he was pulling apart. He ignored me and did it anyway. Of course, when we looked down the end of the pipe, we could see our blockage.

I remember being a little heated when my supervisor wasn’t listening to my suggestions, and embarrassed when he was right. He wasn’t being rude. He knew we wouldn’t need his help if we truly knew the cause of the problem. My supervisor later told me that he never relies on what people tell him when approaching a problem. He proves everything to himself, starting from the beginning of the process and working all the way through to the end.

It turns out his approach is very effective. If we are wrong about the source of the problem, then our solutions won’t work, which is a waste of time and money. If we know the root cause, then the solution is often trivial. This is why my supervisor didn’t listen to my suggestions. Second-hand information isn’t good enough to draw meaningful conclusions.

We assumed the section of pipe was clear because we had flushed it. When we looked inside, we found out the truth. Questioning and testing your assumptions is the single most useful tool for finding the root cause of a problem. After finding the blockage, my supervisor didn’t stop there. He proceeded to work systematically through all the pipework, checking for more blockages. He had found the likely culprit, but he didn’t assume it was the only cause.

Common Pitfalls

A common mistake in troubleshooting is to jump to conclusions prematurely. I have dealt with a wide range of issues, and it seems to be human nature to want to replace the whole system rather than take the time to identify the cause. I call these drastic solutions. For example, it feels like every instrument issue starts off with the replacement of expensive I/O cards. Occasionally the I/O card is the problem. When it isn’t, people move on to the next expensive replacement, and the next, until they finally solve the problem. I find this tendency frustrating, since all it takes is a few minutes with a multimeter to test the original hypothesis, saving a lot of time and money.
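
As an illustration of the multimeter check, here is a minimal Python sketch. It assumes the instrument is on a standard 4-20 mA analog loop with linear scaling, which the example above does not specify. The function names, the 0 to 100 range and the 1% tolerance are hypothetical. The idea is to convert the loop current measured at the card into an engineering value and compare it with what the control system displays.

```python
def ma_to_engineering(current_ma: float, range_low: float, range_high: float) -> float:
    """Convert a 4-20 mA loop current to engineering units (linear scaling).

    4 mA maps to range_low and 20 mA maps to range_high.
    """
    span_fraction = (current_ma - 4.0) / 16.0
    return range_low + span_fraction * (range_high - range_low)


def readings_agree(current_ma: float, displayed_value: float,
                   range_low: float, range_high: float,
                   tolerance_pct: float = 1.0) -> bool:
    """Return True if the displayed value matches the measured loop current.

    tolerance_pct is a percentage of the instrument span (assumed value).
    """
    expected = ma_to_engineering(current_ma, range_low, range_high)
    tolerance = (range_high - range_low) * tolerance_pct / 100.0
    return abs(expected - displayed_value) <= tolerance


# Hypothetical example: 12.0 mA measured on a 0-100 % level transmitter
# while the operator screen shows 50.3 %.
card_reads_correctly = readings_agree(12.0, 50.3, 0.0, 100.0)
```

If the two values agree, the card is probably reading the signal correctly and the fault is more likely elsewhere in the loop, which is worth knowing before anything expensive gets replaced.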

This lesson applies to more than just small problems. Entire treatment plants can be failing, and drastic solutions are often implemented before the cause of the problem is understood.

When I analyse the situation above, I believe there are a few drivers for the tendency to jump to conclusions:

  1. We are uncomfortable with not knowing why something isn’t working so we cling to the first potential solution we think of.
  2. We gravitate towards thinking this is the same issue we have seen in the past. It’s easier to think about what we know than to consider unknowns.
  3. We tend to believe the issue lies in the component we understand the least. If we have a weak spot in our knowledge, we are more likely to be suspicious of that area and less capable of investigating it.

The solution is simple, but surprisingly difficult to do: don’t make assumptions, be okay with not knowing, take a systematic approach that looks at every link in the chain, and test every hypothesis and assumption, starting in the cheapest and simplest way possible.

Step-by-Step Troubleshooting Guide

Gather Information: Gain an understanding of the problem in a non-invasive manner

  1. If you aren’t the first person to find the issue, ask people what the problem is and what they saw exactly. Question the information provided (try to be as diplomatic as possible). Look for underlying assumptions, bold claims, logical inconsistencies and contradictions in the information presented. Verify what you gather in this step against your own understanding of the system and the information you collect in the next step.
  2. Review trends, values and alarms to add to your available information. Start building a timeline of what happened. Try to reverse engineer the sequence of what happened in your mind or on paper.
  3. Go into the field and physically look at the equipment. Is there power to the equipment? Are there status or warning lights active? Take note of everything.
  4. Visualise the system from start to finish to make sure you understand how it works. Think about upstream and downstream processes and how they interact with the system in question. Review O&M Manuals, Functional Descriptions, P&IDs and other operational documentation for more information. See Tip 1 below to learn more about how to get better at this step.
  5. Ask yourself: what has changed? If the system was working last week, what has changed since then? Something must have changed to cause the issue. Did the pump duty change? Was there a recent upgrade? Did the breaker trip? Ask others if they made any changes recently.
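
The timeline and "what has changed?" steps above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the event tuples, their sources and the timestamps are hypothetical, and in practice they would come from alarm logs, trend exports and conversations with colleagues.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge events from several sources into one list sorted by time.

    Each event is a (timestamp, source, description) tuple.
    """
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: event[0])


def changes_since(timeline, last_known_good):
    """Return only the events that happened after the system last worked."""
    return [event for event in timeline if event[0] > last_known_good]


# Hypothetical records gathered from alarms and from asking people.
alarms = [
    (datetime(2024, 7, 10, 14, 5), "alarm", "Dose pump low flow"),
    (datetime(2024, 7, 3, 9, 30), "alarm", "Lime tank low level"),
]
reported_changes = [
    (datetime(2024, 7, 9, 8, 0), "maintenance", "Pump duty swapped"),
]

timeline = build_timeline(alarms, reported_changes)
suspects = changes_since(timeline, last_known_good=datetime(2024, 7, 8))
```

Here `suspects` holds the duty swap followed by the low flow alarm, which gives a short list of things to verify in the field rather than a conclusion.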

Conduct Minor Tests: Poke the system in a controlled, safe manner and watch how it responds

  1. Hypothesize what the issue might be. Alternatively, choose an assumption you would like to test.
  2. Develop a method and test the hypothesis/assumption. For example, start the pump and watch it in the field to see if it is actually turning on. Occasionally the test may be to implement a solution. If so, avoid time-consuming or expensive changes early in the process where possible.
  3. Review results and iterate from step 1. Use the Gather Information section above for this step. If no cause is appearing after your tests, broaden the scope of the tests further upstream and downstream. If the cause is becoming more certain, narrow and deepen the scope to increase your confidence.
  4. Only move on when your certainty in the cause of the problem is high.
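
One way to picture the loop above is as a list of hypotheses tested in order of cost. The sketch below is a hypothetical illustration, not a prescribed method: the `Hypothesis` class, the cost estimates and the stand-in test functions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Hypothesis:
    description: str
    cost_hours: float            # rough effort needed to run the test
    test: Callable[[], bool]     # returns True if the test supports the hypothesis


def run_cheapest_first(hypotheses: List[Hypothesis]) -> List[Tuple[str, bool]]:
    """Run every test in order of increasing cost and record the outcome."""
    results = []
    for hypothesis in sorted(hypotheses, key=lambda h: h.cost_hours):
        results.append((hypothesis.description, hypothesis.test()))
    return results


# Hypothetical checks for a dose pump that produces no flow. The lambdas
# stand in for real field observations.
hypotheses = [
    Hypothesis("Pump internals worn", 8.0, lambda: False),
    Hypothesis("No power at the pump", 0.25, lambda: False),
    Hypothesis("Blockage in suction line", 1.0, lambda: True),
]

results = run_cheapest_first(hypotheses)
```

The sketch runs every test rather than stopping at the first positive result, so a likely culprit does not hide a second cause. In the field you would review the results after each test and decide whether the remaining, more expensive tests are still justified.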

Implement Solutions and Re-test: Start with small solutions first

  1. Using information gathered from your minor tests, implement the easiest and cheapest fixes first. Do not make too many changes at once, otherwise you won’t know what the cause was. See Tip 2 below for more on this issue.
  2. Evaluate the effectiveness of the solution. Patience is key. Try not to rush this step. Sometimes intermittent issues require days of operation before conclusions can be drawn. Resist the urge to escalate to expensive overhauls.
  3. Iterate until the solution is found. Return to the previous sections if solutions are not having any effect.

Keep Going: While you are there you may as well check everything

  1. After finding and solving an issue, think through whether there are any other potential issues or causes and test them, just in case.
  2. If you have started at the beginning of the process, keep working through it until the end. For example, work from the dose tank all the way through to the dose point.
  3. Don’t leave site immediately after implementing a solution that works. Monitor the issue over a longer time period to check whether the issue is truly solved. This may require days of ongoing monitoring if the issue is intermittent.

Tip 1 – Understand the Process

A requirement for effective troubleshooting is an accurate understanding of the process and the ability to visualise the system. Here are some ways to get better at this that you can work on now, before an issue occurs. Follow the process lines from start to finish by yourself and try to fully understand how the system works. Only after doing it alone should you ask for help. Having someone explain it to you first is like looking up the answers at the back of a textbook. You learn the answer without gaining understanding. The act of trying is necessary for learning.

If available, read documentation like O&M Manuals, P&IDs and Functional Descriptions to gain a deeper understanding of the system. Ideally, take copies of these documents out into the field and try to match what you see to the documents.

Tip 2 – Multiple Small Issues

A 1% issue can prevent a system from working, but the aggregate of several 0.5% issues can also bring a system down. Fixing any one of these issues may not fix the problem, but may still be necessary for resolving it. It takes patience, confidence and thoroughness to solve these types of issues. The key is to make a few small changes, test the system, then make a few more. Making too many changes at once can make it impossible to know what the actual problem was, and it lowers the quality of any information gathered during a test.
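
The change-then-test rhythm can be sketched as follows. This is a toy model under stated assumptions: the function name, the list of fixes and the simulated faults are hypothetical, and the "system" is just a set of outstanding faults.

```python
def apply_in_small_batches(changes, apply_change, system_ok, batch_size=1):
    """Apply changes a few at a time and re-test the system after each batch.

    Returns a log of (batch, system_ok_after_batch) pairs.
    """
    log = []
    for start in range(0, len(changes), batch_size):
        batch = changes[start:start + batch_size]
        for change in batch:
            apply_change(change)
        log.append((batch, system_ok()))
    return log


# Simulated system with two small faults that must both be fixed.
outstanding_faults = {"clear strainer", "tighten suction fitting"}

log = apply_in_small_batches(
    changes=["clear strainer", "recalibrate flow meter", "tighten suction fitting"],
    apply_change=outstanding_faults.discard,
    system_ok=lambda: not outstanding_faults,
)
```

In this simulation the system only passes after the third change, even though the first change was also necessary. The log keeps a record of what was done before each test, which is the information that gets lost when everything is changed at once.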

Do you have any troubleshooting tips you would like to share? Comment below. I would be interested in hearing them. Please repost this article if you have found it helpful.

About the author: Benjamin Demmer-Knight is a Chemical Engineer with 10 years of experience in water treatment operations and engineering and is the managing director of Water Treatment Engineers NZ.

