Troubleshooting with the Scientific Method
Troubleshooting is not an art; it is a science, and despite what you may have been told, it can be taught.
Using a scientific technique when troubleshooting may not be the swiftest method, but it guarantees success in resolving problems. The Scientific Method can be applied to any troubleshooting situation, which reduces all troubleshooting to a set of common steps that can be adapted to suit your needs. It is not specific to any technology, so no specific tools are covered here. The idea is to learn the principles of troubleshooting. You can learn how to use specific troubleshooting tools elsewhere on this site.
To get the most out of this tutorial, it is highly recommended that you either know or learn the OSI Model. Combine the Scientific Method with the OSI Model and your troubleshooting will be far more effective: you will succeed more often, and you will be able to state with confidence when a failure is due to vendor problems rather than your own work.
The Scientific Method is an investigative process that uses logic to formulate and test theories through observation and methodical experimentation. It is the basis of how mankind derives knowledge from the natural world around him. The Scientific Method has been around since mankind first started investigating why and how and shows up as early as 3,000 years ago in India's and Egypt's historical records.
How does the Scientific Method apply to Information Technologies and specifically to troubleshooting? If you want to solve a technical problem, you need a logical and systematic procedure that can be used to sift through the available information, discard what is irrelevant, discover other useful facts and make logical conclusions in order to arrive at the source of the problem. In most cases, you will use the Scientific Method not once but several times to arrive at the source of the problem.
THE SCIENTIFIC METHOD
- Gather Information
- State the Problem
- Form a Hypothesis
- Test the Hypothesis
- Observe Results & Draw Conclusions
- Repeat When Necessary
GATHER INFORMATION
You must gather good information about what problem is occurring in order to discover what is not functioning properly. It is absolutely critical that you gather as much information as possible. The most common cause of extended problems and outages is a lack of information.
When gathering information:
- Sort out what is related and unrelated to the problem.
- Write down what you know is related to the problem so you can refer back to it later.
- Most of the pieces of data you collect will suggest one or more things you can test in order to uncover the root of the problem.
- Sketching a diagram of everything you believe to be involved in the problem may be helpful.
The information you gather can and should come from multiple sources. There are several ways to gather information about the problem.
Check the Fundamentals
The early information collection phase is where knowing how a system works from the bottom up becomes useful. These days, very few systems are monolithic and totally self-contained. Everything is built in layers, with each simple layer supporting more advanced or complex functions above it until, in the end, they provide a working system. Understanding your environment and understanding the OSI Model are absolutely critical at this phase because they tell you which indicators to check. Unless you have good input pinpointing the source of the problem, start from the lowest level and check your indicators.
- Is everything properly plugged in and firmly seated?
- Is there power in the server room?
- Is the device powered on?
- Are the system status indicator lights green?
- Is the status light for the component green? Note that status lights flashing a regular amber on-off-on-off pattern are usually a bad sign.
- Do you hear any beeping?
- Do you hear any strange noises?
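The bottom-up order described above can be sketched in a few lines of Python. This is only an illustration: the layer numbers follow the OSI Model, and the probe functions are hypothetical stand-ins for whatever checks suit your environment.

```python
from typing import Callable, List, Optional, Tuple

# Each check is (layer number, description, probe). A probe returns True when
# the indicator looks healthy. The probes below are hypothetical placeholders.
Check = Tuple[int, str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[Tuple[int, str]]:
    """Run checks from the lowest layer upward and stop at the first failure."""
    for layer, description, probe in sorted(checks, key=lambda check: check[0]):
        if not probe():
            return layer, description
    return None  # every indicator looked healthy


if __name__ == "__main__":
    checks: List[Check] = [
        (7, "application responds", lambda: False),
        (1, "cable seated and link light on", lambda: True),
        (3, "default gateway reachable", lambda: False),
    ]
    # Layer 1 passes, so the lowest failing layer is reported first.
    print(first_failing_check(checks))  # (3, 'default gateway reachable')
```

Reporting the lowest failing layer first matters because a failure at a lower layer usually explains the failures above it.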
Checking Event Logs and System Data
There is a wealth of good information in the system and application logs, including error messages, crash notifications and exit codes. Collect these because you may need to provide them to the vendor when you contact them for support.
- Windows Event Logs
- UNIX/Linux Syslog
- Application Log
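As a rough illustration of collecting evidence from a text log, the sketch below pulls out lines that mention common failure keywords. The keyword list and the sample log lines are assumptions made for this example; real logs may use different severity labels and formats.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed failure keywords; adjust them to match the logs you actually have.
PATTERN = re.compile(r"\b(error|critical|fatal|fail(?:ed|ure)?)\b", re.IGNORECASE)


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain at least one failure keyword."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def keyword_counts(lines: Iterable[str]) -> Counter:
    """Count how often each failure keyword appears, ignoring case."""
    counts: Counter = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines in a syslog-like layout.
    sample = [
        "Jan 10 09:00:01 host sshd[101]: Accepted password for alice\n",
        "Jan 10 09:00:05 host app[202]: ERROR cannot open config\n",
        "Jan 10 09:00:09 host app[202]: connection failed, retrying\n",
    ]
    for line in suspicious_lines(sample):
        print(line)
    print(keyword_counts(sample))
```

To scan a real file, pass an open file object instead of the sample list, since both functions accept any iterable of lines.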
Pinpoint When the Problem Began (What Changed?)
When you are working with a system or application that has always worked perfectly well in the past, you have to determine what changed to cause the current problem to appear. Being able to find out what changed, and when, is the reason you need some sort of Change Management and Change Notification process within your organization.
Ask:
- When did the problem begin?
- What activity was going on when the problem began?
- What was the last change to the system prior to the problem starting?
- When was the last change applied?
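If your change records are available as timestamped entries, the last two questions can be answered mechanically. The following sketch assumes a simple list of (timestamp, description) pairs; the entries shown are invented for illustration.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# A change record is (when it was applied, what was done). Hypothetical format.
Change = Tuple[datetime, str]


def last_change_before(changes: List[Change], problem_start: datetime) -> Optional[Change]:
    """Return the most recent change applied at or before the problem began."""
    earlier = [change for change in changes if change[0] <= problem_start]
    if not earlier:
        return None
    return max(earlier, key=lambda change: change[0])


if __name__ == "__main__":
    # Invented change log entries.
    change_log = [
        (datetime(2024, 3, 4, 22, 0), "applied OS patches"),
        (datetime(2024, 3, 5, 6, 30), "edited logon script"),
        (datetime(2024, 3, 6, 9, 0), "replaced switch"),
    ]
    first_report = datetime(2024, 3, 5, 8, 15)
    print(last_change_before(change_log, first_report))
```

The change found this way is only a suspect: it gives you something to test, not a conclusion.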
Note Common Symptoms, Causes and Results
- What do the symptoms have in common?
- Is there some symptom that is unique to this problem?
- When you do X, expecting Y, does Z also always happen?
- If all computers share a common problem, then the probability is very good that the cause is also something they share. If only one computer has the problem, then the odds are good that the computer itself is the problem.
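The reasoning in the last point amounts to a set intersection: look for what the affected machines have in common that the healthy machines do not. The sketch below assumes you have recorded a set of traits per machine; the machine names and traits are hypothetical.

```python
from typing import Dict, Set


def shared_traits(affected: Dict[str, Set[str]]) -> Set[str]:
    """Return the traits that every affected machine has in common."""
    trait_sets = list(affected.values())
    if not trait_sets:
        return set()
    return set.intersection(*trait_sets)


def suspect_traits(affected: Dict[str, Set[str]], healthy: Dict[str, Set[str]]) -> Set[str]:
    """Traits shared by all affected machines and absent from every healthy one."""
    seen_on_healthy: Set[str] = set().union(*healthy.values())
    return shared_traits(affected) - seen_on_healthy


if __name__ == "__main__":
    # Hypothetical inventory data.
    affected = {
        "pc1": {"switch-A", "patch-42", "win10"},
        "pc2": {"switch-A", "patch-42", "win11"},
    }
    healthy = {"pc3": {"switch-A", "win10"}}
    print(suspect_traits(affected, healthy))  # {'patch-42'}
```

With only one affected machine, every trait unique to that machine becomes a suspect, which matches the observation that the computer itself is then the likely problem.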
Interview the User
Ask the user what he is experiencing, but treat this information source with extreme caution. Most users are not technical people and thus make unwarranted conclusions about what is wrong. Users also lie on occasion, especially when they think they might be held responsible for whatever is broken or inoperative.
- Try to reproduce the user's error
- Is the problem a real technical failure, or is the computer not doing what the user thinks or expects it should be doing?
Sometimes the key to fixing a problem is to observe the actual failure as it occurs. It is often a good idea to turn on additional logging or diagnostic modes, run the command in verbose mode or use other diagnostic tools to gather information.
STATE THE PROBLEM
This is the process of reviewing all available information and getting a clear understanding of the perceived failure or dysfunction. Putting the problem into words clarifies exactly what the problem is. The Problem Statement should be very clear about what the problem is, and is not.
Examples of a Problem Statement:
- Since Tuesday, all users logging into the Active Directory network have been reporting that they cannot access their personal share because the drive icon is missing.
- When browsing http://www.blah-blah-website.com, Internet Explorer crashes but Mozilla does not.
- All mail sent to someuser@somecompany.com comes back with an SMTP Reject message "Stop spamming us".
FORM A HYPOTHESIS
After collecting information and clearly stating exactly what the problem is, formulate a theory as to a possible cause. Phrase it so that it poses a question a test can answer, for example: is the file server down?
- PROBLEM STATEMENT: When any user logs on to the Active Directory network, they report that they cannot access their personal share because the drive icon is missing.
- HYPOTHESIS 1: The file server is down.
- HYPOTHESIS 2: The script that maps the drive is not running or working correctly.
TEST THE HYPOTHESIS
Once you have formed a hypothesis, devise a method to test it. Each test you perform should follow these simple principles:
- Change only one parameter or setting at a time.
- Leave every other parameter unchanged, so that any difference in behavior can be attributed to that single change.
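These principles can be illustrated with a small Python sketch. It assumes the system's settings can be represented as a dictionary and that you have some way to tell whether the system works; the setting names and the `works` function here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]


def try_single_changes(
    baseline: Config,
    candidates: List[Tuple[str, str]],
    works: Callable[[Config], bool],
) -> List[Tuple[str, str, bool]]:
    """Apply each candidate change alone to a fresh copy of the baseline."""
    results = []
    for key, value in candidates:
        trial = dict(baseline)  # start over from the baseline for every test
        trial[key] = value      # change exactly one setting
        results.append((key, value, works(trial)))
    return results


if __name__ == "__main__":
    # Hypothetical settings and a stand-in for a real functional test.
    baseline = {"dns": "10.0.0.9", "mtu": "9000"}

    def works(config: Config) -> bool:
        return config["dns"] == "10.0.0.1"

    for key, value, ok in try_single_changes(
        baseline, [("mtu", "1500"), ("dns", "10.0.0.1")], works
    ):
        print(f"{key}={value}: {'fixed' if ok else 'no effect'}")
```

Because each trial starts from an untouched copy of the baseline, a change that fixed nothing is automatically backed out before the next test runs.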
OBSERVE RESULTS & DRAW CONCLUSIONS
After each test, note whether the change you made did or did not solve the problem. You must note the results of your test, gather any new information from the system, application or user, and draw a conclusion as to whether the problem is solved or whether the change you made had any effect on the problem. Once you have drawn conclusions, you can devise new tests to eliminate other possible causes.
To quote Sir Arthur Conan Doyle's famous detective:
...when you have eliminated the impossible, whatever remains, however improbable, must be the truth.
--Sherlock Holmes
Translating this adage to modern geek-speak:
...when you've checked the basics and eliminated all configuration and operations-related possible causes, whatever remains is a vendor bug.
REPEAT UNTIL ROOT CAUSE IS FOUND AND THE ERROR IS CORRECTED
The entire Scientific Method process must be repeated until a solution is found. This troubleshooting method relies on identifying possible causes and eliminating each one until the true root cause of the problem is found. You cannot reliably find and fix the true root cause of the problem unless you apply the Scientific Method to your troubleshooting.
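The repeat-and-eliminate cycle can be pictured as a simple loop over hypotheses. The sketch below reuses the two example hypotheses from the missing drive icon problem; the test results attached to them are invented, and in practice each test would be a real check you perform.

```python
from typing import Callable, List, Optional, Tuple

# A hypothesis is (statement, test). The test returns True when the
# hypothesis is confirmed. The tests used below are stand-ins.
Hypothesis = Tuple[str, Callable[[], bool]]


def find_cause(hypotheses: List[Hypothesis]) -> Tuple[Optional[str], List[str]]:
    """Test hypotheses in order; return the confirmed one and those eliminated."""
    eliminated: List[str] = []
    for statement, is_confirmed in hypotheses:
        if is_confirmed():
            return statement, eliminated
        eliminated.append(statement)  # record what has been ruled out
    return None, eliminated  # nothing confirmed: gather more information


if __name__ == "__main__":
    hypotheses: List[Hypothesis] = [
        ("The file server is down", lambda: False),
        ("The script that maps the drive is not running or working correctly", lambda: True),
    ]
    cause, ruled_out = find_cause(hypotheses)
    print("Confirmed:", cause)
    print("Eliminated:", ruled_out)
```

When the function returns None, every current hypothesis has been eliminated, which is the signal to return to gathering information and form new ones.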