Troubleshooting is not an art; it is a science, and despite what you may have been told, it can be taught. Most engineers and technicians work from past experience alone. If they have seen a problem before and have been shown how to solve it, they can fix it. When a new problem appears that they have not seen before, it tends to escalate out of control, and management starts looking for someone to fire.

Using the Scientific Method when troubleshooting may not be the swiftest path to a resolution, but it is the most certain path to resolving complex problems and finding a permanent fix. The Scientific Method can be applied to any troubleshooting situation. It reduces all troubleshooting to a standard set of common steps that can be adapted to suit your needs. It is not specific to any technology, so no specific tools are covered here. The idea is to learn the principles of troubleshooting. You can learn how to use specific computer and network troubleshooting tools elsewhere on this site.

To get the most out of this tutorial, it is highly recommended that you either know or learn the OSI Model. Combine the Scientific Method with the OSI Model and your troubleshooting will be far more effective. You will succeed more often--and be able to state with confidence when a failure is due to a vendor problem rather than your own work.

The Scientific Method is an investigative process that uses logic to formulate and test theories through observation and methodical experimentation. It is the basis of how mankind derives knowledge from the natural world around him. The Scientific Method has been around since mankind first started asking "Why?" and "How?" and shows up as early as 3,000 years ago in India's and Egypt's historical records.

How does the Scientific Method apply to Information Technologies and specifically to troubleshooting? If you want to solve a technical problem, you need a logical and systematic procedure that can be used to sift through the available information, discard what is irrelevant, discover other useful facts and make logical conclusions in order to arrive at the source of the problem. In most cases, you will use the Scientific Method not once but several times to arrive at the source of the problem.

The Scientific Method

The Scientific Method is the key to troubleshooting your computer and network problems. It burns away irrelevancies and brings you to the root cause. There are six steps in the scientific method:

  1. Gather Information
  2. State the Problem
  3. Form a Hypothesis
  4. Test the Hypothesis
  5. Observe Test Results & Draw Conclusions
  6. Repeat as necessary

 

Gather Information

"It is a capital mistake to theorize before you have all the evidence. It biases the judgment." -- Sherlock Holmes, A Study In Scarlet, Ch. 3, p. 27

You must gather reliable information about what problem is occurring in order to discover what is not functioning properly. It is absolutely critical that you gather as much information as possible. The most common cause of extended problems and outages is a lack of information.

When gathering information:

  • Sort out what is related and unrelated to the problem.
  • Write down what you know is related to the problem so you can refer back to it later.
  • Most of the pieces of data you collect will suggest one or more things you can test in order to uncover the root of the problem.
  • Sketching a diagram of everything you believe to be involved in the problem may be helpful.

The information you gather can and should come from multiple sources. There are several ways to gather information about the problem.

Check the Fundamentals

The early information collection phase is where knowing how a system works from the bottom up becomes useful. These days, very few systems are monolithic and totally self-contained. Everything is built in layers, with each simple layer supporting more advanced and complex functions until, in the end, they provide a working system. Understanding your environment and understanding the OSI Model are absolutely critical at this phase, because they tell you which indicators to check. Unless you have good input pinpointing the source of the problem, start from the lowest level and check your indicators.

  • Is everything properly plugged in and firmly seated?
  • Is there power in the server room?
  • Is the device powered on?
  • Is the system status indicator light green?
  • Is the status light for the component green? Note that a status light flashing a regular amber on-off-on-off pattern is usually a bad sign.
  • Do you hear any beeping?
  • Do you hear any strange noises?
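The bottom-up order of these checks can be sketched in a few lines of Python. This is only an illustration: the layer descriptions and the hard-coded True/False results are assumptions, and real checks would probe the actual equipment.

```python
from typing import Callable, List, Optional, Tuple

# Each check is (description, function returning True when that layer looks healthy).
Check = Tuple[str, Callable[[], bool]]

def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run the checks in order, lowest layer first, and return the first failure."""
    for description, is_healthy in checks:
        if not is_healthy():
            return description
    return None  # every indicator looked healthy

# Hypothetical results, hard-coded for illustration only.
checks = [
    ("Layer 1: cable seated and link light green", lambda: True),
    ("Layer 2: switch port up", lambda: True),
    ("Layer 3: default gateway reachable", lambda: False),
    ("Layer 7: application responds", lambda: False),
]

print(first_failing_check(checks))
```

Stopping at the first failure matters because everything above a broken layer will usually look broken too, so the lowest failing indicator is the one worth investigating first.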

Checking Event Logs and System Data

There is a wealth of good information in the system and application logs, including error messages, crash notifications and exit codes. Collect these, because you may need to provide them to the vendor when you contact them for support.

  • Windows Event Logs
  • UNIX/Linux Syslog
  • Application Log
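As a rough sketch of sifting a log for relevant entries, the Python below filters lines by keyword. The keyword list is an assumption, and the sample lines are invented for illustration (the host name "host1" and program name "myapp" are hypothetical, not output from any real system).

```python
import re
from typing import Iterable, List

# The keywords are an assumption; tune them to the messages your systems emit.
ERROR_PATTERN = re.compile(
    r"\b(error|fail(ed|ure)?|crash(ed)?|denied|timeout)\b", re.IGNORECASE
)

def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return only the log lines that contain one of the error keywords."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]

# Invented sample lines in a syslog-like layout.
sample_log = [
    "Mar  4 13:01:12 host1 myapp[812]: connection accepted",
    "Mar  4 13:01:14 host1 myapp[812]: error: share not found",
    "Mar  4 13:02:40 host1 myapp[812]: link up",
    "Mar  4 13:03:05 host1 myapp[901]: Failed to open configuration",
]

for line in suspicious_lines(sample_log):
    print(line)
```

A keyword filter only narrows the search. Keep the full logs as well, since the lines just before an error often explain it.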

Pinpoint When the Problem Began (What Changed?)

When you are working with a system or application that has always worked perfectly well in the past, you have to determine what changed to cause the current problem to appear. This is why you need some sort of Change Management and Change Notification process within your organization: it tells you what changed and when.

"In solving a problem of this sort, the grand thing is to be able to reason backward." -- Sherlock Holmes, A Study In Scarlet

Ask:

  • When did the problem begin?
  • What activity was going on when the problem began?
  • What was the last change to the system prior to the problem starting?
  • When was the last change applied?
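If your change records carry timestamps, finding the last change before the problem began is a simple lookup. The sketch below assumes a list of (timestamp, description) pairs; the entries and dates are made up for illustration.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Change = Tuple[datetime, str]  # (when it was applied, what was changed)

def last_change_before(changes: List[Change], problem_start: datetime) -> Optional[Change]:
    """Return the most recent change applied at or before the problem began."""
    earlier = [change for change in changes if change[0] <= problem_start]
    return max(earlier, key=lambda change: change[0]) if earlier else None

# Hypothetical change log entries.
change_log = [
    (datetime(2024, 3, 1, 9, 0), "Replaced UPS battery"),
    (datetime(2024, 3, 4, 22, 30), "Edited logon script"),
    (datetime(2024, 3, 6, 8, 15), "Installed printer driver"),
]
problem_start = datetime(2024, 3, 5, 8, 0)

print(last_change_before(change_log, problem_start))
```

The most recent change is the first suspect, not a proven cause. It still has to be tested like any other hypothesis.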

Note Common Symptoms, Causes and Results

  • What do the symptoms have in common?
  • Is there some symptom that is unique to this problem?
  • When you do X, expecting Y, does Z also always happen?
  • If all computers share a common problem, then the probability is very good that the cause is also something they share. If only one computer has the problem, then the odds are good that the computer itself is the problem.
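The "what do they share?" question maps naturally onto set operations. The following sketch assumes you have recorded a set of attributes for each machine; the machine names and attributes are hypothetical.

```python
from typing import Dict, Set

def shared_only_by_affected(
    affected: Dict[str, Set[str]], healthy: Dict[str, Set[str]]
) -> Set[str]:
    """Attributes common to every affected machine and absent from all healthy ones."""
    if not affected:
        return set()
    common = set.intersection(*affected.values())
    seen_on_healthy = set().union(*healthy.values())
    return common - seen_on_healthy

# Hypothetical inventory: what each machine is connected to or has installed.
affected = {
    "pc1": {"switch-A", "vlan-10", "patch-42"},
    "pc2": {"switch-A", "vlan-20", "patch-42"},
}
healthy = {
    "pc3": {"switch-B", "vlan-10", "patch-42"},
}

print(shared_only_by_affected(affected, healthy))
```

Here both affected machines have "patch-42", but so does the healthy one, which makes the patch an unlikely cause. Only "switch-A" is shared by the affected machines alone.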

Interview the User

Ask the user what they are experiencing, but treat this information source with extreme caution. Most users are not technical people and so draw unwarranted conclusions about what is wrong. Users also lie on occasion, especially when they think they might be held responsible for whatever is broken or inoperative.

  • Try to reproduce the user's error
  • Is the problem a real technical failure, or is the computer not doing what the user thinks or expects it should be doing?

Sometimes the key to fixing a problem is to observe the actual failure as it occurs. It is often a good idea to turn on additional logging or diagnostic modes, run the command in verbose mode or use other diagnostic tools to gather information.

State the Problem

This is the process of reviewing all available information and getting a clear understanding of the perceived failure or dysfunction. Putting the problem into words clarifies exactly what the problem is. The Problem Statement should be very clear about what the problem is, and is not.

The problem statement should include as much of the following information as possible. If you do not have one or more of these, you have not gathered enough information.

  • When the problem started
  • Who is affected: one person, several people, or all users
  • Which specific service, function or equipment is down or impaired
  • What action or activity triggers the problem
  • Where the problem is observed

Troubleshooting is the science of figuring out the why.

Examples of good Problem Statements:

  • Since Tuesday, all users logging into the Active Directory network have been reporting that they cannot access their personal share because the drive icon is missing.
  • Today, around 1 pm, users in the billing department reported that the Internet Explorer browser crashes when browsing http://www.blah-blah-website.com.
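One way to check whether a problem statement is complete is to record its parts as fields and list whichever are still empty. This is a minimal sketch; the class and field names are assumptions chosen to mirror the five items listed earlier in this section.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class ProblemStatement:
    started: Optional[str] = None    # when the problem started
    affected: Optional[str] = None   # who is affected
    impaired: Optional[str] = None   # which service, function or equipment
    trigger: Optional[str] = None    # what action triggers the problem
    location: Optional[str] = None   # where the problem is observed

    def missing(self) -> List[str]:
        """Names of the fields that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

statement = ProblemStatement(
    started="Tuesday",
    affected="all users",
    impaired="personal share (drive icon missing)",
    trigger="logging into the Active Directory network",
)

print(statement.missing())
```

Any name that comes back from missing() is a cue to return to information gathering before forming a hypothesis.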

Form a Hypothesis

After collecting information and clearly stating exactly what the problem is, formulate a theory as to a possible cause. Phrase it so that it poses a question a test can answer with a clear yes or no, for example "The file server is down" (is it?).

  • PROBLEM STATEMENT: Since Tuesday, all users logging into the Active Directory network have been reporting that they cannot access their personal share because the drive icon is missing.
    • HYPOTHESIS 1: The file server is down
    • HYPOTHESIS 2: The file server's network connection is down
    • HYPOTHESIS 3: The file server's shared folder is no longer shared
    • HYPOTHESIS 4: The permissions on the share changed
    • HYPOTHESIS 5: The logon script that maps the drive is not running or working correctly.

NOTE: One roadblock to coming up with a good problem statement and good hypotheses is not understanding the hardware, technologies and protocols in use. Training is critical to providing superior support and swift troubleshooting.

 

Test the Hypothesis

Once you have stated the problem, devise a method to test your hypothesis of the problem. Each test you perform should follow these simple principles:

  • Change only one parameter or setting at a time.
  • Leave every other parameter exactly as it was, so that any difference in behavior can be attributed to that one change.
  • The test you devise should categorically eliminate at least one possible cause.
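The one-change-at-a-time rule can be sketched as a loop that always starts from the same baseline. Everything here is hypothetical: the settings, their values, and the problem_occurs function that stands in for actually reproducing the problem.

```python
from typing import Callable, Dict, Optional

Config = Dict[str, object]

def find_fixing_change(
    baseline: Config,
    candidates: Config,
    problem_occurs: Callable[[Config], bool],
) -> Optional[str]:
    """Try each candidate change alone against an untouched copy of the baseline."""
    for name, value in candidates.items():
        trial = dict(baseline)   # fresh copy, so earlier trials are reverted
        trial[name] = value      # change exactly one setting
        if not problem_occurs(trial):
            return name          # this single change made the problem go away
    return None

# Hypothetical settings and a stand-in for reproducing the problem.
baseline = {"mtu": 9000, "duplex": "half", "dns": "10.0.0.1"}
candidates = {"mtu": 1500, "duplex": "full"}

def problem_occurs(config: Config) -> bool:
    return config["duplex"] == "half"

print(find_fixing_change(baseline, candidates, problem_occurs))
```

Copying the baseline for every trial is the code equivalent of undoing a change that did not help before trying the next one.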

Observe Test Results & Draw Conclusions

After each test, note whether the change you made did or did not solve the problem. Record the results of your test, gather any new information from the system, application or user, and draw a conclusion as to whether the problem is solved or whether the change you made had any effect on the problem. Once you have drawn conclusions, you can devise new tests to eliminate other possible causes.

To quote the great Sherlock Holmes:

"Eliminate all other factors, and the one which remains must be the truth."
Chapter 1, p92; "The Sign of the Four" 1890

Translating this adage to modern geek-speak:

...when you've checked the basics and eliminated all configuration and operations-related possible causes, whatever remains is a vendor bug.

 

Repeat as Necessary

The entire troubleshooting process feeds into itself and must be repeated until a solution is found. This troubleshooting method relies on identifying possible causes and categorically eliminating each one until the true root cause of the problem is found. You cannot reliably find and fix the true root cause of the problem unless you apply the scientific method to your troubleshooting.
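The whole cycle can be summarized as a loop over hypotheses that records what has been eliminated. The sketch below reuses the five hypotheses from the file share example; the True/False test results are hard-coded assumptions, not findings from a real investigation.

```python
from typing import Callable, List, Optional, Tuple

# (hypothesis statement, test returning True when the hypothesis is confirmed)
Hypothesis = Tuple[str, Callable[[], bool]]

def troubleshoot(hypotheses: List[Hypothesis]) -> Tuple[Optional[str], List[str]]:
    """Test each hypothesis in turn; return the confirmed cause and the eliminated ones."""
    eliminated: List[str] = []
    for statement, is_confirmed in hypotheses:
        if is_confirmed():
            return statement, eliminated
        eliminated.append(statement)   # record the result before moving on
    return None, eliminated            # nothing confirmed: gather more information

# Hard-coded, hypothetical test outcomes for the file share example.
hypotheses = [
    ("The file server is down", lambda: False),
    ("The file server's network connection is down", lambda: False),
    ("The shared folder is no longer shared", lambda: False),
    ("The permissions on the share changed", lambda: False),
    ("The logon script that maps the drive is not running", lambda: True),
]

cause, eliminated = troubleshoot(hypotheses)
print(cause)
print(len(eliminated), "hypotheses eliminated")
```

When the function returns no cause, that is the signal to repeat the process: go back to gathering information, restate the problem, and form new hypotheses.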

 
