Network troubleshooting is the combined set of measures and processes used to identify, diagnose, and solve problems within a computer network. It’s a logical process that network engineers use to resolve network problems and improve network operations. Troubleshooting is also iterative: the more data you collect and analyze, the higher the likelihood of developing a correct hypothesis.
Example: A remote site recovers from a power outage. All of the devices come back online, so the event is perceived to be over. Yet for the next few days, performance in that office seems slow. Users there have a lot of VoIP call quality problems and dropped calls, and cloud services seem to crawl and suffer from disconnects. What happened? To fix the issue, you need to troubleshoot.
The Standard Troubleshooting Formula
Having a systematic approach to solving the problem will make you a faster and smarter troubleshooter, and in every network nightmare scenario, the faster, the better. At face value, the formula is simple: define, isolate, and solve. Once you have checked the basics, such as making sure it isn’t a physical-layer problem (is it plugged in?) and that the involved devices respond to ping requests, the real troubleshooting starts. Most troubleshooting involves a rule-in and rule-out process that narrows down the location and cause of the problem. In practice, that breaks down into these steps:
- Collect information
- Develop a hypothesis
- Test the hypothesis
- Implement a fix
- Verify the problem was solved
- Notify the users
- Document the fix
If I make this sound simple, it’s because it is, for a simple network. A complex IT environment is vastly more complicated, but the formula remains the same.
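The rule-in and rule-out process described above can be sketched as an ordered list of checks that stops at the first failure. This is a minimal illustration, not a real diagnostic tool: the check names, their order, and the stubbed results are all assumptions made for the example, and real probes (link status, ping, and so on) would replace the lambdas.

```python
from typing import Callable, List, Optional, Tuple

# Each check is a (name, function) pair; the function returns True when that
# part of the path looks healthy. Names and ordering are illustrative only.
Check = Tuple[str, Callable[[], bool]]


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails.

    Returns None when every check passes, which means the fault lies
    outside what these checks cover and more information is needed.
    """
    for name, check in checks:
        if not check():
            return name
    return None


# Stubbed results stand in for real probes in this sketch.
checks = [
    ("cable plugged in / link up", lambda: True),
    ("device answers ping", lambda: True),
    ("DNS resolves", lambda: False),
    ("application port reachable", lambda: True),
]
print(first_failing_check(checks))
```

Ordering the checks from the physical layer upward means each passing check rules out one more part of the path before the next one runs.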
What Type of Information Should be Collected?
When collecting information on the problem, it is critical to establish its scope: which parts of the network are involved, and which parts can be safely excluded. Otherwise, you might be stuck forever collecting and analyzing information that is unrelated to the problem.
Start by asking yourself the necessary questions to define the scope of the problem:
- Who is having the problem (one user, multiple users)?
- Is it just one application, or all applications?
- Has anything changed?
- Has this happened before? If so, when?
- Can we reproduce the problem?
- Was anything done differently?
Once the issue is defined, try to isolate it. This involves a process of elimination. If a workstation is having connection difficulties, determine if the problem is isolated to that specific workstation, all workstations in that physical location, or if it’s network-wide. If it’s local, you’ve eliminated a ton of unnecessary work, and you’re much closer to isolating the issue. Even if you haven’t yet solved the problem, you’ve now saved valuable time.
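The process of elimination above can be expressed as a small scope classifier. This is a sketch under stated assumptions: the inventory structure (`hosts_by_site`), the host names, and the category labels are hypothetical, and every affected host is assumed to appear in the inventory.

```python
from typing import Dict, Set


def classify_scope(affected: Set[str], hosts_by_site: Dict[str, Set[str]]) -> str:
    """Classify a problem by how widely the affected hosts are spread.

    Assumes every affected host is listed under exactly one site.
    """
    if not affected:
        return "no affected hosts reported"
    if len(affected) == 1:
        return "single workstation"

    all_hosts: Set[str] = set().union(*hosts_by_site.values())
    if all_hosts and affected >= all_hosts:
        return "network-wide"

    # Sites that contain at least one affected host.
    sites_hit = {site for site, hosts in hosts_by_site.items() if hosts & affected}
    if len(sites_hit) == 1:
        site = next(iter(sites_hit))
        if affected >= hosts_by_site[site]:
            return f"site-wide: {site}"
        return f"several workstations at {site}"
    return "multiple sites, partial"


# Hypothetical inventory for illustration.
inventory = {"hq": {"hq-1", "hq-2"}, "branch": {"br-1", "br-2"}}
print(classify_scope({"br-1", "br-2"}, inventory))
```

A "site-wide" result points toward shared equipment at that location, such as the site's router or WAN link, rather than any individual workstation.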
If the problem affects only one application, that is a valuable clue. For example, if a user has no problems accessing web applications but is having VoIP/UC call quality problems, the cause may be queueing or packet loss, or an issue with a voice gateway or SIP trunk.
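To rule packet loss in or out for a voice stream, one rough approach is to look for gaps in the sequence numbers of the packets that actually arrived. The sketch below assumes you already have those sequence numbers from a capture; the function name is hypothetical, and it assumes the numbers increase by one per packet and do not wrap within the sample.

```python
from typing import Iterable


def packet_loss_percent(received_seq: Iterable[int]) -> float:
    """Estimate loss from gaps in a stream's received sequence numbers.

    Assumes sequence numbers increase by one per packet and do not wrap
    within the sample. Duplicates are counted once, and packets lost
    before the first or after the last received packet are not visible.
    """
    seen = set(received_seq)
    if not seen:
        return 0.0
    expected = max(seen) - min(seen) + 1
    lost = expected - len(seen)
    return 100.0 * lost / expected


# Packets 4 and 7 never arrived: 2 lost out of 10 expected.
print(packet_loss_percent([1, 2, 3, 5, 6, 8, 9, 10]))
```

Sustained loss on the voice stream while web traffic appears healthy would support the packet-loss hypothesis, since real-time audio cannot wait for retransmissions the way a web page can.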
Troubleshooting in a Complex Network
In a data center, the sheer number of technologies that could be behind a simple support ticket can make your head spin. There are times when troubleshooting will account for up to ninety percent of a network admin’s time. No one wants to spend their time continually putting out one fire only to find another, but there isn’t always a choice. Effective troubleshooting tools and procedures enable you to respond quickly to those crises and keep your network operating as designed.
Faster Troubleshooting is Better Troubleshooting
All engineers engage in troubleshooting, but getting to the root cause of a problem is key. This is a different activity from simply monitoring a network, and it requires different information to achieve its goal. Organizations that rely solely on monitoring software often struggle when it comes to troubleshooting issues in their environment.
Troubleshooting a network can be a manual process, or it can be automated. Network troubleshooting automation tools help you swiftly identify the root cause and its location, essentially completing the first two (and most time-consuming) steps so that you can begin working on the solution.
So, what caused the performance problems that followed the remote site’s power outage? When the Internet router came back online, its WAN link had a duplex mismatch. The resulting packet loss was significant enough to cause the slowdowns and call quality issues.
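A mismatch like this can be caught by comparing the operational settings on both ends of a link. The sketch below assumes you have already collected speed and duplex from each device; the `PortSettings` structure, the port names, and the values are hypothetical, and how you gather them depends on the equipment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortSettings:
    """Operational settings for one end of a link (hypothetical structure)."""
    name: str
    speed_mbps: int
    duplex: str  # "full" or "half"


def link_mismatches(a: PortSettings, b: PortSettings) -> List[str]:
    """Return a description of each setting that differs between two link ends."""
    problems = []
    if a.duplex != b.duplex:
        problems.append(
            f"duplex mismatch: {a.name}={a.duplex}, {b.name}={b.duplex}"
        )
    if a.speed_mbps != b.speed_mbps:
        problems.append(
            f"speed mismatch: {a.name}={a.speed_mbps}, {b.name}={b.speed_mbps}"
        )
    return problems


# Illustrative values only: the router came back up at half duplex.
router = PortSettings("router-wan", 100, "half")
provider = PortSettings("provider-handoff", 100, "full")
for problem in link_mismatches(router, provider):
    print(problem)
```

Running a comparison like this after any power event or reboot is one way to catch the problem before users report it, because both devices will still show the link as up.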