Network CSI Part Two: Return of the Packet Loss
June 20, 2017
In part one of this series, we explained how network troubleshooting is a bit like investigating the scene of a murder. The more clues you have to work with, the easier it is to solve the case. But when you have just a few clues it can be impossible.
To illustrate, we described a Fortune 500 cosmetics firm that was experiencing packet loss on its network but could not determine where the issue was occurring or why. The customer eventually drilled down into the network using TotalView and discovered misconfigured QoS queues on a core MPLS router.
The customer contacted its MPLS provider, the problem was fixed, and communications returned to normal — but not for long.
Packet loss started happening again…
Not much time passed before the customer started experiencing call quality issues again. This time, it took the form of latency and jitter during calls.
The customer knew that packet loss was to blame, but had no idea where or why it was occurring. The QoS queues, it should be noted, were found to be in proper working order.
Now the customer was really confused. As we explained in part one, this was a very large and complex network, and there were multiple service providers involved. So the problem could have been anywhere, for any number of reasons.
Once again, we invite you to pretend you are manually troubleshooting this issue. What do you think the culprit could be? Where would you look for answers?
Again, the customer contacted us for help. Since jitter and latency were occurring, we suspected that the issue was happening at some point in transit. And so this time, we used TotalView to drill down into the customer’s network devices.
Once again, we were right.
We discovered that one of the routers, supplied by the customer’s MPLS WAN provider, was outdated and massively under-powered. The router was running at very high CPU utilization levels. As a result, packets were being delayed and occasionally dropped along the path, causing the loss and latency.
Note: In many cases, MPLS WAN providers want to keep their hardware costs low for deployments, and utilize lower-end equipment than what might be appropriate for the environment.
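For readers who want to try this kind of check themselves, here is a minimal sketch of flagging a device that runs hot for a sustained period. It is not how TotalView works internally. The device names, sample values, the 80% threshold, and the `find_overloaded` helper are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical CPU utilization samples (percent), one list per device.
# Device names and values are made up for illustration.
CPU_SAMPLES = {
    "core-router-1": [22, 31, 27, 35, 30, 28],
    "mpls-edge-router": [91, 96, 88, 99, 94, 97],
    "access-switch-3": [12, 15, 11, 40, 13, 14],
}


def find_overloaded(samples, threshold=80.0, min_fraction=0.5):
    """Return (device, average CPU) pairs for devices whose CPU is above
    `threshold` in at least `min_fraction` of their samples, sorted by
    average CPU with the busiest device first."""
    flagged = []
    for device, values in samples.items():
        if not values:
            continue  # no data for this device, nothing to judge
        busy = sum(1 for v in values if v > threshold)
        if busy / len(values) >= min_fraction:
            flagged.append((device, mean(values)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for device, avg in find_overloaded(CPU_SAMPLES):
        print(f"{device}: average CPU {avg:.1f}%")
```

Requiring the threshold to be exceeded in a fraction of the samples, rather than in a single one, is a deliberate choice: a brief spike is normal, while a router that stays busy is the kind that delays and drops packets.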
The customer was surprised, and rightfully so! The company was equipped with lightning-fast endpoints that were up to date with the latest software. But they were essentially useless because they could not send and receive data in real time: All it took was a small data bottleneck to seriously impact VoIP calls.
One of the important lessons here is that the CPU spikes on the MPLS edge routers might have taken weeks or months to discover without a tool in place to analyze the right metrics. Once the customer had total visibility into the network, and knew where to look for the root cause, they were able to solve the problem and move on to more important business.
So, will the customer’s network remain in proper working order? Stay tuned for part three of this series!