I am Winston, your typical, overly responsible, undervalued engineer—the one with the beard. I try to bring calm and peace to my company and everyone I work with. My company is growing, full of people, cubicles and continuous problems and challenges.
The Tuesday started out pretty quiet. As a Network Admin, I had only a few low-priority tickets to deal with, so it had been a delightful day so far.
While I was getting a coffee refill in the break room, the Telecom VP walked up and asked if I was aware of the problems occurring at the call center. My chest tightened as I realized that my day was going from delightful to suck in 5 seconds.
“What problems?” I asked timidly.
“Most of the agents are complaining about call quality problems, even the ones working from home,” retorted the VP.
I asked if a ticket had been opened on the issue. The VP told me he wanted to find out personally what was happening, and that he would be visiting the CIO in a few moments. Another stressor, but at least I knew before the CIO did.
I raced back to my desk and checked my network monitoring system. This was an old-school system that was designed a few decades ago to monitor servers and ping network devices and get utilization stats on WAN links. It indicated that everything was healthy.
At this point, the CIO was gathering everyone in the large conference room to discuss the issue and start triage, since it was affecting a large part of the business.
The Telecom VP was there along with the rest of the networking team. The meeting started with the Telecom VP talking about the extent of the problem, and how much it was costing the business per minute.
It seemed that almost all of the call center agents were having call quality problems with customers. My mind raced to think of what those users and customers had in common. Sadly, my network diagram was at least 5 months out of date because I hadn’t had time to update it manually.
The meeting quickly devolved into different engineers guessing at possibilities, only to have their ideas dismissed because the clues didn’t match up.
The Telecom VP then got a call and told the group that the problem seemed to have mysteriously disappeared, and everything was working smoothly again. I breathed a sigh of relief, as we now had a reprieve from the highly stressful situation and proper time to research what might have happened.
The only drawback was that, with the problem gone, it would be hard to spot, provoke, or prove that any one particular thing was the cause.
The CIO asked for a status update by the end of the day on what could have caused the problem, in case it returned. My brain came up with hundreds of possibilities, with over a thousand locations where the problem could exist, and I figured I needed lunch to tame my thoughts and focus on the most likely sources.
When I returned from lunch, I noticed the Telecom VP and my CIO loitering near my cubicle and figured that the problem had returned. I wasn’t in any better position to tell them what was happening, and I needed to come up with some guesses to help make them go away, so I could work on the problem without them standing right next to me while I worked.
“It could be the new router’s configuration that we implemented this morning,” I guessed. “We can revert back to the previous configuration and see if the problem goes away.”
The VP looked displeased. “You’re responsible for the network. Don’t you know what’s going on?” he retorted.
I logged in, made the changes, and asked them to check with the agents to see if their quality problems had improved. In my heart, I really didn’t think that the router configuration would make a difference, but it was the one thing that had changed in the network, so it was at least somewhat suspect.
I then went back to the Telecom VP and asked if things were better. He said that they were, but that we had dropped a bunch of calls and now had to deal with the repercussions.
I notified the CIO that the problem was solved and that the Telecom VP was informed.
As I headed back to my desk, I started going through the new router configuration and realized that there was nothing in the config that would have caused a call quality problem. I started to manually diagram the network configuration and then realized that this router did not even carry any VoIP or signaling traffic. The problem had not actually been solved, and it was bound to happen again.
Wednesday was uneventful, but by Thursday morning the Telecom VP and CIO wanted to know what had happened, and why the problem had returned. They wanted to know if the router configuration was “acting up” again. I had to admit that I had determined that the router’s configuration had no effect on the problem, that the problem had disappeared on its own, and that it was not solved yet.
Another meeting was called and everyone on the team had their input, but nobody had any new leads. The executives were worried that nobody had any new ideas. “Why doesn’t our monitoring software tell us what’s broken, and where?” they asked.
I had to tell them that it just wasn’t designed to do that – it could tell us about outages and show us utilization of our Internet links, but not much more than that.
It was at this point that the executives told us to get a better monitoring solution – one that COULD tell us what breaks, where, and why, so problems like this wouldn’t happen again.
On a friend’s recommendation, I deployed PathSolutions TotalView on the network. It was pretty quick and easy to set up. Within the first hour, it was telling us things about our environment that the network equipment knew about, but our old monitoring software had no awareness of.
It found bad cables that were dropping packets, interfaces that were running at incorrect speeds, and route paths that were sub-optimal.
And the most important piece was that it found our session border controller running at 100 Mbps HALF-duplex. Whenever traffic levels got too high on this link, it would start to have collisions, which caused calls to have poor quality and eventually drop. With low traffic levels, everything was fine.
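For the curious, the kind of check that would have caught this can be sketched in a few lines of Python. This is only my own illustration of the idea, not how TotalView works internally: the InterfaceStats record, the flag_half_duplex function, the interface names and the counter values are all made up for the example, and a real tool would pull this data from the network equipment itself.

```python
from dataclasses import dataclass


@dataclass
class InterfaceStats:
    """Hypothetical per-interface record; all field names are assumptions."""
    name: str
    speed_mbps: int
    duplex: str      # "half" or "full"
    collisions: int  # collision counter reported by the device


def flag_half_duplex(interfaces):
    """Return (name, reason) pairs for every interface running half duplex."""
    findings = []
    for iface in interfaces:
        if iface.duplex == "half":
            reason = (
                f"running {iface.speed_mbps} Mbps half duplex, "
                f"{iface.collisions} collisions counted"
            )
            findings.append((iface.name, reason))
    return findings


# Made-up sample data, for illustration only.
sample = [
    InterfaceStats("core-uplink", 1000, "full", 0),
    InterfaceStats("sbc-port", 100, "half", 48213),
]

for name, reason in flag_half_duplex(sample):
    print(f"{name}: {reason}")
```

The point of the sketch is that the equipment already reports its duplex setting and collision counters. Our old monitoring system simply never asked for them.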
We got the problem solved the same day we deployed the solution, and then started to proactively fix all of the other problems in our network. By the end of the first week of deployment, both the Telecom VP and CIO had dropped by and asked what I had done to improve the network’s speed, performance and quality. I told them about all of the fixes that I was able to implement as a result of knowing everything the network equipment knew.
I only wish I had had TotalView years ago, so I wouldn’t have had to spend the last few years doing manual troubleshooting.