We usually deploy monitoring software to gain visibility into our environment. Typically, the monitoring software will poll network interfaces every 5 minutes and report bandwidth utilization: If utilization is below 90%, then everything is fine, right? Not so fast!
If you are looking at bandwidth every 5 minutes, then what you are looking at is the average utilization over the last 5 minutes.
So if your interface shows 20% utilization, it could mean that you used 20% continuously over the last 5 minutes (which is fine in most circumstances), or it could mean that the link ran at 100% for 1 minute and then sat at 0% for the other 4 minutes. The latter would definitely lead to many unhappy users!
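To make the averaging effect concrete, here is a minimal sketch of how a poller might turn two octet-counter readings into a utilization figure. The function name, the 1 Gbps link speed, and the traffic numbers are made up for illustration, and the sketch assumes the counter neither wraps nor resets between the two readings.

```python
def average_utilization(octets_start, octets_end, interval_s, link_bps):
    """Percent utilization implied by two octet-counter readings taken interval_s apart."""
    bits_sent = (octets_end - octets_start) * 8
    return 100.0 * bits_sent / (interval_s * link_bps)


LINK_BPS = 1_000_000_000        # hypothetical 1 Gbps link
BYTES_PER_SEC = LINK_BPS // 8   # bytes the link carries per second at line rate
POLL_S = 300                    # 5-minute poll interval

# Scenario A: a steady 20% load for the full 5 minutes.
steady_octets = BYTES_PER_SEC * POLL_S * 20 // 100
# Scenario B: line rate for 60 seconds, then idle for the remaining 240 seconds.
bursty_octets = BYTES_PER_SEC * 60

steady_pct = average_utilization(0, steady_octets, POLL_S, LINK_BPS)
bursty_pct = average_utilization(0, bursty_octets, POLL_S, LINK_BPS)
print(steady_pct, bursty_pct)
```

Both scenarios move the same number of octets within the poll interval, so the poller reports the same figure for each and cannot tell them apart.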
You might think that the solution would be to poll the information more frequently. Polling traffic used to be a concern when T1 circuits with 1.5 Mbps of bandwidth were the WAN standard, but it is not as much of an issue anymore with 10 Mbps and 100 Mbps WAN circuits. Still, if you polled every 30 seconds instead of every 5 minutes, you would generate 10x the amount of polling traffic. Bandwidth used for the polling may not be your limiting factor, but the CPU utilization of the switch or router may be.
Even with 30-second polls, you may not see the true picture of what is going on. A gigabit circuit that spikes to 100% utilization for 5 seconds and then sits at 0% for the remaining 25 seconds will still show only 16.67% utilization, even though an application was slowed down during the spike.
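The same arithmetic can be sketched for any poll interval. The helper below is hypothetical and works on an invented list of per-second utilization samples, which a real poller would not have; it is only meant to show how the reported peak depends on the window you average over.

```python
def window_averages(per_second_pct, window_s):
    """Average per-second utilization samples over consecutive windows of window_s seconds."""
    averages = []
    for start in range(0, len(per_second_pct), window_s):
        chunk = per_second_pct[start:start + window_s]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Invented traffic: 5 seconds at line rate, then 25 idle seconds.
samples = [100.0] * 5 + [0.0] * 25

thirty_second_view = window_averages(samples, 30)   # what a 30-second poll would report
five_second_view = window_averages(samples, 5)      # what a 5-second poll would report

print(round(thirty_second_view[0], 2))
print(max(five_second_view))
```

Shrinking the window does reveal the spike, but only once the window is about as short as the burst itself, which is rarely practical to poll.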
So the real trick is to be aware of high utilization spikes on links without having to poll more often than required.
In this case, we want to look at what the network does when it sees 100% utilization on a link. When a link hits 100% saturation, it has to buffer packets that can’t be transmitted. Those buffered packets will be transmitted as soon as bandwidth is available. If the link is heavily saturated for a longer period of time, then the buffers will fill, and then the interface will start to discard packets when there are no more buffers available.
Tracking outbound buffer usage on interfaces is the easiest way to determine when you run out of bandwidth. You can check this by looking at the SNMP dot3StatsDeferredTransmissions counter. If the counter increments, then the interface had to buffer a packet that could not be transmitted because the physical medium was busy.
Tracking outbound packet discards is the easiest way to determine the severity of the overage. You can check this by looking at the SNMP ifOutDiscards counter for the interface. This counter will increment when packets could not be transmitted, and there was no ability to place them in a buffer (perhaps because the buffers were full).
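As a rough sketch of how the two counters could be compared between polls, the snippet below works on plain dictionaries of counter values rather than live SNMP queries. The dictionary keys, function names, and status labels are invented for illustration. Both counters are 32-bit SNMP counters, so the sketch treats a smaller new value as a wrap; it does not try to detect a counter reset such as a device reboot.

```python
COUNTER32_MODULUS = 2 ** 32  # SNMP Counter32 values wrap back to 0 after 2**32 - 1


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Increase between two readings of a wrapping counter."""
    return (current - previous) % modulus


def classify_interval(prev, curr):
    """Label one poll interval from two snapshots of the interface counters.

    Each snapshot is a dict with hypothetical keys:
      "deferred" -> dot3StatsDeferredTransmissions
      "discards" -> ifOutDiscards
    """
    deferred = counter_delta(prev["deferred"], curr["deferred"])
    discards = counter_delta(prev["discards"], curr["discards"])
    if discards > 0:
        return "discarding"   # packets were dropped during the interval
    if deferred > 0:
        return "buffering"    # packets had to wait, but none were dropped
    return "ok"


# Invented snapshots; the deferred counter wraps past 2**32 between polls.
earlier = {"deferred": 4294967290, "discards": 10}
later = {"deferred": 4, "discards": 10}
print(classify_interval(earlier, later))
```

Checking discards first reflects the ordering described above: deferrals show that the link ran out of bandwidth, while discards show that the overage was severe enough to lose packets.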
These two SNMP variables may or may not be exposed through the device’s command line, but should always be available via direct SNMP queries.
The benefit of using the SNMP dot3StatsDeferredTransmissions and ifOutDiscards counters over standard utilization tracking is that they can help find microbursts of traffic that might create congestion for a second or two before clearing up. These microbursts can cause slowdowns or glitches in the network for applications that don’t deal well with resource contention, such as VoIP, video, or VDI.
See also: What is a Network Microburst