Diagnosing problems with ping

The two tools most commonly used to diagnose a network problem, and almost always the first ones reached for, are traceroute and ping. However, the results they return are often misinterpreted, which leads to the wrong conclusion.

Let’s take the ping utility specifically. The common mistake is to assume that whatever result ping returns is caused by the target of the ping. For example, if ping gets no response, the conclusion is that the site is down. Or if there is packet loss or long round-trip times, the conclusion is that the destination address has a problem. While either could be the case, more often than not these are completely wrong conclusions to draw.

Common causes of this misinterpretation are:

1. Ping sends a packet to the destination address, and that packet typically passes through several other points on the network to get there. A problem at any of those points will cause a non-response to the ping query.
2. In many cases, websites and other servers are protected by firewalls, and many, if not most, firewalls block ping packets. So while web traffic may reach the site, ping packets may not.
3. The ping packet has a source (the system initiating the ping) as well as a destination. It may be that the source does not have a correct route to the destination, or the destination does not have a correct return route to the source. This could be due to specific firewall rules, an error in the route tables somewhere along the data path, or a routing policy deliberately set to block access.

The traceroute command can help detect whether the first or the third cause is behind the problem. It has its own issues, but more on that later. A positive result from telnet or tcptraceroute while ping is failing points to the second cause, a firewall blocking ping, and rules out the site being down.

Telnet can be used to open a connection on any port, not just the default telnet port. A successful telnet connection where the ping has failed shows that the server is up and reachable, and that the ping packets are being filtered, typically by a firewall. Here’s an example:

$ ping cisco.com

PING cisco.com (198.133.219.25) 56(84) bytes of data.

--- cisco.com ping statistics ---

6 packets transmitted, 0 received, 100% packet loss, time 5008ms

$ telnet cisco.com 80

Trying 198.133.219.25...

Connected to cisco.com.

Escape character is '^]'.

You can see that every ping packet was lost, but telnet to port 80 was able to connect to the server.

The same is true of tcptraceroute on port 80:

$ tcptraceroute cisco.com 80

traceroute to cisco.com (198.133.219.25), max 30 hops, 40-byte packets

1 192.168.6.254 (192.168.6.254) 8.557ms 10.624ms *

….

15 cisco.com (198.133.219.25) 289.162ms 237.972ms 242.171ms
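The telnet check above can also be scripted. What follows is a minimal sketch, assuming Python 3 and its standard socket module; the function name tcp_port_open and the 3 second default timeout are illustrative choices, not anything taken from the tools shown above. It does the one thing telnet did in the example: try to complete a TCP connection to a given port.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and completes the TCP handshake,
        # which is what telnet has done by the time it prints "Connected to ...".
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks
        # and name resolution failures.
        return False
```

Calling tcp_port_open("cisco.com", 80) while ping reports 100% loss would reproduce the test above. A True result means the server is reachable and something is filtering ping. A False result on its own proves little, since the port may simply be closed or filtered as well.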

Another common mistake when using ping is to assume that the results of only a few ping tests are indicative of the condition of a data path. They may be, but such a conclusion can only be trusted over a statistically significant sample size. Also, to be truly accurate, it is necessary to know how the packets that were lost or fell outside the acceptable response time are distributed over the test period.

For example, a single ping test of four packets in which one packet is dropped cannot meaningfully be used to conclude that there is 25% packet loss on that circuit. Ten thousand pings over several hours with 5% lost has much more meaning. However, consider a test that ran for 24 hours during which the target site was down for one hour. The 100% loss for that one hour shows up as an overall packet loss of roughly 4% (1/24) for the 24 hours, which looks much the same as a steady loss of a few percent.
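To put a number on how little four packets can tell you, here is a small sketch that computes an approximate 95% confidence interval for the true loss rate using the Wilson score interval. This is a standard statistical formula that I am bringing in for illustration, and the function name wilson_interval is my own. It assumes each packet is lost independently of the others, which real networks often violate because loss tends to come in bursts, so treat the result as a rough guide only.

```python
import math
from typing import Tuple


def wilson_interval(lost: int, sent: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for the true loss rate (Wilson score).

    z = 1.96 corresponds to a 95% interval. Assumes independent probes.
    """
    if sent <= 0 or not 0 <= lost <= sent:
        raise ValueError("need sent > 0 and 0 <= lost <= sent")
    p = lost / sent                      # observed loss rate
    denom = 1 + z * z / sent
    centre = (p + z * z / (2 * sent)) / denom
    half = z * math.sqrt(p * (1 - p) / sent + z * z / (4 * sent * sent)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # The two cases discussed in the text: 1 of 4 lost, and 5% of 10,000 lost.
    for lost, sent in [(1, 4), (500, 10_000)]:
        low, high = wilson_interval(lost, sent)
        print(f"{lost}/{sent} lost: true loss plausibly {low:.1%} to {high:.1%}")
```

Under these assumptions, one lost packet out of four is consistent with a true loss rate anywhere from roughly 5% to roughly 70%, while 500 lost out of 10,000 narrows it to about 4.6% to 5.4%.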

Therefore, it is important to review the ping test log and see if the distribution of any lost packets is regular or limited to a specific period, before any real conclusion can be drawn.
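As a sketch of what reviewing the distribution can look like, the snippet below takes a ping log reduced to one True/False value per probe (True meaning a reply came back) and reports the loss rate per time window next to the overall figure. The data layout, the function names, and the one-probe-per-minute schedule are assumptions made for the example.

```python
from typing import List, Sequence


def loss_rate(results: Sequence[bool]) -> float:
    """Fraction of probes that got no reply (False means lost)."""
    if not results:
        raise ValueError("no results to summarise")
    return sum(1 for ok in results if not ok) / len(results)


def loss_by_window(results: Sequence[bool], window: int) -> List[float]:
    """Loss rate for each consecutive block of `window` probes."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [loss_rate(results[i:i + window])
            for i in range(0, len(results), window)]


if __name__ == "__main__":
    # Simulated 24 hours at one probe per minute; the site is down for hour 10.
    day = [True] * (24 * 60)
    day[10 * 60:11 * 60] = [False] * 60

    print(f"overall loss: {loss_rate(day):.1%}")
    for hour, rate in enumerate(loss_by_window(day, 60)):
        if rate > 0:
            print(f"hour {hour:02d}: {rate:.0%} loss")
```

The overall figure hides the outage, while the per-hour breakdown shows a single hour at 100% loss and every other hour clean.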

A third common mistake is to assume that the result obtained is caused by the destination site. For example, let’s say a 5% packet loss was found when pinging 3com.com. This in no way indicates that the problem lies with that site. Rather, the problem could be anywhere along the data path to that site, including the source (my own computer):

$ traceroute 3com.com

traceroute to 3com.com (192.136.34.41), max 30 hops, 40-byte packets

1 192.168.6.254 (192.168.6.254) 10.285ms 13.316ms 14.440ms

2 129.1.233.220.exetel.com.au (220.233.1.129) 132.994ms 135.387ms 136.312ms

3 241.0.233.220.exetel.com.au (220.233.0.241) 137.192ms 141.296ms 162.018ms

4 10.0.1.1 (10.0.1.1) 168.530ms 174.358ms 176.908ms

5 38.2.233.220.exetel.com.au (220.233.2.38) 177.729ms 188.233ms 189.122ms

6 359-ge-0-0-0.GW5.SYD2.ALTER.NET (203.166.92.57) 197.691ms 85.598ms 156.625ms

7 0.so-0-2-0.XR3.SYD2.ALTER.NET (210.80.33.189) 158.108ms 159.430ms 160.260ms

8 0.so-4-3-0.IR1.LAX12.ALTER.NET (210.80.50.249) 305.124ms 305.952ms 306.775ms

9 0.so-5-0-0.IL1.LAX9.ALTER.NET (152.63.48.65) 313.518ms 321.047ms 321.868ms

10 0.so-5-0-0.XT1.SAC1.ALTER.NET (152.63.0.98) 405.111ms 406.359ms 407.241ms

11 GigabitEthernet6-0-0.GW9.SAC1.ALTER.NET (152.63.55.73) 331.091ms 337.600ms 341.527ms

12 eds-gw.customer.alter.net (63.114.61.154) 357.930ms 287.765ms 310.755ms

13 205.141.209.3 (205.141.209.3) 311.606ms 312.502ms 313.587ms

14 10.231.1.2 (10.231.1.2) 341.277ms 342.101ms 342.931ms

15 205.141.209.133 (205.141.209.133) 344.380ms 345.861ms 346.689ms

16 ip-192-136-34-41.ip.3com.com (192.136.34.41) 261.317ms 266.998ms 346.689ms

You can clearly see the number of hops the data needs to go through. In this case, there is no evidence of any problem along the data path. But if the traceroute looked like this:

$ traceroute 3com.com

traceroute to 3com.com (192.136.34.41), max 30 hops, 40-byte packets

1 192.168.6.254 (192.168.6.254) 10.285ms 13.316ms 14.440ms

2 129.1.233.220.exetel.com.au (220.233.1.129) 132.994ms 135.387ms 136.312ms

3 241.0.233.220.exetel.com.au (220.233.0.241) 137.192ms 141.296ms 162.018ms

4 10.0.1.1 (10.0.1.1) 168.530ms 174.358ms 176.908ms

5 38.2.233.220.exetel.com.au (220.233.2.38) 177.729ms 188.233ms 189.122ms

6 359-ge-0-0-0.GW5.SYD2.ALTER.NET (203.166.92.57) 197.691ms 85.598ms 156.625ms

7 0.so-0-2-0.XR3.SYD2.ALTER.NET (210.80.33.189) 758.108ms 759.430ms *

8 0.so-4-3-0.IR1.LAX12.ALTER.NET (210.80.50.249) * * 806.775ms

9 0.so-5-0-0.IL1.LAX9.ALTER.NET (152.63.48.65) 813.518ms * 721.868ms

10 0.so-5-0-0.XT1.SAC1.ALTER.NET (152.63.0.98) * 1406.359ms 1007.241ms

11 GigabitEthernet6-0-0.GW9.SAC1.ALTER.NET (152.63.55.73) 731.091ms 737.600ms 1341.527ms

12 eds-gw.customer.alter.net (63.114.61.154) 357.930ms * *

13 205.141.209.3 (205.141.209.3) 811.606ms 812.502ms 813.587ms

14 10.231.1.2 (10.231.1.2) 741.277ms 742.101ms 1342.931ms

15 205.141.209.133 (205.141.209.133) * * 746.689ms

16 ip-192-136-34-41.ip.3com.com (192.136.34.41) 761.317ms 866.998ms *

It would be reasonable to conclude that there is some serious problem between hop 6 and hop 7, well before the destination site, that is causing the ping test to return a lossy result.
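Reading a trace like this by eye can be turned into a rough script. The sketch below parses traceroute output in the format shown above and reports the first hop where a probe was lost or where the best response time jumps sharply compared with the previous hop. The function names and the 300 ms jump threshold are arbitrary assumptions chosen to suit this example, not a standard. A lost probe at a single intermediate hop can also just mean that router limits how often it replies, so the result is a hint about where to look, not a verdict.

```python
import re
from typing import List, Optional, Tuple

Hop = Tuple[int, List[Optional[float]]]

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")        # "7 host (ip) 1.2ms * 3.4ms"
PROBE = re.compile(r"(\d+(?:\.\d+)?)\s*ms|\*")     # a time in ms, or "*" for no reply


def parse_hops(text: str) -> List[Hop]:
    """Turn traceroute output into (hop number, [rtt in ms or None]) pairs."""
    hops: List[Hop] = []
    for line in text.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header, prompt, blank or elided lines
        # Only look after the "(ip)" part so digits in host names are ignored.
        tail = match.group(2).rsplit(")", 1)[-1]
        rtts = [float(m.group(1)) if m.group(1) else None
                for m in PROBE.finditer(tail)]
        hops.append((int(match.group(1)), rtts))
    return hops


def first_suspect_hop(hops: List[Hop], jump_ms: float = 300.0) -> Optional[int]:
    """First hop with a lost probe or a large jump in best response time."""
    prev_best: Optional[float] = None
    for hop, rtts in hops:
        answered = [r for r in rtts if r is not None]
        lost = len(answered) < len(rtts)
        jumped = (prev_best is not None and bool(answered)
                  and min(answered) - prev_best > jump_ms)
        if lost or jumped:
            return hop
        if answered:
            prev_best = min(answered)
    return None
```

With this heuristic and threshold, the first trace in this article yields None and the second yields hop 7, which matches the reading given above.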

To conclude, we can see that ping:

1. is a useful tool to indicate where a problem may be
2. should be used in combination with other tests to eliminate false positives
3. should not be relied on for small, isolated tests
4. is a good indicator of problems over statistically significant sample sizes