Category: technology

The Case of the Recurring Network Timeout

Crossposted from the now defunct Ticketmaster tech blog (RIP, and co-authored with Audyn Espinoza

At Ticketmaster we’re passionate about monitoring our production systems. As a result, we occasionally come across interesting issues affecting our services that would otherwise go unnoticed. Unfortunately, monitoring only indicates the symptoms of what’s wrong and not necessarily the cause. Digging in deeper and getting to the root cause is a whole different ball game. This is one such example.

This story starts out with the observation that one of our web service calls had an unusually high number of timeouts. It was particularly unusual because the web service in question typically responds in about 50 ms, and the client times out at 1s. To add to that, the metrics at the web service level was still reporting a 99th percentile response time in the 50ms range. The issue had to be in the network between the client and the service.

We took a closer look at the metrics on the client side and a pattern emerged that we had missed earlier:

Time chart of web service response times as observed at the client

Time chart of web service response times as observed at the client

For a given cluster, the timeouts were occurring every minute at the same second mark. For example, on cluster A, timeouts would occur at 5:02:27, 5:03:27, 5:04:27, and at 5:02:55, 5:03:55, 5:04:55 on another cluster. While perplexing and a great data point, we were still nowhere close to the root cause. It was time for tcpdump.

We started snooping the chatter on the client node and on a web service node and waited patiently for the second mark where the timeouts occur.

tcpdump -i eth0 -s 0 -xX port 8080 src <client_address> and dst <service_address>

Packet trace captured at clientPacket trace captured at client

Packet trace captured at serverPacket trace captured at server

Notice that the client sends the SYN packet, waits for the SYN/ACK from the server doesn’t get it and re-sends the SYN packet 3s later (3s is the default timeout for re-transmission). However, the server behaves correctly and sends the SYN/ACK in a timely manner, which the client does not receive. It was time to bring in the big guns – enter Audyn Espinoza, our resident network expert.

Based on the information already gathered, we determined that the problem occurred specifically when traversing the hardware firewall devices. This was true whether we tested within the same datacenter or between data centers, which ruled out the WAN MPLS as a possible culprit. Considering that the problem was always related to establishing a TCP connection, we immediately suspected the firewall layer. The fact that firewalls (and load balancers) are the only kind of network devices that understand the preamble and the end of a TCP network connection factored into our suspicions. How was it possible that a router or switch can randomly drop only the SYN/ACK packet and not normal data packets if routers and switches cannot tell the difference? Possible, but very improbable. We needed to go deeper and perform packet analysis on these devices to find the culprit.

With that said, we began our search by capturing data using OPNET along with Wireshark. Initially we started out by capturing traffic between systems across the two data centers. We immediately noticed a successful SYN packet traversing the firewall but the return SYN/ACK packet gets dropped by the firewall. The flow can be seen in the below network diagram:

Network Diagram between client and server where timeouts were observedNetwork Diagram between client and server where timeouts were observed

The same test was performed from the opposite direction with the same result. We can see the system properly responds with a SYN/ACK packet but it is dropped by its local firewall.  Obviously, we validated the necessary ACL was in place so something outside of the policy was dropping this packet.

While, we now had circumstantial evidence that the issue had to be the firewall, we needed to be convinced. For instance there was the possibility – however remote – that the issue could have been at the MPLS. So, we decided to perform another test, this time within the same datacenter where the packets did not cross the MPLS. In addition, we performed the network trace at the port-channels that connect directly to the physical firewall. It involved a client from Network1 attempting to establish a TCP connection to a server on Network3. If you look at the picture below, there are 3 separate VLANs on the layer3 switch and 2 Virtual firewalls that the client needs to pass through in order to reach the server.

Network diagram for detailed packet analysisNetwork diagram for detailed packet analysis

The purple line shows the SYN packet originating from the client and it crosses all 3 networks successfully. We were able to see all three instances of the same packet in our Wireshark captures.

We should expect to see three SYN/ACK packets passing through all 3 networks, but we only see one instance on our captures which tells us the Virtual Firewall “B” in this case was not sending it out into the Network 2.

3 seconds after TCP’s RTO timer expires, a SYN packet retransmission is sent. Based on our packet captures, this time the TCP connection was successfully established b/c we saw 3 SYN packet instances traversing all 3 networks as well as 3 SYN/ACK packet instances.

This was conclusive and indisputable evidence that the cause of the problem existed at the firewall layer. Now that we knew the firewall was indeed our culprit, we needed to focus on it to find the smoking gun.

After an hour’s worth of looking at the firewall stats, we found that every 60 seconds SNMP was hogging the CPU. We were able to see this at the system level of the firewall (not at the virtual firewall level). This is exactly the same time that spikes were observed on the web service graphs.

The root cause of this problem turned out to be an SNMP bug within the firewall code. To completely solve the TCP connection timeout problem, we disabled SNMP polling on both the firewalls and the SNMP poller.

This is a great example of how packet analysis and our monitoring tools helped Ticketmaster identify and fix an issue.

Podcasts And Happiness

Recent studies on happiness conclude that longer commutes have a direct negative impact on levels of  happiness:
According to the calculations of Frey and Stutzer[1], a person with a one-hour commute has to earn 40 percent more money to be as satisfied with life as someone who walks to the office. Another study[2], led by Daniel Kahneman and the economist Alan Krueger, surveyed nine hundred working women in Texas and found that commuting was, by far, the least pleasurable part of their day.

That should have made me downright miserable, considering that my new job has increased my commute by about 4000%. I went from walking across my street to driving about 40 minutes each way, in gnarly LA traffic no less.

Luckily, I’ve discovered podcasts – albeit a decade too late – to keep me sane. I’ve mostly stuck to Martin Fowler’s recommendations with the addition of Freakonomics radio. By far the most consistently interesting podcast I’ve come across is RadioLab. The episode Detective Stories with three stories around digging up the past is particularly captivating.

While I believe that podcasts have ensured that my happiness has not dropped or may even have increased, they’ve brought problems of their own. There are times when I realize that I’ll be reaching my destination before the end of an episode, resulting in me quixotically willing the traffic lights to turn red and hoping that my commute was a wee bit longer. File that one under first world problems.

I’ve tentatively settled on Google’s Listen app as my podcast delivery vehicle of choice on Android. It’s buggy and not supported any more by Google, but it syncs with my Google Reader account. I would happily pay for the BeyondPod app if I could figure out how to sync a specific folder within Google Reader.

Sidebar: While searching for the study linking happiness to commute lengths, the first result was Jonah Lehrer’s post (linked above). Turns out that Jonah is also the contributor to RadioLab and contributed to my other favorite episode which randomly enough happens to about the pervasiveness of randomness in our lives.

[1] Stress That Doesn’t Pay: The Commuting Paradox by Stutzer and Frey

[2] Developments in the Measurement of Subjective Well-Being by Kahneman and Krueger