Over at the Ticketmaster tech blog, Audyn Espinoza and I co-authored a post talking about how our investment in monitoring tools paid off and helped us fix an irksome performance issue with our web services:
At Ticketmaster we’re passionate about monitoring our production systems. As a result, we occasionally come across interesting issues affecting our services that would otherwise go unnoticed. Unfortunately, monitoring only indicates the symptoms of what’s wrong and not necessarily the cause. Digging in deeper and getting to the root cause is a whole different ball game. This is one such example.
This story starts out with the observation that one of our web service calls had an unusually high number of timeouts. It was particularly unusual because the web service in question typically responds in about 50 ms, and the client times out at 1 s. To add to that, the metrics at the web service level were still reporting a 99th percentile response time in the 50 ms range. The issue had to be in the network between the client and the service.
We took a closer look at the metrics on the client side and a pattern emerged that we had missed earlier:
Time chart of web service response times as observed at the client
For a given cluster, the timeouts were occurring every minute at the same second mark. For example, on cluster A, timeouts would occur at 5:02:27, 5:03:27, 5:04:27, and at 5:02:55, 5:03:55, 5:04:55 on another cluster. While perplexing and a great data point, we were still nowhere close to the root cause. It was time for tcpdump.
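The per-minute pattern described above can be surfaced by bucketing timeout timestamps by their second-of-minute for each cluster. The following is only a minimal sketch of that idea, not the tooling used in the original investigation: the function name `second_mark_histogram`, the `(cluster, timestamp)` input shape, and the label "B" for the second cluster are all assumptions made for illustration. The timestamps are the ones quoted in the post.

```python
from collections import Counter
from datetime import datetime


def second_mark_histogram(timeouts):
    """Count timeouts per (cluster, second-of-minute) pair.

    `timeouts` is an iterable of (cluster, "H:MM:SS") tuples; this input
    shape is a hypothetical one chosen for the example.
    """
    counts = Counter()
    for cluster, stamp in timeouts:
        parsed = datetime.strptime(stamp, "%H:%M:%S")
        counts[(cluster, parsed.second)] += 1
    return counts


# Timestamps quoted in the post; the cluster label "B" is an assumption.
sample = [
    ("A", "5:02:27"), ("A", "5:03:27"), ("A", "5:04:27"),
    ("B", "5:02:55"), ("B", "5:03:55"), ("B", "5:04:55"),
]

hist = second_mark_histogram(sample)
for (cluster, second), n in sorted(hist.items()):
    print(f"cluster {cluster}: {n} timeouts at second :{second:02d}")
```

If timeouts were spread randomly, the counts would be scattered across many second marks. A single dominant bucket per cluster is what points to something that runs once a minute.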
Michael Connelly offers up a cautionary devops tale of what can happen when your alerts are too sensitive or generate too much noise:
“The vault’s sensor alarm had repeatedly been going off all week. [The thieves], with their digging and their drills, must have been tripping the alarms. Four straight nights the cops are called out along with the manager. Sometimes three times in one night. They don’t find anything and begin to think it’s the alarm. The sound-and-movement sensor is off balance. So the manager calls the alarm company and they can’t get anybody out until after the holiday weekend, you know, Labor Day. So this guy, the manager—”
“Turns the alarm off.” Bosch finished for her.
“You got it. He decides he isn’t going to get called out each night during the weekend. He’s supposed to go down to the Springs to his time-share condo and play golf. He turns the alarms off. Of course, he no longer works for WestLand National.”
From The Black Echo – the first book in Connelly’s brilliant Harry Bosch series.