
Jim spent years on the user side of APM solving problems, fighting fires, and trying to convince all of his APM vendors that they could (and should) do better. His passion for performance tuning and troubleshooting led him from systems and application administration to working as an APM Architect tasked with designing an integrated ecosystem capable of monitoring next-generation data centers and the applications housed within. Jim never passed up an opportunity to test drive and provide feedback on (pick apart) an APM vendor's offering, so he has used most of the tools out there. Jim's viewpoint is a result of work in a high-pressure financial services environment, but his methods and approach apply to any IT organization that strives for greatness. Jim is a DZone MVB, is not an employee of DZone, and has posted 28 posts at DZone.

How to Triage a Busy Thread Count Alert in 14 Minutes

07.07.2014

This is a real example of troubleshooting a production application issue, provided by an AppDynamics customer. What you are about to see is a combination of runtime analytics, adaptive data collection, intelligent alerting, and a proven problem-solving workflow. From first alert to DBA handoff took only 14 minutes.

5:26 p.m. – Operations receives an email alert about Busy Threads breaching a threshold. AppDynamics detected the incident automatically and sent the alert when the Busy Threads JMX metric shot up to 182.

AppDynamics sends notifications detailing busy thread counts
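
As a rough illustration of the kind of threshold check that sits behind an alert like this, the sketch below flags a breach once a sampled metric stays above a limit for several consecutive readings. It is not AppDynamics' alerting logic. The function name, the threshold of 150, the three-sample window, and every sample value other than 182 are assumptions made up for the example.

```python
from typing import Iterable, Optional


def first_breach(samples: Iterable[int], threshold: int, sustain: int = 3) -> Optional[int]:
    """Return the index of the sample that completes `sustain` consecutive
    readings above `threshold`, or None if that never happens."""
    run = 0  # length of the current streak of readings above the threshold
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= sustain:
            return i
    return None


# Hypothetical busy-thread readings; only the value 182 comes from the article.
busy_threads = [40, 45, 160, 170, 182, 181]
breach_at = first_breach(busy_threads, threshold=150, sustain=3)
print(breach_at)
```

Requiring a short streak rather than a single reading is one common way to avoid alerting on a momentary spike; whether that suits a given application is a tuning decision.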

5:34 p.m. – Details from AppDynamics show that call volume is down, response time is up, errors are up, and network I/O is down. The initial suspicion is that the load balancer may be throttling traffic due to poor performance.

[Screenshots: average response time, throughput, and errors; network I/O; CPU]
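
One way to see why busy threads can climb while call volume falls is Little's Law: the average number of requests in flight equals throughput multiplied by average response time. The figures below are hypothetical and are not taken from this incident. They only show that a large enough jump in response time outweighs a drop in throughput.

```python
def in_flight(throughput_per_sec: float, avg_response_sec: float) -> float:
    """Little's Law: average concurrency = arrival rate * time in system."""
    return throughput_per_sec * avg_response_sec


# Hypothetical figures for illustration only.
normal = in_flight(throughput_per_sec=200.0, avg_response_sec=0.25)
degraded = in_flight(throughput_per_sec=60.0, avg_response_sec=3.0)
print(normal, degraded)
```

On a server that dedicates one worker thread to each request, the in-flight count is roughly the busy thread count, so slower responses alone can exhaust the pool even as traffic declines.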

5:38 p.m. – Following company procedure, the server is disabled in the load balancer so that it will not receive any more traffic. Recycling the application server is considered as a possible temporary resolution to the issue.

5:40 p.m. – Details from AppDynamics show that transactions are backing up because of a database issue. There is no need to recycle the application server. The issue is handed off to the DBA team with full application context for resolution.

[Screenshot: transaction map]

Screenshot showing problematic JDBC call as the culprit.


Later that day: DBA team fixes the issue and application response time returns to normal. All nodes are restored into the load balancer rotation.

This is an example of a scenario that IT Operations teams deal with regularly. Without AppDynamics in place to provide fault domain isolation, this type of problem usually ends up in a long conference call in which all support personnel for the application must participate until service has been restored. There is no need to waste significant company resources anymore. Stop the “all hands on deck” madness and see how AppDynamics can help your company today.

Published at DZone with permission of Jim Hirschauer, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)