|By Wolfgang Gottesheim||
|October 27, 2012 05:00 PM EDT||
Does the following situation sound familiar? From one minute to the other, your production servers grind to a halt, terse emails are complemented by equally hectic phone calls, and the first order of business is to get back up and running. After the dust settles, you're usually left with a pile of log files and the assignment of figuring out what happened, why it happened, and what to do to keep it from happening again.
A common first step is trying to reproduce what has gone wrong. More often than not, this consumes a considerable amount of time that would be better spent on actually fixing the problem. In this first blog post of a series, I will present a Step-by-Step Guide to Diagnose Stuck Transactions within minutes and show how a modern APM Solution helps to pinpoint common production problems, without spending hours on reproducing it at first.
The Problem: Response Time Increases
One of our customers deployed a new version of their prime web application and, for the first couple of days, everything seemed fine. One evening, their operations team was alerted about significantly increasing response time, and upon further investigation they recognized that a Stuck Transaction was blocking their application. While restarting the server solved the problem from an operations perspective it is not a long term solution as currently processing transactions get canceled and users are thrown off the system.
Their APM solution automatically captures thread dumps in case of too long running or stuck transactions. These dumps assist developers with diagnosing the issue. Let's take this example and walk through one of the problems often seen in a production environment: Diagnosing Stuck Transactions and Identifying the Root Cause.
Step 1: Identify problematic JVM/CLR
To identify the correct thread dump to analyze, it's important to know which server was affected by the stuck transaction and at what time that happened.
The transaction flow indicates a problem on one of the application servers
When drilling to the actual transactions that flow through that application server and focusing on the timeframe just before the server got restarted, I noticed a timed-out PurePath that spent 100% in sync time. This means that all it did was wait for one or more monitors, which makes it a likely culprit for our stuck transaction.
The PurePath reveals the information about which Application Server (=Agent) was involved in that stuck transaction
Step 2: Identify blocked threads
Having identified the affected application server, we can browse the list of available thread dumps and open the corresponding one.
Thread dump overview on runnable, blocked, waiting and timed waiting threads
When looking at the thread dump, threads can be grouped by their state to get a quick overview on how many threads the application executed when the dump was performed and how many of them were actually blocked. In this case, we are looking at four threads as Figure 4 shows.
Out of the 600+ Threads in the application, 4 are blocked and subject for further investigation
Step 3: Identify root cause
Already the first one of these threads is the thread from our timed-out PurePath we looked at in Step 1 - identifiable by its ID. We also get the most important piece of information we need to identify the root cause: Which object is this thread waiting on, and who owns it?
We now know which method is blocking and that it is blocked because it waits for an object owned by another thread -> a potential deadlock?
Usually, the thread owning the object we are waiting for is highlighted in red. Given the number of threads in this dump, it's unlikely to spot it this way, but search for the ID of the owning thread giving us the following information:
The HistoryLayout thread owns a monitor object with two waiting threads, which causes the web request handler to block and run into a timeout for the end user - the deadlock is identified!
In this case, the method com.vaadin.ui.Table.unregisterPropertiesAndComponents is working on the same instance of ConfigurableReportsApplication that AbstractApplicationPortlet.handleRequest is waiting for. Having identified the thread, we have the full stack trace at hand in the lower pane, and can track down how this situation occurred.
Trying to reproduce such concurrency-related problems on a developer machine with typically requires a major effort. If your APM tool fully supports collaboration between teams, all analysis steps described above can be performed in your local environment, without direct access to production servers. Using an APM Solution like dynaTrace, it only took a couple of minutes to answer the crucial questions named above: we know exactly what happened and how it happened, and are thus also able to modify our application in a way to avoid this situation in the future.
@DevOpsSummit at Cloud taking place June 6-8, 2017, at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long developm...
Dec. 9, 2016 10:15 AM EST Reads: 1,921
Today’s IT environments are increasingly heterogeneous, with Linux, Java, Oracle and MySQL considered nearly as common as traditional Windows environments. In many cases, these platforms have been integrated into an organization’s Windows-based IT department by way of an acquisition of a company that leverages one of those platforms. In other cases, the applications may have been part of the IT department for years, but managed by a separate department or singular administrator. Still, whether...
Dec. 9, 2016 09:45 AM EST Reads: 549
Get deep visibility into the performance of your databases and expert advice for performance optimization and tuning. You can't get application performance without database performance. Give everyone on the team a comprehensive view of how every aspect of the system affects performance across SQL database operations, host server and OS, virtualization resources and storage I/O. Quickly find bottlenecks and troubleshoot complex problems.
Dec. 9, 2016 08:30 AM EST Reads: 2,228
As we enter the final week before the 19th International Cloud Expo | @ThingsExpo in Santa Clara, CA, it's time for me to reflect on six big topics that will be important during the show. Hybrid Cloud: This general-purpose term seems to provide a comfort zone for many enterprise IT managers. It sounds reassuring to be able to work with one of the major public-cloud providers like AWS or Microsoft Azure while still maintaining an on-site presence.
Dec. 9, 2016 04:45 AM EST Reads: 2,983
I’m a huge fan of open source DevOps tools. I’m also a huge fan of scaling open source tools for the enterprise. But having talked with my fair share of companies over the years, one important thing I’ve learned is that you can’t scale your release process using open source tools alone. They simply require too much scripting and maintenance when used that way. Scripting may be fine for smaller organizations, but it’s not ok in an enterprise environment that includes many independent teams and to...
Dec. 9, 2016 02:45 AM EST Reads: 794
Between 2005 and 2020, data volumes will grow by a factor of 300 – enough data to stack CDs from the earth to the moon 162 times. This has come to be known as the ‘big data’ phenomenon. Unfortunately, traditional approaches to handling, storing and analyzing data aren’t adequate at this scale: they’re too costly, slow and physically cumbersome to keep up. Fortunately, in response a new breed of technology has emerged that is cheaper, faster and more scalable. Yet, in meeting these new needs they...
Dec. 9, 2016 01:45 AM EST Reads: 1,967
More and more companies are looking to microservices as an architectural pattern for breaking apart applications into more manageable pieces so that agile teams can deliver new features quicker and more effectively. What this pattern has done more than anything to date is spark organizational transformations, setting the foundation for future application development. In practice, however, there are a number of considerations to make that go beyond simply “build, ship, and run,” which changes how...
Dec. 9, 2016 12:45 AM EST Reads: 5,139
In his general session at 19th Cloud Expo, Manish Dixit, VP of Product and Engineering at Dice, discussed how Dice leverages data insights and tools to help both tech professionals and recruiters better understand how skills relate to each other and which skills are in high demand using interactive visualizations and salary indicator tools to maximize earning potential. Manish Dixit is VP of Product and Engineering at Dice. As the leader of the Product, Engineering and Data Sciences team at D...
Dec. 9, 2016 12:45 AM EST Reads: 1,230
SYS-CON Events has announced today that Roger Strukhoff has been named conference chair of Cloud Expo and @ThingsExpo 2017 New York. The 20th Cloud Expo and 7th @ThingsExpo will take place on June 6-8, 2017, at the Javits Center in New York City, NY. "The Internet of Things brings trillions of dollars of opportunity to developers and enterprise IT, no matter how you measure it," stated Roger Strukhoff. "More importantly, it leverages the power of devices and the Internet to enable us all to im...
Dec. 9, 2016 12:30 AM EST Reads: 918
In IT, we sometimes coin terms for things before we know exactly what they are and how they’ll be used. The resulting terms may capture a common set of aspirations and goals – as “cloud” did broadly for on-demand, self-service, and flexible computing. But such a term can also lump together diverse and even competing practices, technologies, and priorities to the point where important distinctions are glossed over and lost.
Dec. 8, 2016 09:15 PM EST Reads: 1,682
Monitoring of Docker environments is challenging. Why? Because each container typically runs a single process, has its own environment, utilizes virtual networks, or has various methods of managing storage. Traditional monitoring solutions take metrics from each server and applications they run. These servers and applications running on them are typically very static, with very long uptimes. Docker deployments are different: a set of containers may run many applications, all sharing the resource...
Dec. 8, 2016 07:45 PM EST Reads: 5,760
Logs are continuous digital records of events generated by all components of your software stack – and they’re everywhere – your networks, servers, applications, containers and cloud infrastructure just to name a few. The data logs provide are like an X-ray for your IT infrastructure. Without logs, this lack of visibility creates operational challenges for managing modern applications that drive today’s digital businesses.
Dec. 8, 2016 05:00 PM EST Reads: 1,824
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...
Dec. 8, 2016 04:45 PM EST Reads: 1,873
Financial Technology has become a topic of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 20th Cloud Expo at the Javits Center in New York, June 6-8, 2017, will find fresh new content in a new track called FinTech.
Dec. 8, 2016 04:45 PM EST Reads: 2,272
@DevOpsSummit taking place June 6-8, 2017 at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @DevOpsSummit at Cloud Expo New York Call for Papers is now open.
Dec. 8, 2016 04:30 PM EST Reads: 1,994
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
Dec. 8, 2016 04:15 PM EST Reads: 2,343
"Dice has been around for the last 20 years. We have been helping tech professionals find new jobs and career opportunities," explained Manish Dixit, VP of Product and Engineering at Dice, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Dec. 8, 2016 02:15 PM EST Reads: 1,211
Rapid innovation, changing business landscapes, and new IT demands force businesses to make changes quickly. In the eyes of many, containers are at the brink of becoming a pervasive technology in enterprise IT to accelerate application delivery. In this presentation, attendees learned about the: The transformation of IT to a DevOps, microservices, and container-based architecture What are containers and how DevOps practices can operate in a container-based environment A demonstration of how ...
Dec. 8, 2016 01:15 PM EST Reads: 1,238
Cloud Expo, Inc. has announced today that Andi Mann returns to 'DevOps at Cloud Expo 2017' as Conference Chair The @DevOpsSummit at Cloud Expo will take place on June 6-8, 2017, at the Javits Center in New York City, NY. "DevOps is set to be one of the most profound disruptions to hit IT in decades," said Andi Mann. "It is a natural extension of cloud computing, and I have seen both firsthand and in independent research the fantastic results DevOps delivers. So I am excited to help the great t...
Dec. 8, 2016 01:15 PM EST Reads: 770
Without lifecycle traceability and visibility across the tool chain, stakeholders from Planning-to-Ops have limited insight and answers to who, what, when, why and how across the DevOps lifecycle. This impacts the ability to deliver high quality software at the needed velocity to drive positive business outcomes. In his general session at @DevOpsSummit at 19th Cloud Expo, Phil Hombledal, Solution Architect at CollabNet, discussed how customers are able to achieve a level of transparency that e...
Dec. 8, 2016 12:45 PM EST Reads: 1,277