By Michael Kopp
November 4, 2012 09:00 AM EST
An eCommerce site that crashes seven times during the Christmas season, and stays down for up to five hours each time, loses a lot of money and reputation. This happened to one of our customers, who told the story at our annual performance conference earlier this month. Of the several causes behind these crashes, I want to share more detail on one that I also see on other websites: load balancers set to round-robin instead of least-busy can easily lead to app server crashes caused by heap memory exhaustion. Let's dig into how to identify these problems and how to avoid them.
The Symptom: Crashing Tomcat Instances
The website is deployed on six Tomcats with three front-end Apache Web Servers. During peak load hours, individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed with out-of-memory exceptions, which brought down the rest of the site because the remaining servers could no longer handle the load. Figure 1 shows the actual flow of transactions through the system, highlighting unevenly distributed response time in the application servers and functional errors being reported on all tiers (red-colored server icon):
Even with equally distributed load (Round Robin Load Balancer Setting) one of the Tomcats spiked in Response Time Contribution before crashing
Once the app server started rejecting incoming connections, we could observe the first ripple effect of errors: a very high number of exceptions in the database layer, exceptions thrown between application tiers, and the web app responding with HTTP 500s:
Within 30 minutes the application served 43,000 pages with an HTTP 500 response, correlating with exceptions in the database and inter-tier communication
The Root Cause: Inefficient Database Statements and Connection Pool Usage
The exceptions caught in the Database Layer (JDBC) were already a very good hint of the root cause of this problem. A closer look at the Exceptions shows that connection pools are exhausted, which causes problems in the different components of the application:
Exhausted Connection Pool causes Exceptions that impact Data Access Layer as well as Widget Rendering
Looking at the performance breakdown by application layer reveals how much performance impact connection pooling has on the overall transaction response time:
Due to the connection pool problem a single request had to wait 3.8s on average to obtain a connection from the pool
The size of the pool was not the only problem. Several very inefficient database statements took a long time to execute for certain business transactions of the application, which caused the application server to hold on to each connection longer than normal. Because the load balancer was configured for round-robin, an app server kept receiving additional requests regardless of how busy it already was. Eventually, just by the random nature of incoming requests, one app server received several of the requests that executed these inefficient database calls. Once the connection pool was exhausted, the application started throwing exceptions that ultimately also led to a crash of the JVM. Once the first app server crashed, it didn't take long for the other app servers to go down as well.
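The link between statement duration and pool exhaustion can be estimated with Little's law: the average number of connections in use equals the request arrival rate multiplied by the average time each request holds a connection. The sketch below only illustrates that arithmetic. The request rate, hold times, and pool size are made-up numbers, not measurements from this customer.

```python
def connections_needed(requests_per_second, hold_time_ms):
    """Average connections in use by Little's law, rounded up."""
    busy_ms_per_second = requests_per_second * hold_time_ms
    return -(-busy_ms_per_second // 1000)  # integer ceiling division


POOL_SIZE = 20            # hypothetical pool size
REQUESTS_PER_SECOND = 50  # hypothetical load on one app server

# Hypothetical hold times: 200 ms for tuned SQL, 2000 ms for slow SQL.
for label, hold_ms in [("efficient statements", 200),
                       ("inefficient statements", 2000)]:
    needed = connections_needed(REQUESTS_PER_SECOND, hold_ms)
    if needed <= POOL_SIZE:
        status = "fits in the pool"
    else:
        status = "pool exhausted, requests wait for a connection"
    print(f"{label}: {needed} connections needed ({status})")
```

Under these assumptions, a tenfold increase in statement time means a tenfold increase in connections needed at the same load, which is why tuning the statements matters at least as much as enlarging the pool.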
The Solution: Optimizing App and Load Balancer
The problem was fixed by looking at the slowest database statements and optimizing them for performance, for example by adding indices on the database or making the SQL statements more efficient. They also adjusted the pool size to accommodate the expected load during peak hours.
They started by optimizing SQL Statements that took long to execute and those that got executed several times within the same transaction
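A minimal sketch of how such a ranking could be built from captured statement timings. The input format, a list of (SQL text, elapsed milliseconds) pairs for one transaction, and the sample statements are hypothetical; an APM tool would normally provide this breakdown directly.

```python
from collections import defaultdict


def rank_statements(executions):
    """Group executions by SQL text and sort by total time, slowest first.

    Returns a list of (sql, execution_count, total_ms) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # sql -> [count, total_ms]
    for sql, elapsed_ms in executions:
        totals[sql][0] += 1
        totals[sql][1] += elapsed_ms
    rows = [(sql, count, total) for sql, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical statements captured within a single transaction.
sample = [
    ("SELECT * FROM product WHERE id = ?", 5),
    ("SELECT * FROM orders WHERE customer = ?", 900),
    ("SELECT * FROM product WHERE id = ?", 5),
    ("SELECT * FROM product WHERE id = ?", 5),
]

for sql, count, total_ms in rank_statements(sample):
    print(f"{total_ms:>5} ms  x{count}  {sql}")
```

Statements with a high total time are candidates for indices or rewrites, while statements with a high execution count are candidates for batching or caching within the transaction.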
They also changed the load balancer setting from Round-Robin to Least-Busy, which was the LB vendor's preferred setting but which they had simply forgotten to configure in the production environment.
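A toy simulation shows why this setting matters when some requests are much slower than others. It is a sketch under stated assumptions: three servers instead of six, one request arriving per tick, and a deliberately contrived pattern in which every third request is slow. In the real incident the slow requests clustered by chance rather than by pattern. "Least-busy" is modeled here as "fewest in-flight requests", which is an assumption about what the vendor's setting measures.

```python
def simulate(durations, n_servers, policy):
    """Return the peak number of in-flight requests seen on each server.

    One request arrives per tick; durations[t] is how long it runs.
    """
    active = [[] for _ in range(n_servers)]  # end times per server
    peak = [0] * n_servers
    for t, duration in enumerate(durations):
        # Drop requests that have finished by tick t.
        active = [[end for end in ends if end > t] for ends in active]
        if policy == "round_robin":
            target = t % n_servers
        else:  # "least_busy": fewest in-flight requests, ties go to lowest index
            target = min(range(n_servers), key=lambda s: len(active[s]))
        active[target].append(t + duration)
        peak[target] = max(peak[target], len(active[target]))
    return peak


# Every third request is slow (30 ticks); the others take 1 tick.
durations = [30 if i % 3 == 0 else 1 for i in range(30)]

round_robin_peaks = simulate(durations, 3, "round_robin")
least_busy_peaks = simulate(durations, 3, "least_busy")
print("round robin peak in-flight per server:", round_robin_peaks)
print("least busy  peak in-flight per server:", least_busy_peaks)
```

With round-robin, all slow requests land on the same server, which ends up holding ten requests at once while the other two sit nearly idle. The least-busy policy steers new requests away from the loaded server and keeps the peak per server lower.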
The Result: Site Has Not Been Down Since
The site has not gone down since they made the changes to the application and the load balancer. Now the next holiday season is coming up, and they are ready for the next seasonal spikes. Even though they are confident that everything will work without problems, they have learned their lesson and are approaching performance proactively through proper load testing.
Next Steps: Proactive Performance Management
The lesson learned was that these problems could have been found before the holiday shopping season through proper load testing. They had done load testing before but never encountered this problem, for two reasons: 1) they didn't test at expected peak volumes for long enough sessions, and 2) they didn't use a tool that simulated real customer behavior variations on their highly interactive web site (too few scripts, and the scripts were too simple).
Their strategy for proactive performance management is that they:
- Perform load tests on the production system during low traffic hours (2 a.m.-6 a.m.), accepting the risk of minor sales losses in case of a crash, versus major sales losses during the holiday shopping season.
- Multiply the hourly load test volume by 2.5 since their actual peaks are 10 hours long.
- Use a load testing service that uses real browsers in different locations around the U.S.
- Use an APM solution that identifies problems within the application while running the load test.
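One way to read the 2.5 multiplier in the list above: the test window from 2 a.m. to 6 a.m. is four hours long while a real peak lasts ten hours, so the hourly volume is scaled by 10 / 4 to push a full peak's worth of requests through the shorter window. That interpretation and the hourly volume used below are assumptions, not figures from the customer.

```python
def compressed_hourly_volume(peak_hourly_volume, peak_hours, test_window_hours):
    """Hourly volume needed to replay a whole peak inside a shorter window."""
    scale = peak_hours / test_window_hours
    return peak_hourly_volume * scale


# Hypothetical peak of 40,000 requests per hour over a 10-hour peak,
# replayed in the 4-hour window between 2 a.m. and 6 a.m.
target = compressed_hourly_volume(40_000, peak_hours=10, test_window_hours=4)
print(f"load test target: {target:.0f} requests per hour")
```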
If you want to read more on common performance problems that are not found prior to moving to production, check out my recent series of blogs: Supersized Content, Deployment Mistakes, or Excessive Logging.