|By AppDynamics Blog||
|November 20, 2012 01:52 PM EST||
Welcome back to my series on Deploying APM in the Enterprise. In Part 4: Path of the Rockstar, we discussed how to deploy your new monitoring tool and get maximum value from your time and monetary commitment. This post will cover one of the most important aspects of monitoring: alerting. This is the topic that can make or break your entire implementation. Get it wrong and you've wasted a bunch of time and money on mediocre results. Get it right and your time and money investment will be multiplied by the value you derive every day.
App Man wrote a great blog post earlier this year about behavioral learning and analytics as they apply to alerts. If you haven't already done so, I suggest you go read it after you finish this post. Instead of repeating what was covered in that post, we will explore the issues that I saw out in real enterprise operations centers.
Traditional Alerting Methods Don't Work Well
Do any of these sound familiar?
- "I got paged at 3 AM with a high CPU alert. It was backups running and consuming the CPU. This happens almost every week! Maybe we should turn change the threshold setting and timing."
- "We just got a notification of high disk and network I/O rates. Is that normal? Does anyone know if our app is still working right?"
- "We just got an alert on high JVM memory usage. Can someone use the app to see if anything is wrong?"
- "We just got a call from a user complaining that the website is slow but there were no alerts."
Comments like these are a way of life when you set static thresholds (ex. CPU utilization > 90% for 5 minutes) on metrics that aren't direct indicators of application performance. It's the equivalent of taking a person's heart rate while they are exercising to see if the person is performing as expected. A really high heart rate might indicate that a person is performing well or that they are about to die of a heart attack. Heart rate would be a supporting metric to something more meaningful like how long it took to run the past 1/4 mile. The same holds true for application performance. We will explore this concept further a little later.
Storm of the Century ... Again!
One of the most important lessons I learned while working in large enterprise environments is that you will almost always set static thresholds wrong. Set them too high and you run the risk of missing a real problem. Set them too low and you will get so many alerts that they become irrelevant as you spend all of your time chasing "problems" that don't really exist. Getting massive amounts of alerts in a short period of time is referred to as an "Alert Storm" and is really despised in the IT Operations world. Alert storms send masses of people scrambling trying to determine if and what kind of impact there really is to the business.
Alert Storms are so detrimental to operations that companies spend a lot of money on systems designed to prevent alert storms. These systems become a central aggregation point for alerts and rules are written that try to intelligently address alert storm conditions. This method just adds to the overhead costs and complexity of your overall monitoring environment and should ideally never have to be considered.
Alerts Done Right - Business Impact
The right question to ask now is; "How can alerting be done the right way without spending more time and money than it costs to develop and run my applications?"
Your most critical, intelligent, trusted (or whatever other buzz words makes sense here) alerts should be based off metrics that directly represent business impact. Following are a few examples:
- End user response time (good indicator of regional issues)
- Business transaction response time (good indicator of systemic issues)
- Business transaction throughput rate (do we see the same amount of traffic as usual?)
- Number of widgets sold (is there a problem preventing users from buying?)
Now that you know what type of metrics should be the triggers for your alerts, you need to know what the proper alerting method is for these metrics. By now you should know that I am going to discourage the use of static thresholds. Your monitoring tool needs to support behavioral baselining and alerts based upon deviation from baselines. Simply put, your monitoring tool needs to automatically learn normal behavior for each metric and only alert if there is a large enough deviation from that normal behavior.
Now let me point out that I do not hate static thresholds. On the contrary, I find them useful in certain situations. For example, if I've promised a 300 ms response time from the service that I manage, I really want an alert if ANY transactions take longer than 300 ms so I can identify the root cause and make sure it never happens again. That is a perfect time to set up a static threshold but it is more of an outlier case when it comes to alerting.
Here is a real world example of how powerful behavioral based alerts are compared to static based. When I was working for the Investment Banking division of a global Financial Services firm, the operations center received an alert that was based upon deviation from normal behavior. The alert was routed to the application support team who quickly identified the issue and were able to avoid an outage of their trading platform. A post event analysis reveled that the behavioral based alert triggered 45 minutes before an old static based alert would have been sent out. This 45 minute head start enabled the support team to completely avoid business impact, which equated to saving millions of dollars per hour in lost revenue for that particular application.
I love it when you recoup the cost of your monitoring tools by avoiding a single outage!!!
Integration, Not Segregation
Now that we know about behavioral learning and alerting, and that we need to focus on metrics that directly correlate to business impact, what else is important when it comes to alerts?
Integration and analysis of alerts and data can help reduce your MTTR (mean time to repair) from hours/days/weeks to minutes. When your operations center receives an alert, they usually just forward it on to the appropriate support team and wait to hear back on the resolution. If done right, your operations center can pass along a full set of meaningful information to the proper support team so that they can act almost immediately. Imagine sending an email to support that contained a link to a slow "checkout" business transaction plus charts of all of the supporting metrics (CPU, garbage collection, network i/o, etc...) that deviated from normal behavior before, during, and after the time of the slow transaction. That's way more powerful than sending an alert from ops to app support that complains of high CPU utilization on a given host.
You Can't Afford to Live in the Past
The IT world is constantly changing. What once was "cutting edge" has transitioned through "good enough" and is full blown "you still use that?" Alerts from static thresholds based upon metrics that have no relationship to business impact are costing your organization time and money. Monitoring Rockstars are constantly adapting to the changing IT landscape and making sure their organization takes advantage of the strategies and technologies that enable competitive advantage.
When you use the right monitoring tools with the proper alerting strategy, you help your organization improve customer service, focus on creating new and better product, and increase profits all by reducing the number and length of application outages. So implement the strategies discussed here, document your success, and then go ask for a raise!
Thanks for taking the time to read this week's installment in my continuing series. Next week I'll share my thoughts and experience on increasing adoption of your monitoring tools across organizational silos to really crank up the value proposition.
SYS-CON Events announced today that Solgenia will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Solgenia is the global market leader in Cloud Collaboration and Cloud Infrastructure software solutions. Designed to “Bridge the Gap” between Personal and Professional S...
Mar. 29, 2015 03:00 PM EDT Reads: 2,820
SYS-CON Events announced today that MangoApps will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY., and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. MangoApps provides private all-in-one social intranets allowing workers to securely collaborate from anywhere in the world and from any device. Social, mobile, and eas...
Mar. 29, 2015 03:00 PM EDT Reads: 3,020
Hosted PaaS providers have given independent developers and startups huge advantages in efficiency and reduced time-to-market over their more process-bound counterparts in enterprises. Software frameworks are now available that allow enterprise IT departments to provide these same advantages for developers in their own organization. In his workshop session at DevOps Summit, Troy Topnik, ActiveState’s Technical Product Manager, will show how on-prem or cloud-hosted Private PaaS can enable organ...
Mar. 29, 2015 12:45 PM EDT Reads: 1,222
When it comes to microservices there are myths and uncertainty about the journey ahead. Deploying a “Hello World” app on Docker is a long way from making microservices work in real enterprises with large applications, complex environments and existing organizational structures. February 19, 2015 10:00am PT / 1:00pm ET → 45 Minutes Join our four experts: Special host Gene Kim, Gary Gruver, Randy Shoup and XebiaLabs’ Andrew Phillips as they explore the realities of microservices in today’s IT worl...
Mar. 29, 2015 12:45 PM EDT Reads: 1,809
Cloud computing is changing the way we look at IT costs, according to industry experts on a recent Cloud Luminary Fireside Chat panel discussion. Enterprise IT, traditionally viewed as a cost center, now plays a central role in the delivery of software-driven goods and services. Therefore, companies need to understand their cloud utilization and resulting costs in order to ensure profitability on their business offerings. Led by Bernard Golden, this fireside chat offers valuable insights on ho...
Mar. 29, 2015 12:45 PM EDT Reads: 792
OmniTI has expanded its services to help customers automate their processes to deliver high quality applications to market faster. Consistent with its focus on IT agility and quality, OmniTI operates under DevOps principles, exploring the flow of value through the IT delivery process, identifying opportunities to eliminate waste, realign misaligned incentives, and open bottlenecks. OmniTI takes a unique, value-centric approach by plotting each opportunity in an effort-payoff quadrant, then work...
Mar. 29, 2015 12:45 PM EDT Reads: 822
The world's leading Cloud event, Cloud Expo has launched Microservices Journal on the SYS-CON.com portal, featuring over 19,000 original articles, news stories, features, and blog entries. DevOps Journal is focused on this critical enterprise IT topic in the world of cloud computing. Microservices Journal offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. Follow new article posts on T...
Mar. 29, 2015 12:00 PM EDT Reads: 1,441
Modern Systems announced completion of a successful project with its new Rapid Program Modernization (eavRPMa"c) software. The eavRPMa"c technology architecturally transforms legacy applications, enabling faster feature development and reducing time-to-market for critical software updates. Working with Modern Systems, the University of California at Santa Barbara (UCSB) leveraged eavRPMa"c to transform its Student Information System from Software AG's Natural syntax to a modern application lev...
Mar. 29, 2015 11:00 AM EDT Reads: 988
For those of us that have been practicing SOA for over a decade, it's surprising that there's so much interest in microservices. In fairness microservices don't look like the vendor play that was early SOA in the early noughties. But experienced SOA practitioners everywhere will be wondering if microservices is actually a good thing. You see microservices is basically an SOA pattern that inherits all the well-known SOA principles and adds characteristics that address the use of SOA for distribut...
Mar. 29, 2015 11:00 AM EDT Reads: 1,010
Microservice architectures are the new hotness, even though they aren't really all that different (in principle) from the paradigm described by SOA (which is dead, or not dead, depending on whom you ask). One of the things this decompositional approach to application architecture does is encourage developers and operations (some might even say DevOps) to re-evaluate scaling strategies. In particular, the notion is forwarded that an application should be built to scale and then infrastructure sho...
Mar. 29, 2015 11:00 AM EDT Reads: 2,455
SYS-CON Events announced today the IoT Bootcamp – Jumpstart Your IoT Strategy, being held June 9–10, 2015, in conjunction with 16th Cloud Expo and Internet of @ThingsExpo at the Javits Center in New York City. This is your chance to jumpstart your IoT strategy. Combined with real-world scenarios and use cases, the IoT Bootcamp is not just based on presentations but includes hands-on demos and walkthroughs. We will introduce you to a variety of Do-It-Yourself IoT platforms including Arduino, Ras...
Mar. 29, 2015 11:00 AM EDT Reads: 2,087
Microservices are the result of decomposing applications. That may sound a lot like SOA, but SOA was based on an object-oriented (noun) premise; that is, services were built around an object - like a customer - with all the necessary operations (functions) that go along with it. SOA was also founded on a variety of standards (most of them coming out of OASIS) like SOAP, WSDL, XML and UDDI. Microservices have no standards (at least none deriving from a standards body or organization) and can be b...
Mar. 29, 2015 10:45 AM EDT Reads: 2,152
Our guest on the podcast this week is Jason Bloomberg, President at Intellyx. When we build services we want them to be lightweight, stateless and scalable while doing one thing really well. In today's cloud world, we're revisiting what to takes to make a good service in the first place. Listen in to learn why following "the book" doesn't necessarily mean that you're solving key business problems.
Mar. 29, 2015 10:45 AM EDT Reads: 1,274
Right off the bat, Newman advises that we should "think of microservices as a specific approach for SOA in the same way that XP or Scrum are specific approaches for Agile Software development". These analogies are very interesting because my expectation was that microservices is a pattern. So I might infer that microservices is a set of process techniques as opposed to an architectural approach. Yet in the book, Newman clearly includes some elements of concept model and architecture as well as p...
Mar. 29, 2015 10:15 AM EDT Reads: 2,128
Microservices, for the uninitiated, are essentially the decomposition of applications into multiple services. This decomposition is often based on functional lines, with related functions being grouped together into a service. While this may sound a like SOA, it really isn't, especially given that SOA was an object-centered methodology that focused on creating services around "nouns" like customer and product. Microservices, while certainly capable of being noun-based, are just as likely to be v...
Mar. 29, 2015 10:00 AM EDT Reads: 1,850
SYS-CON Events announced today the DevOps Foundation Certification Course, being held June ?, 2015, in conjunction with DevOps Summit and 16th Cloud Expo at the Javits Center in New York City, NY. This sixteen (16) hour course provides an introduction to DevOps – the cultural and professional movement that stresses communication, collaboration, integration and automation in order to improve the flow of work between software developers and IT operations professionals. Improved workflows will res...
Mar. 29, 2015 10:00 AM EDT Reads: 1,632
An explosive combination of technology trends will be where ‘microservices’ and the IoT Internet of Things intersect, a concept we can describe by comparing it with a previous theme, the ‘X Internet.' The idea of using small self-contained application components has been popular since XML Web services began and a distributed computing future of smart fridges and kettles was imagined long back in the early Internet years.
Mar. 29, 2015 09:00 AM EDT Reads: 2,175
Even though it’s now Microservices Journal, long-time fans of SOA World Magazine can take comfort in the fact that the URL – soa.sys-con.com – remains unchanged. And that’s no mistake, as microservices are really nothing more than a new and improved take on the Service-Oriented Architecture (SOA) best practices we struggled to hammer out over the last decade. Skeptics, however, might say that this change is nothing more than an exercise in buzzword-hopping. SOA is passé, and now that people are ...
Mar. 29, 2015 09:00 AM EDT Reads: 1,253
SOA Software has changed its name to Akana. With roots in Web Services and SOA Governance, Akana has established itself as a leader in API Management and is expanding into cloud integration as an alternative to the traditional heavyweight enterprise service bus (ESB). The company recently announced that it achieved more than 90% year-over-year growth. As Akana, the company now addresses the evolution and diversification of SOA, unifying security, management, and DevOps across SOA, APIs, microser...
Mar. 29, 2015 08:30 AM EDT Reads: 2,048
Exelon Corporation employs technology and process improvements to optimize their IT operations, manage a merger and acquisition transition, and to bring outsourced IT operations back in-house. To learn more about how this leading energy provider in the US, with a family of companies having $23.5 billion in annual revenue, accomplishes these goals we're joined by Jason Thomas, Manager of Service, Asset and Release Management at Exelon. The discussion is moderated by me, Dana Gardner, Principal A...
Mar. 29, 2015 07:30 AM EDT Reads: 768