|By AppDynamics Blog||
|November 20, 2012 01:52 PM EST||
Welcome back to my series on Deploying APM in the Enterprise. In Part 4: Path of the Rockstar, we discussed how to deploy your new monitoring tool and get maximum value from your time and monetary commitment. This post will cover one of the most important aspects of monitoring: alerting. This is the topic that can make or break your entire implementation. Get it wrong and you've wasted a bunch of time and money on mediocre results. Get it right and your time and money investment will be multiplied by the value you derive every day.
App Man wrote a great blog post earlier this year about behavioral learning and analytics as they apply to alerts. If you haven't already done so, I suggest you go read it after you finish this post. Instead of repeating what was covered in that post, we will explore the issues that I saw out in real enterprise operations centers.
Traditional Alerting Methods Don't Work Well
Do any of these sound familiar?
- "I got paged at 3 AM with a high CPU alert. It was backups running and consuming the CPU. This happens almost every week! Maybe we should turn change the threshold setting and timing."
- "We just got a notification of high disk and network I/O rates. Is that normal? Does anyone know if our app is still working right?"
- "We just got an alert on high JVM memory usage. Can someone use the app to see if anything is wrong?"
- "We just got a call from a user complaining that the website is slow but there were no alerts."
Comments like these are a way of life when you set static thresholds (ex. CPU utilization > 90% for 5 minutes) on metrics that aren't direct indicators of application performance. It's the equivalent of taking a person's heart rate while they are exercising to see if the person is performing as expected. A really high heart rate might indicate that a person is performing well or that they are about to die of a heart attack. Heart rate would be a supporting metric to something more meaningful like how long it took to run the past 1/4 mile. The same holds true for application performance. We will explore this concept further a little later.
Storm of the Century ... Again!
One of the most important lessons I learned while working in large enterprise environments is that you will almost always set static thresholds wrong. Set them too high and you run the risk of missing a real problem. Set them too low and you will get so many alerts that they become irrelevant as you spend all of your time chasing "problems" that don't really exist. Getting massive amounts of alerts in a short period of time is referred to as an "Alert Storm" and is really despised in the IT Operations world. Alert storms send masses of people scrambling trying to determine if and what kind of impact there really is to the business.
Alert Storms are so detrimental to operations that companies spend a lot of money on systems designed to prevent alert storms. These systems become a central aggregation point for alerts and rules are written that try to intelligently address alert storm conditions. This method just adds to the overhead costs and complexity of your overall monitoring environment and should ideally never have to be considered.
Alerts Done Right - Business Impact
The right question to ask now is; "How can alerting be done the right way without spending more time and money than it costs to develop and run my applications?"
Your most critical, intelligent, trusted (or whatever other buzz words makes sense here) alerts should be based off metrics that directly represent business impact. Following are a few examples:
- End user response time (good indicator of regional issues)
- Business transaction response time (good indicator of systemic issues)
- Business transaction throughput rate (do we see the same amount of traffic as usual?)
- Number of widgets sold (is there a problem preventing users from buying?)
Now that you know what type of metrics should be the triggers for your alerts, you need to know what the proper alerting method is for these metrics. By now you should know that I am going to discourage the use of static thresholds. Your monitoring tool needs to support behavioral baselining and alerts based upon deviation from baselines. Simply put, your monitoring tool needs to automatically learn normal behavior for each metric and only alert if there is a large enough deviation from that normal behavior.
Now let me point out that I do not hate static thresholds. On the contrary, I find them useful in certain situations. For example, if I've promised a 300 ms response time from the service that I manage, I really want an alert if ANY transactions take longer than 300 ms so I can identify the root cause and make sure it never happens again. That is a perfect time to set up a static threshold but it is more of an outlier case when it comes to alerting.
Here is a real world example of how powerful behavioral based alerts are compared to static based. When I was working for the Investment Banking division of a global Financial Services firm, the operations center received an alert that was based upon deviation from normal behavior. The alert was routed to the application support team who quickly identified the issue and were able to avoid an outage of their trading platform. A post event analysis reveled that the behavioral based alert triggered 45 minutes before an old static based alert would have been sent out. This 45 minute head start enabled the support team to completely avoid business impact, which equated to saving millions of dollars per hour in lost revenue for that particular application.
I love it when you recoup the cost of your monitoring tools by avoiding a single outage!!!
Integration, Not Segregation
Now that we know about behavioral learning and alerting, and that we need to focus on metrics that directly correlate to business impact, what else is important when it comes to alerts?
Integration and analysis of alerts and data can help reduce your MTTR (mean time to repair) from hours/days/weeks to minutes. When your operations center receives an alert, they usually just forward it on to the appropriate support team and wait to hear back on the resolution. If done right, your operations center can pass along a full set of meaningful information to the proper support team so that they can act almost immediately. Imagine sending an email to support that contained a link to a slow "checkout" business transaction plus charts of all of the supporting metrics (CPU, garbage collection, network i/o, etc...) that deviated from normal behavior before, during, and after the time of the slow transaction. That's way more powerful than sending an alert from ops to app support that complains of high CPU utilization on a given host.
You Can't Afford to Live in the Past
The IT world is constantly changing. What once was "cutting edge" has transitioned through "good enough" and is full blown "you still use that?" Alerts from static thresholds based upon metrics that have no relationship to business impact are costing your organization time and money. Monitoring Rockstars are constantly adapting to the changing IT landscape and making sure their organization takes advantage of the strategies and technologies that enable competitive advantage.
When you use the right monitoring tools with the proper alerting strategy, you help your organization improve customer service, focus on creating new and better product, and increase profits all by reducing the number and length of application outages. So implement the strategies discussed here, document your success, and then go ask for a raise!
Thanks for taking the time to read this week's installment in my continuing series. Next week I'll share my thoughts and experience on increasing adoption of your monitoring tools across organizational silos to really crank up the value proposition.
Right off the bat, Newman advises that we should "think of microservices as a specific approach for SOA in the same way that XP or Scrum are specific approaches for Agile Software development". These analogies are very interesting because my expectation was that microservices is a pattern. So I might infer that microservices is a set of process techniques as opposed to an architectural approach. Yet in the book, Newman clearly includes some elements of concept model and architecture as well as p...
May. 29, 2015 07:00 PM EDT Reads: 4,074
How can you compare one technology or tool to its competitors? Usually, there is no objective comparison available. So how do you know which is better? Eclipse or IntelliJ IDEA? Java EE or Spring? C# or Java? All you can usually find is a holy war and biased comparisons on vendor sites. But luckily, sometimes, you can find a fair comparison. How does this come to be? By having it co-authored by the stakeholders. The binary repository comparison matrix is one of those rare resources. It is edite...
May. 29, 2015 06:00 PM EDT Reads: 2,208
Container technology is sending shock waves through the world of cloud computing. Heralded as the 'next big thing,' containers provide software owners a consistent way to package their software and dependencies while infrastructure operators benefit from a standard way to deploy and run them. Containers present new challenges for tracking usage due to their dynamic nature. They can also be deployed to bare metal, virtual machines and various cloud platforms. How do software owners track the usag...
May. 29, 2015 03:45 PM EDT Reads: 1,455
Cloud Expo, Inc. has announced today that Andi Mann returns to DevOps Summit 2015 as Conference Chair. The 4th International DevOps Summit will take place on June 9-11, 2015, at the Javits Center in New York City. "DevOps is set to be one of the most profound disruptions to hit IT in decades," said Andi Mann. "It is a natural extension of cloud computing, and I have seen both firsthand and in independent research the fantastic results DevOps delivers. So I am excited to help the great team at ...
May. 29, 2015 02:00 PM EDT Reads: 2,520
Enterprises are fast realizing the importance of integrating SaaS/Cloud applications, API and on-premises data and processes, to unleash hidden value. This webinar explores how managers can use a Microservice-centric approach to aggressively tackle the unexpected new integration challenges posed by proliferation of cloud, mobile, social and big data projects. Industry analyst and SOA expert Jason Bloomberg will strip away the hype from microservices, and clearly identify their advantages and d...
May. 29, 2015 01:15 PM EDT Reads: 2,761
SYS-CON Events announced today that MetraTech, now part of Ericsson, has been named “Silver Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York, NY. Ericsson is the driving force behind the Networked Society- a world leader in communications infrastructure, software and services. Some 40% of the world’s mobile traffic runs through networks Ericsson has supplied, serving more than 2.5 billion subscribers.
May. 29, 2015 01:00 PM EDT Reads: 2,479
SYS-CON Events announced today that O'Reilly Media has been named “Media Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York City, NY. O'Reilly Media spreads the knowledge of innovators through its books, online services, magazines, and conferences. Since 1978, O'Reilly Media has been a chronicler and catalyst of cutting-edge development, homing in on the technology trends that really matter and spurring their adoption...
May. 29, 2015 12:30 PM EDT Reads: 1,352
Software development, like manufacturing, is a craft that requires the application of creative approaches to solve problems given a wide range of constraints. However, while engineering design may be craftwork, the production of most designed objects relies on a standardized and automated manufacturing process. By contrast, much of moving an application from prototype to production and, indeed, maintaining the application through its lifecycle has often remained craftwork. In his session at Dev...
May. 29, 2015 12:00 PM EDT Reads: 2,111
Amazon, Google and Facebook are household names in part because of their mastery of Big Data. But what about organizations without billions of dollars to spend on Big Data tools - how can they extract value from their data? In his session at 6th Big Data Expo®, Ali Ghodsi, Co-Founder and Head of Engineering at Databricks, discussed how the zero management cost and scalability of the cloud is addressing the challenges and pain points that data engineers face when working with Big Data. He also s...
May. 29, 2015 12:00 PM EDT Reads: 4,719
The 4th International Internet of @ThingsExpo, co-located with the 17th International Cloud Expo - to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA - announces that its Call for Papers is open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
May. 29, 2015 12:00 PM EDT Reads: 2,686
SYS-CON Events announced today that EnterpriseDB (EDB), the leading worldwide provider of enterprise-class Postgres products and database compatibility solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. EDB is the largest provider of Postgres software and services that provides enterprise-class performance and scalability and the open source freedom to divert budget from more costly traditiona...
May. 29, 2015 11:00 AM EDT Reads: 2,262
The Internet of Things is a misnomer. That implies that everything is on the Internet, and that simply should not be - especially for things that are blurring the line between medical devices that stimulate like a pacemaker and quantified self-sensors like a pedometer or pulse tracker. The mesh of things that we manage must be segmented into zones of trust for sensing data, transmitting data, receiving command and control administrative changes, and peer-to-peer mesh messaging. In his session a...
May. 29, 2015 11:00 AM EDT Reads: 4,487
May. 29, 2015 10:00 AM EDT Reads: 2,328
I read an insightful article this morning from Bernard Golden on DZone discussing the DevOps conundrum facing many enterprises today – is it better to build your own DevOps tools or go commercial? For Golden, the question arose from his observations at a number of DevOps Days events he has attended, where typically the audience is composed of startup professionals: “I have to say, though, that a typical feature of most presentations is a recitation of the various open source products and compo...
May. 29, 2015 10:00 AM EDT Reads: 1,548
The 5th International DevOps Summit, co-located with 17th International Cloud Expo – being held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the...
May. 29, 2015 10:00 AM EDT Reads: 4,864
This is the final installment of the six-part series Microservices and PaaS. It seems like forever since I attended Adrian Cockroft's meetup focusing on microservices. It's actually only been a couple of months, but much has happened since then: countless articles, meetups, and conference sessions focusing on microservices have been delivered, many meetings and design efforts at companies moving towards a microservices-based approach have been endured, and five installments of this blog series ...
May. 29, 2015 09:30 AM EDT Reads: 2,857
There is no question that the cloud is where businesses want to host data. Until recently hypervisor virtualization was the most widely used method in cloud computing. Recently virtual containers have been gaining in popularity, and for good reason. In the debate between virtual machines and containers, the latter have been seen as the new kid on the block – and like other emerging technology have had some initial shortcomings. However, the container space has evolved drastically since coming on...
May. 29, 2015 09:15 AM EDT Reads: 2,427
Bill Doerrfeld at Nordic APIs has written today about how APIs are evolving the B2B landscape. This is a particularly interesting article for me, because my personal background is working for an EDI provider, where I linked EDI processes from the private network to the Internet, over 15 years ago. Vordel was founded to allow new Web Services APIs to be used for B2B. Axway, a B2B software company, acquired Vordel in 2012 to link B2B with Web APIs. This caused a domino effect, with other API Manag...
May. 29, 2015 09:15 AM EDT Reads: 905
In the first four parts of this series I presented an introduction to microservices along with a handful of emerging microservices patterns, and a discussion of some of the downsides and challenges to using microservices. The most recent installment of this series looked at ten ways that PaaS facilitates microservices development and adoption. In this post I’ll cover some words of wisdom, advice intended for individuals, teams, and organizations considering a move to microservices. I've gleaned...
May. 29, 2015 09:00 AM EDT Reads: 3,332
Python is really a language which has swept the scene in recent years in terms of popularity, elegance, and functionality. Research shows that 8 out 10 computer science departments in the U.S. now teach their introductory courses with Python, surpassing Java. Top-ranked CS departments at MIT and UC Berkeley have switched their introductory courses to Python. And the top three MOOC providers (edX, Coursera, and Udacity) all offer introductory programming courses in Python. Not to mention, Python ...
May. 29, 2015 09:00 AM EDT Reads: 1,877