By AppDynamics Blog
November 20, 2012 01:52 PM EST
Welcome back to my series on Deploying APM in the Enterprise. In Part 4: Path of the Rockstar, we discussed how to deploy your new monitoring tool and get maximum value from your time and monetary commitment. This post will cover one of the most important aspects of monitoring: alerting. This is the topic that can make or break your entire implementation. Get it wrong and you've wasted a bunch of time and money on mediocre results. Get it right and your time and money investment will be multiplied by the value you derive every day.
App Man wrote a great blog post earlier this year about behavioral learning and analytics as they apply to alerts. If you haven't already done so, I suggest you go read it after you finish this post. Instead of repeating what was covered in that post, we will explore the issues that I saw out in real enterprise operations centers.
Traditional Alerting Methods Don't Work Well
Do any of these sound familiar?
- "I got paged at 3 AM with a high CPU alert. It was backups running and consuming the CPU. This happens almost every week! Maybe we should change the threshold setting and timing."
- "We just got a notification of high disk and network I/O rates. Is that normal? Does anyone know if our app is still working right?"
- "We just got an alert on high JVM memory usage. Can someone use the app to see if anything is wrong?"
- "We just got a call from a user complaining that the website is slow but there were no alerts."
Comments like these are a way of life when you set static thresholds (e.g., CPU utilization > 90% for 5 minutes) on metrics that aren't direct indicators of application performance. It's the equivalent of taking a person's heart rate while they are exercising to see if the person is performing as expected. A really high heart rate might indicate that a person is performing well or that they are about to die of a heart attack. Heart rate would be a supporting metric to something more meaningful, like how long it took to run the last quarter mile. The same holds true for application performance. We will explore this concept further a little later.
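To make the problem concrete, here is a minimal sketch of the kind of static rule described above (CPU utilization > 90% for 5 minutes). The function name and the sample data are hypothetical, and the sketch assumes one CPU sample per minute; it is not taken from any particular monitoring product.

```python
def sustained_breach(samples, threshold=90.0, window=5):
    """Return True if `window` consecutive samples all exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run of breaching samples, or reset it.
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False


# Hypothetical per-minute CPU % while the weekly backup runs.
backup_window_cpu = [35, 40, 93, 95, 97, 96, 94, 60]

# The rule fires even though nothing is wrong with the application.
page_someone = sustained_breach(backup_window_cpu)
```

The rule has no idea whether users are affected. It only knows the CPU was busy, which is exactly why the 3 AM page for a routine backup keeps happening.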
Storm of the Century ... Again!
One of the most important lessons I learned while working in large enterprise environments is that you will almost always set static thresholds wrong. Set them too high and you run the risk of missing a real problem. Set them too low and you will get so many alerts that they become irrelevant as you spend all of your time chasing "problems" that don't really exist. Getting massive amounts of alerts in a short period of time is referred to as an "Alert Storm" and is really despised in the IT Operations world. Alert storms send masses of people scrambling to determine whether there really is any impact to the business and, if so, what kind.
Alert Storms are so detrimental to operations that companies spend a lot of money on systems designed to prevent alert storms. These systems become a central aggregation point for alerts and rules are written that try to intelligently address alert storm conditions. This method just adds to the overhead costs and complexity of your overall monitoring environment and should ideally never have to be considered.
Alerts Done Right - Business Impact
The right question to ask now is: "How can alerting be done the right way without spending more time and money than it costs to develop and run my applications?"
Your most critical, intelligent, trusted (or whatever other buzzwords make sense here) alerts should be based on metrics that directly represent business impact. Following are a few examples:
- End user response time (good indicator of regional issues)
- Business transaction response time (good indicator of systemic issues)
- Business transaction throughput rate (do we see the same amount of traffic as usual?)
- Number of widgets sold (is there a problem preventing users from buying?)
Now that you know what type of metrics should be the triggers for your alerts, you need to know what the proper alerting method is for these metrics. By now you should know that I am going to discourage the use of static thresholds. Your monitoring tool needs to support behavioral baselining and alerts based upon deviation from baselines. Simply put, your monitoring tool needs to automatically learn normal behavior for each metric and only alert if there is a large enough deviation from that normal behavior.
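As a rough illustration of alerting on deviation from a baseline, the sketch below learns "normal" as the mean and standard deviation of past observations and flags a value only when it falls far outside that range. This is a deliberately simple stand-in, not how any specific APM product computes its baselines; the function name, the three-standard-deviation cutoff, and the sample data are all assumptions.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, num_stdevs=3.0):
    """Return True if `current` is more than `num_stdevs` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # Not enough data to learn a baseline yet.
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) > num_stdevs * spread


# Hypothetical business transaction response times (ms) observed
# during the same hour on previous days.
history = [210, 195, 205, 220, 200, 215, 198]

normal_reading = deviates_from_baseline(history, 212)    # within normal range
abnormal_reading = deviates_from_baseline(history, 450)  # far outside it
```

A real tool would also account for time of day, day of week, and seasonality, but the principle is the same: the threshold comes from observed behavior instead of a number someone guessed.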
Now let me point out that I do not hate static thresholds. On the contrary, I find them useful in certain situations. For example, if I've promised a 300 ms response time from the service that I manage, I really want an alert if ANY transactions take longer than 300 ms so I can identify the root cause and make sure it never happens again. That is a perfect time to set up a static threshold but it is more of an outlier case when it comes to alerting.
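The SLA case above is simple enough to sketch directly. The 300 ms limit comes from the example in the text; the record layout (`id`, `response_ms`) and the function name are hypothetical.

```python
SLA_MS = 300  # The promised response time from the example above.


def sla_violations(transactions, sla_ms=SLA_MS):
    """Return every transaction whose response time exceeded the SLA."""
    return [t for t in transactions if t["response_ms"] > sla_ms]


# Hypothetical transaction records.
transactions = [
    {"id": "tx-1", "response_ms": 120},
    {"id": "tx-2", "response_ms": 305},
    {"id": "tx-3", "response_ms": 300},
    {"id": "tx-4", "response_ms": 910},
]

violations = sla_violations(transactions)
```

Here the static threshold is appropriate because the number itself is the commitment: a transaction at exactly 300 ms meets the promise, and anything over it is worth investigating regardless of what "normal" looks like.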
Here is a real-world example of how powerful behavior-based alerts are compared to static threshold-based alerts. When I was working for the Investment Banking division of a global Financial Services firm, the operations center received an alert that was based upon deviation from normal behavior. The alert was routed to the application support team, who quickly identified the issue and were able to avoid an outage of their trading platform. A post-event analysis revealed that the behavior-based alert triggered 45 minutes before an old static threshold-based alert would have been sent out. This 45-minute head start enabled the support team to completely avoid business impact, which equated to saving millions of dollars per hour in lost revenue for that particular application.
I love it when you recoup the cost of your monitoring tools by avoiding a single outage!!!
Integration, Not Segregation
Now that we know about behavioral learning and alerting, and that we need to focus on metrics that directly correlate to business impact, what else is important when it comes to alerts?
Integration and analysis of alerts and data can help reduce your MTTR (mean time to repair) from hours, days, or weeks to minutes. When your operations center receives an alert, they usually just forward it on to the appropriate support team and wait to hear back on the resolution. If done right, your operations center can pass along a full set of meaningful information to the proper support team so that they can act almost immediately. Imagine sending an email to support that contained a link to a slow "checkout" business transaction plus charts of all of the supporting metrics (CPU, garbage collection, network I/O, etc.) that deviated from normal behavior before, during, and after the time of the slow transaction. That's way more powerful than sending an alert from ops to app support that complains of high CPU utilization on a given host.
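A minimal sketch of that kind of enriched alert is shown below. It builds a plain-text message for a slow business transaction and attaches only the supporting metrics that deviated from their own history. Everything here is illustrative: the function names, the metric names, the mean-and-standard-deviation check, and the example.com link are assumptions, and a real integration would attach charts where this sketch attaches raw values.

```python
from statistics import mean, stdev


def deviated(history, current, num_stdevs=3.0):
    """Simple stand-in for a baseline check: is `current` far from normal?"""
    if len(history) < 2:
        return False
    spread = stdev(history)
    if spread == 0:
        return current != history[0]
    return abs(current - mean(history)) > num_stdevs * spread


def build_alert(transaction, link, supporting_metrics):
    """Compose an alert body that lists only the abnormal supporting metrics."""
    abnormal = {
        name: m["current"]
        for name, m in supporting_metrics.items()
        if deviated(m["history"], m["current"])
    }
    lines = [
        f"Slow business transaction: {transaction}",
        f"Details: {link}",
        "Supporting metrics outside normal behavior:",
    ]
    lines += [f"  - {name}: {value}" for name, value in sorted(abnormal.items())]
    return "\n".join(lines)


# Hypothetical supporting metrics captured around the slow transaction.
metrics = {
    "cpu_percent": {"history": [40, 42, 38, 41, 39], "current": 95},
    "network_io_mbps": {"history": [100, 110, 90, 105, 95], "current": 102},
}

message = build_alert(
    "checkout",
    "https://apm.example.com/transactions/checkout",  # placeholder URL
    metrics,
)
```

The support team receives the business symptom first and the abnormal infrastructure metrics as context, which is the reverse of a bare "high CPU on host X" page.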
You Can't Afford to Live in the Past
The IT world is constantly changing. What once was "cutting edge" has transitioned through "good enough" and is now full-blown "you still use that?" Alerts from static thresholds based upon metrics that have no relationship to business impact are costing your organization time and money. Monitoring Rockstars are constantly adapting to the changing IT landscape and making sure their organization takes advantage of the strategies and technologies that enable competitive advantage.
When you use the right monitoring tools with the proper alerting strategy, you help your organization improve customer service, focus on creating new and better products, and increase profits, all by reducing the number and length of application outages. So implement the strategies discussed here, document your success, and then go ask for a raise!
Thanks for taking the time to read this week's installment in my continuing series. Next week I'll share my thoughts and experience on increasing adoption of your monitoring tools across organizational silos to really crank up the value proposition.