Welcome!

Microservices Expo Authors: Elizabeth White, Pat Romanski, Reinhard Brandstädter, Liz McMillan, Cynthia Dunlop

Related Topics: Microservices Expo

Microservices Expo: Blog Feed Post

Deploying APM in the Enterprise | Part 5

Alerts – Storm of the Century – Every Week!

Welcome back to my series on Deploying APM in the Enterprise. In Part 4: Path of the Rockstar, we discussed how to deploy your new monitoring tool and get maximum value from your time and monetary commitment. This post will cover one of the most important aspects of monitoring: alerting. This is the topic that can make or break your entire implementation. Get it wrong and you've wasted a bunch of time and money on mediocre results. Get it right and your time and money investment will be multiplied by the value you derive every day.

App Man wrote a great blog post earlier this year about behavioral learning and analytics as they apply to alerts. If you haven't already done so, I suggest you go read it after you finish this post. Instead of repeating what was covered in that post, we will explore the issues that I saw out in real enterprise operations centers.

Traditional Alerting Methods Don't Work Well
Do any of these sound familiar?

  • "I got paged at 3 AM with a high CPU alert. It was backups running and consuming the CPU. This happens almost every week! Maybe we should turn change the threshold setting and timing."
  • "We just got a notification of high disk and network I/O rates. Is that normal? Does anyone know if our app is still working right?"
  • "We just got an alert on high JVM memory usage. Can someone use the app to see if anything is wrong?"
  • "We just got a call from a user complaining that the website is slow but there were no alerts."

Comments like these are a way of life when you set static thresholds (ex. CPU utilization > 90% for 5 minutes) on metrics that aren't direct indicators of application performance. It's the equivalent of taking a person's heart rate while they are exercising to see if the person is performing as expected. A really high heart rate might indicate that a person is performing well or that they are about to die of a heart attack. Heart rate would be a supporting metric to something more meaningful like how long it took to run the past 1/4 mile. The same holds true for application performance. We will explore this concept further a little later.

Storm of the Century ... Again!

One of the most important lessons I learned while working in large enterprise environments is that you will almost always set static thresholds wrong. Set them too high and you run the risk of missing a real problem. Set them too low and you will get so many alerts that they become irrelevant as you spend all of your time chasing "problems" that don't really exist. Getting massive amounts of alerts in a short period of time is referred to as an "Alert Storm" and is really despised in the IT Operations world. Alert storms send masses of people scrambling trying to determine if and what kind of impact there really is to the business.
Alert Storms are so detrimental to operations that companies spend a lot of money on systems designed to prevent alert storms. These systems become a central aggregation point for alerts and rules are written that try to intelligently address alert storm conditions. This method just adds to the overhead costs and complexity of your overall monitoring environment and should ideally never have to be considered.

Alerts Done Right - Business Impact

The right question to ask now is; "How can alerting be done the right way without spending more time and money than it costs to develop and run my applications?"

Your most critical, intelligent, trusted (or whatever other buzz words makes sense here) alerts should be based off metrics that directly represent business impact. Following are a few examples:

  • End user response time (good indicator of regional issues)
  • Business transaction response time (good indicator of systemic issues)
  • Business transaction throughput rate (do we see the same amount of traffic as usual?)
  • Number of widgets sold (is there a problem preventing users from buying?)

Now that you know what type of metrics should be the triggers for your alerts, you need to know what the proper alerting method is for these metrics. By now you should know that I am going to discourage the use of static thresholds. Your monitoring tool needs to support behavioral baselining and alerts based upon deviation from baselines. Simply put, your monitoring tool needs to automatically learn normal behavior for each metric and only alert if there is a large enough deviation from that normal behavior.

Now let me point out that I do not hate static thresholds. On the contrary, I find them useful in certain situations. For example, if I've promised a 300 ms response time from the service that I manage, I really want an alert if ANY transactions take longer than 300 ms so I can identify the root cause and make sure it never happens again. That is a perfect time to set up a static threshold but it is more of an outlier case when it comes to alerting.

Here is a real world example of how powerful behavioral based alerts are compared to static based. When I was working for the Investment Banking division of a global Financial Services firm, the operations center received an alert that was based upon deviation from normal behavior. The alert was routed to the application support team who quickly identified the issue and were able to avoid an outage of their trading platform. A post event analysis reveled that the behavioral based alert triggered 45 minutes before an old static based alert would have been sent out. This 45 minute head start enabled the support team to completely avoid business impact, which equated to saving millions of dollars per hour in lost revenue for that particular application.

I love it when you recoup the cost of your monitoring tools by avoiding a single outage!!!

Integration, Not Segregation
Now that we know about behavioral learning and alerting, and that we need to focus on metrics that directly correlate to business impact, what else is important when it comes to alerts?

Integration and analysis of alerts and data can help reduce your MTTR (mean time to repair) from hours/days/weeks to minutes. When your operations center receives an alert, they usually just forward it on to the appropriate support team and wait to hear back on the resolution. If done right, your operations center can pass along a full set of meaningful information to the proper support team so that they can act almost immediately. Imagine sending an email to support that contained a link to a slow "checkout" business transaction plus charts of all of the supporting metrics (CPU, garbage collection, network i/o, etc...) that deviated from normal behavior before, during, and after the time of the slow transaction. That's way more powerful than sending an alert from ops to app support that complains of high CPU utilization on a given host.

You Can't Afford to Live in the Past
The IT world is constantly changing. What once was "cutting edge" has transitioned through "good enough" and is full blown "you still use that?" Alerts from static thresholds based upon metrics that have no relationship to business impact are costing your organization time and money. Monitoring Rockstars are constantly adapting to the changing IT landscape and making sure their organization takes advantage of the strategies and technologies that enable competitive advantage.

When you use the right monitoring tools with the proper alerting strategy, you help your organization improve customer service, focus on creating new and better product, and increase profits all by reducing the number and length of application outages. So implement the strategies discussed here, document your success, and then go ask for a raise!

Thanks for taking the time to read this week's installment in my continuing series. Next week I'll share my thoughts and experience on increasing adoption of your monitoring tools across organizational silos to really crank up the value proposition.

Read the original blog entry...

More Stories By AppDynamics Blog

In high-production environments where release cycles are measured in hours or minutes — not days or weeks — there's little room for mistakes and no room for confusion. Everyone has to understand what's happening, in real time, and have the means to do whatever is necessary to keep applications up and running optimally.

DevOps is a high-stakes world, but done well, it delivers the agility and performance to significantly impact business competitiveness.

@MicroservicesExpo Stories
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives. Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, wh...
The 19th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit y...
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 19th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo New York Call for Papers is now open.
Many banks and financial institutions are experimenting with containers in development environments, but when will they move into production? Containers are seen as the key to achieving the ultimate in information technology flexibility and agility. Containers work on both public and private clouds, and make it easy to build and deploy applications. The challenge for regulated industries is the cost and complexity of container security compliance. VM security compliance is already challenging, ...
IoT generates lots of temporal data. But how do you unlock its value? How do you coordinate the diverse moving parts that must come together when developing your IoT product? What are the key challenges addressed by Data as a Service? How does cloud computing underlie and connect the notions of Digital and DevOps What is the impact of the API economy? What is the business imperative for Cognitive Computing? Get all these questions and hundreds more like them answered at the 18th Cloud Expo...
Cloud-based NCLC (No-code/low code) application builder platforms empower everyone in the organization to quickly build applications and executable processes that broaden access, deepen collaboration, and enhance transparency for all team members. Line of business owners (LOBO) and operations managers know best their part of the business and their processes. IT departments are beginning to leverage NCLC platforms to empower and enable LOBOs to lead the innovation, transform the organization, an...
@DevOpsSummit taking place June 7-9, 2016 at Javits Center, New York City, and Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 18th International @CloudExpo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
SYS-CON Events announced today the Docker Meets Kubernetes – Intro into the Kubernetes World, being held June 9, 2016, in conjunction with 18th Cloud Expo | @ThingsExpo, at the Javits Center in New York, NY. Register for 'Docker Meets Kubernetes Workshop' Here! This workshop led by Sebastian Scheele, co-founder of Loodse, introduces participants to Kubernetes (container orchestration). Through a combination of instructor-led presentations, demonstrations, and hands-on labs, participants learn ...
Just last week a senior Hybris consultant shared the story of a customer engagement on which he was working. This customer had problems, serious problems. We’re talking about response times far beyond the most liberal acceptable standard. They were unable to solve the issue in their eCommerce platform – specifically Hybris. Although the eCommerce project was delivered by a system integrator / implementation partner, the vendor still gets involved when things go really wrong. After all, the vendo...
The initial debate is over: Any enterprise with a serious commitment to IT is migrating to the cloud. But things are not so simple. There is a complex mix of on-premises, colocated, and public-cloud deployments. In this power panel at 18th Cloud Expo, moderated by Conference Chair Roger Strukhoff, panelists will look at the present state of cloud from the C-level view, and how great companies and rock star executives can use cloud computing to meet their most ambitious and disruptive business ...
Agile teams report the lowest rate of measuring non-functional requirements. What does this mean for the evolution of quality in this era of Continuous Everything? To explore how the rise of SDLC acceleration trends such as Agile, DevOps, and Continuous Delivery are impacting software quality, Parasoft conducted a survey about measuring and monitoring non-functional requirements (NFRs). Here's a glimpse at what we discovered and what it means for the evolution of quality in this era of Continuo...
You might already know them from theagileadmin.com, but let me introduce you to two of the leading minds in the Rugged DevOps movement: James Wickett and Ernest Mueller. Both James and Ernest are active leaders in the DevOps space, in addition to helping organize events such as DevOpsDays Austinand LASCON. Our conversation covered a lot of bases from the founding of Rugged DevOps to aligning organizational silos to lessons learned from W. Edwards Demings.
When I talk about driving innovation with self-organizing teams, I emphasize that such self-organization includes expecting the participants to organize their own teams, give themselves their own goals, and determine for themselves how to measure their success. In contrast, the definition of skunkworks points out that members of such teams are “usually specially selected.” Good thing he added the word usually – because specially selecting such teams throws a wrench in the entire works, limiting...
As AT&Ts VP of Domain 2.0 architecture writes one aspect of their Domain 2.0 strategy is a goal to embrace a Microservices Application Architecture. One page 9 they describe how these envisage them fitting into the ECOMP architecture: "The initial steps of the recipes include a homing and placement task using constraints specified in the requests. ‘Homing and Placement' are micro-services involving orchestration, inventory, and controllers responsible for infrastructure, network, and applicati...
Application development and delivery methods have undergone radical changes in recent years to improve scalability and resiliency. Container images are the new build and deployment artifacts that are used to ship and run software. While startups have long been comfortable experimenting with and embracing new technologies, even large enterprises are now re-architecting their software systems so that they can benefit from container-enabled micro services architectures. With the launch of DC/OS, w...
Earlier this week, we hosted a Continuous Discussion (#c9d9) on Continuous Delivery (CD) automation and orchestration, featuring expert panelists Dondee Tan, Test Architect at Alaska Air, Taco Bakker, a LEAN Six Sigma black belt focusing on CD, and our own Sam Fell and Anders Wallgren. During this episode, we discussed the differences between CD automation and orchestration, their challenges with setting up CD pipelines and some of the common chokepoints, as well as some best practices and tips...
SYS-CON Events announced today TechTarget has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. TechTarget is the Web’s leading destination for serious technology buyers researching and making enterprise technology decisions. Its extensive global networ...
Korean Broadcasting System (KBS) will feature the upcoming 18th Cloud Expo | @ThingsExpo in a New York news documentary about the "New IT for the Future." The documentary will cover how big companies are transmitting or adopting the new IT for the future and will be filmed on the expo floor between June 7-June 9, 2016, at the Javits Center in New York City, New York. KBS has long been a leader in the development of the broadcasting culture of Korea. As the key public service broadcaster of Korea...
Automation is a critical component of DevOps and Continuous Delivery. This morning on #c9d9 we discussed CD Automation and how you can apply Automation to accelerate release cycles, improve quality, safety and governance? What is the difference between Automation and Orchestration? Where should you begin your journey to introduce both?
While there has been much ado about interoperability, there are still no real solutions, same as last year and the year before that. The large EHR vendors who continue to dominate the market still maintain that interoperability is all but solved, still can't connect EHRs across the continuum causing frustration by providers and a disservice to patients. The ONC pays lip service to the problem, but that is about it. It is time for the healthcare industry to consider alternatives like middleware w...