Welcome!

Microservices Expo Authors: Pat Romanski, Aruna Ravichandran, Elizabeth White, XebiaLabs Blog, Yeshim Deniz

Related Topics: Java IoT, Microservices Expo, Adobe Flex, Machine Learning , Apache

Java IoT: Article

Why Averages Are Inadequate, and Percentiles Are Great

Averages are ineffective because they are too simplistic and one-dimensional

Anyone who ever monitored or analyzed an application uses or has used averages. They are simple to understand and calculate. We tend to ignore just how wrong the picture is that averages paint of the world. To emphasis the point let me give you a real-world example outside of the performance space that I read recently in a newspaper.

The article was explaining that the average salary in a certain region in Europe was 1900 Euro's (to be clear this would be quite good in that region!). However when looking closer they found out that the majority, namely 9 out of 10 people, only earned around 1000 Euros and one would earn 10.000 (I over simplified this of course, but you get the idea). If you do the math you will see that the average of this is indeed 1900, but we can all agree that this does not represent the "average" salary as we would use the word in day to day live. So now let's apply this thinking to application performance.

The Average Response Time
The average response time is by far the most commonly used metric in application performance management. We assume that this represents a "normal" transaction, however this would only be true if the response time is always the same (all transaction run at equal speed) or the response time distribution is roughly bell curved.

A Bell curve represents the "normal" distribution of response times in which the average and the median are the same. It rarely ever occurs in real applications

In a Bell Curve the average (mean) and median are the same. In other words observed performance would represent the majority (half or more than half) of the transactions.

In reality most applications have few very heavy outliers; a statistician would say that the curve has a long tail. A long tail does not imply many slow transactions, but few that are magnitudes slower than the norm.

This is a typical Response Time Distribution with few but heavy outliers - it has a long tail. The average here is dragged to the right by the long tail.

We recognize that the average no longer represents the bulk of the transactions but can be a lot higher than the median.

You can now argue that this is not a problem as long as the average doesn't look better than the median. I would disagree, but let's look at another real-world scenario experienced by many of our customers:

This is another typical Response Time Distribution. Here we have quite a few very fast transactions that drag the average to the left of the actual median

In this case a considerable percentage of transactions are very, very fast (10-20 percent), while the bulk of transactions are several times slower. The median would still tell us the true story, but the average all of a sudden looks a lot faster than most of our transactions actually are. This is very typical in search engines or when caches are involved - some transactions are very fast, but the bulk are normal. Another reason for this scenario are failed transactions, more specifically transactions that failed fast. Many real-world applications have a failure rate of 1-10 percent (due to user errors or validation errors). These failed transactions are often magnitudes faster than the real ones and consequently distorted an average.

Of course performance analysts are not stupid and regularly try to compensate with higher frequency charts (compensating by looking at smaller aggregates visually) and by taking in minimum and maximum observed response times. However we can often only do this if we know the application very well, those unfamiliar with the application might easily misinterpret the charts. Because of the depth and type of knowledge required for this, it's difficult to communicate your analysis to other people - think how many arguments between IT teams have been caused by this. And that's before we even begin to think about communicating with business stakeholders!

A better metric by far are percentiles, because they allow us to understand the distribution. But before we look at percentiles, let's take a look a key feature in every production monitoring solution: Automatic Baselining and Alerting.

Automatic Baselining and Alerting
In real-world environments, performance gets attention when it is poor and has a negative impact on the business and users. But how can we identify performance issues quickly to prevent negative effects? We cannot alert on every slow transaction, since there are always some. In addition, most operations teams have to maintain a large number of applications and are not familiar with all of them, so manually setting thresholds can be inaccurate, quite painful and time-consuming.

The industry has come up with a solution called Automatic Baselining. Baselining calculates out the "normal" performance and only alerts us when an application slows down or produces more errors than usual. Most approaches rely on averages and standard deviations.

Without going into statistical details, this approach again assumes that the response times are distributed over a bell curve:

The Standard Deviation represents 33% of all transactions with the mean as the middle. 2xStandard Deviation represents 66% and thus the majority, everything outside could be considered an outlier. However most real world scenarios are not bell curved...

Typically, transactions that are outside two times standard deviation are treated as slow and captured for analysis. An alert is raised if the average moves significantly. In a bell curve this would account for the slowest 16.5 percent (and you can of course adjust that); however; if the response time distribution does not represent a bell curve, it becomes inaccurate. We either end up with a lot of false positives (transactions that are a lot slower than the average but when looking at the curve lie within the norm) or we miss a lot of problems (false negatives). In addition if the curve is not a bell curve, then the average can differ a lot from the median; applying a standard deviation to such an average can lead to quite a different result than you would expect. To work around this problem these algorithms have many tunable variables and a lot of "hacks" for specific use cases.

Why I Love Percentiles
A percentile tells me which part of the curve I am looking at and how many transactions are represented by that metric. To visualize this look at the following chart:

This chart shows the 50th and 90th percentile along with the average of the same transaction. It shows that the average is influenced far mor heavily by the 90th, thus by outliers and not by the bulk of the transactions

The green line represents the average. As you can see it is very volatile. The other two lines represent the 50th and 90th percentile. As we can see the 50th percentile (or median) is rather stable but has a couple of jumps. These jumps represent real performance degradation for the majority (50%) of the transactions. The 90th percentile (this is the start of the "tail") is a lot more volatile, which means that the outliers slowness depends on data or user behavior. What's important here is that the average is heavily influenced (dragged) by the 90th percentile, the tail, rather than the bulk of the transactions.

If the 50th percentile (median) of a response time is 500ms that means that 50% of my transactions are either as fast or faster than 500ms. If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower. The average in this case could either be lower than 500ms (on a heavy front curve), a lot higher (long tail) or somewhere in between. A percentile gives me a much better sense of my real world performance, because it shows me a slice of my response time curve.

For exactly that reason percentiles are perfect for automatic baselining. If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation. You need to react to that.

In many cases we see that the 75th or 90th percentile does not change at all in such a scenario. This means the slow transactions didn't get any slower, only the normal ones did. Depending on how long your tail is the average might not have moved at all in such a scenario.!

In other cases we see the 98th percentile degrading from 1s to 1.5 seconds while the 95th is stable at 900ms. This means that your application as a whole is stable, but a few outliers got worse, nothing to worry about immediately. Percentile-based alerts do not suffer from false positives, are a lot less volatile and don't miss any important performance degradations! Consequently a baselining approach that uses percentiles does not require a lot of tuning variables to work effectively.

The screenshot below shows the Median (50th Percentile) for a particular transaction jumping from about 50ms to about 500ms and triggering an alert as it is significantly above the calculated baseline (green line). The chart labeled "Slow Response Time" on the other hand shows the 90th percentile for the same transaction. These "outliers" also show an increase in response time but not significant enough to trigger an alert.

Here we see an automatic baselining dashboard with a violation at the 50th percentile. The violation is quite clear, at the same time the 90th percentile (right upper chart) does not violate. Because the outliers are so much slower than the bulk of the transaction an average would have been influenced by them and would not have have reacted quite as dramatically as the 50th percentile. We might have missed this clear violation!

How Can We Use Percentiles for Tuning?
Percentiles are also great for tuning, and giving your optimizations a particular goal. Let's say that something within my application is too slow in general and I need to make it faster. In this case I want to focus on bringing down the 90th percentile. This would ensure sure that the overall response time of the application goes down. In other cases I have unacceptably long outliers I want to focus on bringing down response time for transactions beyond the 98th or 99th percentile (only outliers). We see a lot of applications that have perfectly acceptable performance for the 90th percentile, with the 98th percentile being magnitudes worse.

In throughput oriented applications on the other hand I would want to make the majority of my transactions very fast, while accepting that an optimization makes a few outliers slower. I might therefore make sure that the 75th percentile goes down while trying to keep the 90th percentile stable or not getting a lot worse.

I could not make the same kind of observations with averages, minimum and maximum, but with percentiles they are very easy indeed.

Conclusion
Averages are ineffective because they are too simplistic and one-dimensional. Percentiles are a really great and easy way of understanding the real performance characteristics of your application. They also provide a great basis for automatic baselining, behavioral learning and optimizing your application with a proper focus. In short, percentiles are great!

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes application performance management in large scale production environments with special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.

Comments (1) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
rtalexander 11/21/12 12:58:00 AM EST

Hey, could you post a reference or two that covers the theory and/or practicalities of the approach you describe?

Thanks!

@MicroservicesExpo Stories
Containers have changed the mind of IT in DevOps. They enable developers to work with dev, test, stage and production environments identically. Containers provide the right abstraction for microservices and many cloud platforms have integrated them into deployment pipelines. DevOps and containers together help companies achieve their business goals faster and more effectively. In his session at DevOps Summit, Ruslan Synytsky, CEO and Co-founder of Jelastic, reviewed the current landscape of Dev...
A lot of time, resources and energy has been invested over the past few years on de-siloing development and operations. And with good reason. DevOps is enabling organizations to more aggressively increase their digital agility, while at the same time reducing digital costs and risks. But as 2017 approaches, the hottest trends in DevOps aren’t specifically about dev or ops. They’re about testing, security, and metrics.
You often hear the two titles of "DevOps" and "Immutable Infrastructure" used independently. In his session at DevOps Summit, John Willis, Technical Evangelist for Docker, covered the union between the two topics and why this is important. He provided an overview of Immutable Infrastructure then showed how an Immutable Continuous Delivery pipeline can be applied as a best practice for "DevOps." He ended the session with some interesting case study examples.
Software delivery was once specific to the IT industry. Now, Continuous Delivery pipelines are used around world from e-commerce to airline software. Building a software delivery pipeline once involved hours of scripting and manual steps–a process that’s painful, if not impossible, to scale. However Continuous Delivery with Application Release Automation tools offers a scripting-free, automated experience. Continuous Delivery pipelines are immensely powerful for the modern enterprise, boosting ...
Using new techniques of information modeling, indexing, and processing, new cloud-based systems can support cloud-based workloads previously not possible for high-throughput insurance, banking, and case-based applications. In his session at 18th Cloud Expo, John Newton, CTO, Founder and Chairman of Alfresco, described how to scale cloud-based content management repositories to store, manage, and retrieve billions of documents and related information with fast and linear scalability. He addres...
Docker containers have brought great opportunities to shorten the deployment process through continuous integration and the delivery of applications and microservices. This applies equally to enterprise data centers as well as the cloud. In his session at 20th Cloud Expo, Jari Kolehmainen, founder and CTO of Kontena, will discuss solutions and benefits of a deeply integrated deployment pipeline using technologies such as container management platforms, Docker containers, and the drone.io Cl tool...
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain. In this power panel at @...
SYS-CON Events announced today that Catchpoint Systems, Inc., a provider of innovative web and infrastructure monitoring solutions, has been named “Silver Sponsor” of SYS-CON's DevOps Summit at 18th Cloud Expo New York, which will take place June 7-9, 2016, at the Javits Center in New York City, NY. Catchpoint is a leading Digital Performance Analytics company that provides unparalleled insight into customer-critical services to help consistently deliver an amazing customer experience. Designed ...
In his session at @DevOpsSummit at 19th Cloud Expo, Robert Doyle, lead architect at eCube Systems, will examine the issues and need for an agile infrastructure and show the advantages of capturing developer knowledge in an exportable file for migration into production. He will introduce the use of NXTmonitor, a next-generation DevOps tool that captures application environments, dependencies and start/stop procedures in a portable configuration file with an easy-to-use GUI. In addition to captur...
Containers have changed the mind of IT in DevOps. They enable developers to work with dev, test, stage and production environments identically. Containers provide the right abstraction for microservices and many cloud platforms have integrated them into deployment pipelines. DevOps and Containers together help companies to achieve their business goals faster and more effectively. In his session at DevOps Summit, Ruslan Synytsky, CEO and Co-founder of Jelastic, reviewed the current landscape of D...
"We got started as search consultants. On the services side of the business we have help organizations save time and save money when they hit issues that everyone more or less hits when their data grows," noted Otis Gospodnetić, Founder of Sematext, in this SYS-CON.tv interview at @DevOpsSummit, held June 9-11, 2015, at the Javits Center in New York City.
Internet of @ThingsExpo, taking place June 6-8, 2017 at the Javits Center in New York City, New York, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @ThingsExpo New York Call for Papers is now open.
We call it DevOps but much of the time there’s a lot more discussion about the needs and concerns of developers than there is about other groups. There’s a focus on improved and less isolated developer workflows. There are many discussions around collaboration, continuous integration and delivery, issue tracking, source code control, code review, IDEs, and xPaaS – and all the tools that enable those things. Changes in developer practices may come up – such as developers taking ownership of code ...
The proper isolation of resources is essential for multi-tenant environments. The traditional approach to isolate resources is, however, rather heavyweight. In his session at 18th Cloud Expo, Igor Drobiazko, co-founder of elastic.io, drew upon his own experience with operating a Docker container-based infrastructure on a large scale and present a lightweight solution for resource isolation using microservices. He also discussed the implementation of microservices in data and application integrat...
I’m told that it has been 21 years since Scrum became public when Jeff Sutherland and I presented it at an Object-Oriented Programming, Systems, Languages & Applications (OOPSLA) workshop in Austin, TX, in October of 1995. Time sure does fly. Things mature. I’m still in the same building and at the same company where I first formulated Scrum.[1] Initially nobody knew of Scrum, yet it is now an open source body of knowledge translated into more than 30 languages[2] People use Scrum worldwide for ...
Much of the value of DevOps comes from a (renewed) focus on measurement, sharing, and continuous feedback loops. In increasingly complex DevOps workflows and environments, and especially in larger, regulated, or more crystallized organizations, these core concepts become even more critical. In his session at @DevOpsSummit at 18th Cloud Expo, Andi Mann, Chief Technology Advocate at Splunk, showed how, by focusing on 'metrics that matter,' you can provide objective, transparent, and meaningful f...
"We're bringing out a new application monitoring system to the DevOps space. It manages large enterprise applications that are distributed throughout a node in many enterprises and we manage them as one collective," explained Kevin Barnes, President of eCube Systems, in this SYS-CON.tv interview at DevOps at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
As the race for the presidency heats up, IT leaders would do well to recall the famous catchphrase from Bill Clinton’s successful 1992 campaign against George H. W. Bush: “It’s the economy, stupid.” That catchphrase is important, because IT economics are important. Especially when it comes to cloud. Application performance management (APM) for the cloud may turn out to be as much about those economics as it is about customer experience.
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
@DevOpsSummit at Cloud taking place June 6-8, 2017, at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long developm...