Microservices Expo Authors: Elizabeth White, Ken Schwaber, Pat Romanski, Aruna Ravichandran, Carmen Gonzalez

Related Topics: Java IoT, Microservices Expo, Recurring Revenue

Java IoT: Book Excerpt

Book Excerpt: Java Application Profiling Tips and Tricks

Part 1 - Common types of performance issues

In the book Java Performance, Chapter 5, "Java Application Profiling," presented the basic concepts of using a modern Java profiler such as the Oracle Solaris Studio Performance Analyzer and NetBeans Profiler. It did not, however, show any specific tips and tricks in using the tools to identify performance issues and approaches of how to resolve them. This is the purpose of this excerpt. Its intention is to show how to use the tools to identify performance issues and take corrective actions to resolve them. This excerpt looks at several of the more common types of performance issues the authors have observed through many years of working as Java performance engineers.

Performance Opportunities
Most Java performance opportunities fall into one or more of the following categories:

•  Using a more efficient algorithm. The largest gains in the performance of an application come from the use of a more efficient algorithm. The use of a more efficient algorithm allows an application to execute with fewer CPU instructions, also known as a shorter path length. An application that executes with a shorter path length generally executes faster. Many different changes can lead to a shorter path length. At the highest level of the application, using a different data structure or modifying its implementation can lead to a shorter path length. Many applications that suffer application performance issues often use inappropriate data structures. There is no substitute for choosing the proper data structure and algorithm. As profiles are analyzed, take notice of the data structures and the algorithms used. Optimal performance can be realized when the best data structures and algorithms are utilized.

•  Reduce lock contention. Contending for access to a shared resource inhibits an application's capability to scale to a large number of software threads and across a large number of CPUs. Changes to an application that allow for less frequent lock contention and less duration of locking allow an application to scale better.

•  Generate more efficient code for a given algorithm. Clocks per CPU instruction, usually referred to as CPI, for an application is a ratio of the number of CPU clock ticks used per CPU instruction. CPI is a measure of the efficiency of generated code that is produced by a compiler. A change in the application, JVM, or operating system that reduces the CPI for an application will realize an improvement in its performance since it takes advantage of better and more optimized generated code.

There is a subtle difference between path length, which is closely tied to the algorithm choice, and cycles per instruction, CPI, which is the notion of generating more efficient code. In the former, the objective is to produce the shortest sequence of CPU instructions based on the algorithm choice. The latter's objective is to reduce the number of CPU clocks consumed per CPU instruction, that is, produce the most efficient code from a compiler. To illustrate with an example, suppose a CPU instruction results in a CPU cache miss, such as a load instruction. It may take several hundred CPU clock cycles for that load instruction to complete as a result of the CPU cache miss having to fetch data from memory rather than finding it in a CPU cache. However, if a prefetch instruction was inserted upstream in the sequence of instructions generated by a compiler to prefetch from memory the data being loaded by the load instruction, it is likely the number of clock cycles required to load the data will be less with the additional prefetch instruction since the prefetch can be done in parallel with other CPU instructions ahead of the load instruction. When the load instruction occurs, it can then find the data to be loaded in a CPU cache. However, the path length, the number of CPU instructions executed is longer as a result of the additional prefetch instruction. Therefore, it is possible to increase path length, yet make better use of available CPU cycles.

The following sections present several strategies to consider when analyzing a profile and looking for optimization opportunities. Generally, optimization opportunities for most applications fall into one of the general categories just described.

System or Kernel CPU Usage
Chapter 2, "Operating System Performance Monitoring," suggests one of the statistics to monitor is system or kernel CPU utilization. If CPU clock cycles are spent executing operating system or kernel code, those are CPU clock cycles that cannot be used to execute your application. Hence, a strategy to improve the performance of an application is to reduce the amount of time it spends consuming system or kernel CPU clock cycles. However, this strategy is not applicable in applications that spend little time executing system or kernel code. Monitoring the operating system for system or kernel CPU utilization provides the data as to whether it makes sense to employ this strategy.

The Oracle Solaris Performance Analyzer collects system or kernel CPU statistics as part of an application profile. This is done by selecting the View > Set Data Presentation menu in Performance Analyzer, choosing the Metrics tab, and setting the options to present system CPU utilization statistics, both inclusive or exclusive. Recall that inclusive metrics include not only the time spent in a given method, but also the time spent in methods it calls. In contrast, exclusive metrics report only the amount of time spent in a given method.

It can be useful to include both inclusive and exclusive metrics when first analyzing a profile. Looking at the inclusive metrics provides a sense of the path the application executes. Looking at the general path an application takes you may identify an opportunity for an alternative algorithm or approach that may offer better performance.

Figure 1 shows the Performance Analyzer's Set Data Presentation form with options selected to present both inclusive and exclusive System CPU metrics. Also notice the options selected report both the raw time value and the percentage of System CPU time.

Figure 1: Set system CPU data presentation


Figure 2: Exclusive system CPU

After clicking on the OK button, the Performance Analyzer displays the profile's System CPU inclusive and exclusive metrics in descending order. The arrow in the metric column header indicates how the data is presented and sorted. In Figure 2, the System CPU data is ordered by the exclusive metric (notice the arrow in the exclusive metric header and the icon indicating an exclusive metric).

Figure 2 shows a profile from an application that exhibits high system or kernel CPU utilization. You can see this application consumed about 33.5 seconds of System CPU in the java.io.FileOutputStream.write(int) method and about 11.6 seconds in a method called __write(), or about 65% and 22.5%, respectively. You can also get a sense of how significant the improvement can be realized by reducing the System CPU utilization of this application. The ideal situation for an application is to have 0% System CPU utilization. But for some applications that goal is difficult to achieve, especially if there is I/O involved, since I/O operations require a system call. In applications that require I/O, the goal is to reduce the frequency of making a system call. One approach to reduce the call frequency of an I/O system call is buffer the data so that larger chunks of data are read or written during I/O operations.

In the example shown in Figure 2, you can see the file write (output) operations are consuming a large amount of time as illustrated by the java.io.FileOutputStream.write(int) and __write() entries. To identify whether the write operations are buffered, you can use the Callers-Callees tab to walk up the call stack to see what methods are calling the FileOutputStream.write(int) method and the __write method. You walk up the call stack by selecting one of the callees from the upper panel and clicking the Set Center button. Figure 3 shows the Callers-Callees of the FileOutputStream.write(int) method.

The callers of FileOutputStream.write(int) are ExtOutputStream.write(int) and OutImpl.outc(int). 85.18% of the System CPU attributed to FileOutputStream.write(int) comes from its use in ExtOutputStream.write(int) and 14.82% of it from OutImpl.outc(int). A look at the implementation of ExtOutputStream.write(int) shows:

Figure 3: FileOutputStream.write(int) callers and callees

public void write(int b) throws IOException {
A look at the implementation of super.write(b) shows it is not a call to FileOutputStream.write(int):
public void write(int b) throws IOException {
crc = crc * 33 + b;

But the writer field in ExtOutputStream is declared as a FileOutputStream:

private FileOutputStream writer;

And it is initialized without any type of buffering:

writer = new FileOutputStream(currentFileName);
currentFileName is a field declared as a String:
private String currentFileName;

Hence, an optimization to be applied here is to buffer the data being written to FileOutputStream in ExtOutputStream using a BufferedOutputStream. This is done rather quickly and easily by chaining or wrapping the FileOutputStream in a BufferedOutputStream in an ExtOutputStream. Here is a quick listing of the changes required:

// Change FileOutputStream writer to a BufferedOutputStream
// private FileOutputStream writer;
private BufferedOutputStream writer;

Then chain a BufferedOutputStream and FileOutputStream at initialization time:

// Initialize BufferedOutputStream
// writer = new FileOutputStream(currentFileName);
writer = new BufferedOutputStream(
new FileOutputStream(currentFileName));

Writing to the BufferedOutputStream, instead of the FileOutputStream, in ExtOutputStream.write(int b) does not require any update since BufferOutputStream has a write() method that buffers bytes written to it. This ExtOutputStream.write(int b) method is shown here:

public void write(int b) throws IOException {
// No update required here,
// automatically uses BufferedOutputStream.write()


The other uses of the writer field must be inspected to ensure the use of BufferedOutputStream operates as expected. In ExtStreamOutput, there are two additional uses of the writer field, one in a method called reset() and another in checkResult(). These two methods are as follows:

public void reset() {
try {
if (diffOutputStream != null) {
diffOutputStream = null;
if (writer != null) {

} catch (IOException e) {
public void checkResult(int loopNumber) {
try {

} catch (IOException e) {
outProperties.put(propertyName, "" + getCRC());

The uses of writer as a BufferedOutputStream works as expected. It should be noted that the API specification for BufferedOutputStream.close() indicates it calls the BufferedOutputStream.flush() method and then calls the close() method of its underlying output stream, in this case the FileOutputStream.close() method. As a result, the FileOutputStream is not required to be explicitly closed, nor is the flush() method in ExtOutputStream.checkResult(int) required. A couple of additional enhancements worth consideration are:

1. A BufferedOutputStream can also be allocated with an optional buffered size. The default buffer size, as of Java 6, is 8192. If the application you are profiling is writing a large number of bytes, you might consider specifying an explicit size larger than 8192. If you specify an explicit size, consider a size that is a multiple of the operating systems page size since operating systems efficiently fetch memory that are multiples of the operating system page size. On Oracle Solaris, the pagesize command with no arguments reports the default page size. On Linux, the default page size can be obtained using the getconf PAGESIZE command. Windows on x86 and x64 platforms default to a 4K (4096) page size.

2. Change the ExtOutputStream.writer field from an explicit BufferedOutputStream type to an OutputStream type, that is, OutputStream writer = new BufferedOutputStream(), instead of BufferedOutputStream writer = new BufferedOutputStream(). This allows for additional flexibility in type of OutputStream, for example, ByteArrayOutputStream, DataOutputStream, FilterOutputStream, FileOutputStream, or BufferedOutputStream.

Looking back at Figure 3, a second method calls FileOutputStream.write(int) called org.w3c.tidy.OutImpl.outc(int), which is a method from a third-party library used in the profiled application. To reduce the amount of system CPU utilization used in a third-party supplied method, the best approach is to file a bug or enhancement request with the third-party library provider and include the information from the profile. If the source is accessible via an open source license and has acceptable license terms, you may consider further investigating and including additional information in the bug or enhancement request report.

After applying the changes identified in ExtOutputStream, using the BufferedOutputStream and its default constructor (not including the two additional enhancements just mentioned), and collecting a profile, the amount of system CPU utilization drops substantially.  Comparing the profiles in Figure 6-4 to those in Figure 6-2, you can see the amount of inclusive system CPU time spent in java.io.FileOutputStream has dropped from 45.182 seconds to 6.655 seconds (exclusive system CPU time is the second column).

Executing this application workload outside the profiler in a performance testing environment prior to making the modifications reports it took this application 427 seconds to run to completion.  In constrast, the modified version of the application workload that uses the BufferOutputStream in the same performance testing environment reports it runs to completion in 383 seconds. In other words, this application realized about a 10% improvement in its run to completion execution.

In addition, looking at the Callers-Callees tab for java.io.FileOutputStream.write(int), only the call to org.w3c.tidy.OutImpl.outc(int) remains as a significant consumer of the FileOutputStream.write(int) method. The Callers-Callees of FileOutputStream.write(int) are shown in Figure 6-5.

Figure 4: Reduced system CPU utilization

Figure 5: Callers-Callees after changes

Comparing the Callers-Callees in Figure 5, after the changes to ExtStream Output, with the Callers-Callees in Figure 3, prior to the changes, you can see the amount of attributable time spent in org.w3c.tidy.OutImpl.outc(int) stays close to the same. This should not be a surprise since the changes made to ExtStreamOutput now use BufferedOutputStream. But recall that the BufferedOutputStream invokes a FileOutputStream method when any of the underlying buffer in the BufferedOutputStream becomes full, the BufferedOutputStream.flush() method is called, or when the BufferedOutputSteam.close() method is called. If you look back at Figure 4 you see a FileOutputStream.writeBytes(byte[], int, int) method. This is the method that the BufferedOutputStream calls from ExtStreamOutput. Figure 6 shows the Callers-Callees tab for the FileOutputStream.writeBytes(byte[], int, int).

Selecting java.io.FileOutputStream.write(byte[], int, int) method from the upper Callee panel and clicking the Set Center button illustrates that BufferedOutputStream.flushBuffer() is its callee; see Figures 6 and 7.

Figure 6: Callers-Callees of FileOutputStream.writeBytes(byte[],int,int)

Figure 7: Callers-Callees of FileOutputStream.writeBytes(byte[], int, int)

Figure 8: Callers-Callees of BufferedOutputStream.flushBuffer()

Figure 9: Callers-Callees of BufferedOutputStream.write(int)

Selecting the BufferedOutputStream.flushBuffer() method in the upper Callee panel and clicking the Set Center button shows the callee of java.io.BufferedOutputStream.flushBuffer() is BufferedOutputStream.write(int). The Callers-Callees of BufferedOutputStream.flushBuffer() are shown in Figure 8.

Selecting the BufferedOutputStream.write(int) method in the upper Callee panel and clicking the Set Center button shows the callee of java.io.BufferedOutputStream.write(int) is ExtOutputStream.write(int), the method that has been modified. The Callers-Callees of BufferedOutputStream.write(int) are shown in Figure 9.

As mentioned earlier, the next step in reducing System CPU utilization for this application requires a modification to a third-party library, a library that holds the implementation of org.w3c.tidy.OutImpl.outc(int). It may be possible for the maintainers of the third-party library to implement a similar modification to OutImpl.outc(int) as just described and implemented for ExtOutputStream.write(int). However, the performance improvement realized will likely not be as significant since the profile suggests there is more System CPU utilization attributed to the call path of ExtOutputStream.write(int) than to OutImpl.outc(int); refer to Figure 6-3 for attributable System CPU utilization on callers of FileInputStream.write(int). In addition, looking at the amount of System CPU utilization consumed in OutImpl.outc(int), about 6.6 seconds, compared to the total application runtime of 383 seconds is rather small, about 1.5%. Hence, a modification to reduce the amount of System CPU utilization spent in OutImpl.outc(int) would likely not yield more than 1% to 2% improvement.

Applications that perform network I/O can employ a similar, general approach to reduce system CPU utilization as that just described in this section. That is, buffer both the data in the input and output stream used to write and read the data.

An additional strategy to reduce system CPU utilization for applications performing large amounts of network I/O is utilizing Java NIO nonblocking data structures. Java NIO was introduced in Java 1.4.2 with many runtime performance improvements added in Java 5 and Java 6. Java NIO nonblocking data structures allow for the ability to read or write as much data as possible in a single call to a network I/O (read or write) operation. Remember that every network I/O call eventually results in the invocation of an operating system's system call, which consumes system CPU utilization. The challenge with using Java NIO nonblocking data structures is it is more difficult to program than using blocking Java NIO or the older, more traditional Java SE blocking data structures such as java.net.Socket. In a Java NIO nonblocking output operation, you can write as many bytes as the operating system allows to be written. But you have to check the return value of the output operation to determine whether all the bytes you asked to be written have indeed been written. In a Java NIO nonblocking input operation, where you read as many bytes as are available, you have to check how many bytes have been read. You also have to implement some complex programming logic to deal with partially read protocol data units, or multiple protocol data units. That is, you may not be able to read enough bytes in a single read operation to construct a meaningful protocol data unit or message. In the case of blocking I/O, you simply wait until you generally read the specified number of bytes that constitute a full protocol data unit or message. Whether to migrate an application to utilize nonblocking network I/O operations should be decided upon by the application's performance needs. If you want to take advantage of the additional performance promised by using nonblocking Java NIO, you should consider using a general Java NIO framework to make the migration easier. Several popular Java NIO frameworks are available such as Project Grizzly and Apache MINA.

Another area where high System CPU utilization may show up is in applications experiencing heavy lock contention. Identifying lock contention in a profile and approaches to reduce lock contention are discussed in Part 2.

•   •   •

This excerpt is from the book, ‘Java Performance' by Charlie Hunt and Binu John, published in the Addison-Wesley Professional Java Series, ISBN 0137142528, Sept. 2011, Copyright 2012 Oracle America, Inc. For more info please visit: www.informit.com/title/0137142528

More Stories By Charlie Hunt

Charlie Hunt is the JVM performance lead engineer at Oracle. He is responsible for improving the performance of the HotSpot JVM and Java SE class libraries. He has also been involved in improving the performance of the Oracle GlassFish and Oracle WebLogic Server. A regular JavaOne speaker on Java performance, he also coauthored NetBeans™ IDE Field Guide (Prentice Hall, 2005).

More Stories By Binu John

Binu John is a senior performance engineer at Ning, Inc., where he focuses on improving the performance and scalability of the Ning platform to support millions of page views per month. Before that, he spent more than a decade working on Java-related performance issues at Sun Microsystems, where he served on Sun’s Enterprise Java Performance team. John has contributed to developing industry standard benchmarks such as SPECjms2007 and SPECJEnterprise2010; published several performance whitepapers; and contributed to java.net's XMLTest and WSTest benchmark projects.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.

@MicroservicesExpo Stories
I’m told that it has been 21 years since Scrum became public when Jeff Sutherland and I presented it at an Object-Oriented Programming, Systems, Languages & Applications (OOPSLA) workshop in Austin, TX, in October of 1995. Time sure does fly. Things mature. I’m still in the same building and at the same company where I first formulated Scrum.[1] Initially nobody knew of Scrum, yet it is now an open source body of knowledge translated into more than 30 languages[2] People use Scrum worldwide for ...
Thanks to Docker, it becomes very easy to leverage containers to build, ship, and run any Linux application on any kind of infrastructure. Docker is particularly helpful for microservice architectures because their successful implementation relies on a fast, efficient deployment mechanism – which is precisely one of the features of Docker. Microservice architectures are therefore becoming more popular, and are increasingly seen as an interesting option even for smaller projects, instead of being...
As the race for the presidency heats up, IT leaders would do well to recall the famous catchphrase from Bill Clinton’s successful 1992 campaign against George H. W. Bush: “It’s the economy, stupid.” That catchphrase is important, because IT economics are important. Especially when it comes to cloud. Application performance management (APM) for the cloud may turn out to be as much about those economics as it is about customer experience.
When you focus on a journey from up-close, you look at your own technical and cultural history and how you changed it for the benefit of the customer. This was our starting point: too many integration issues, 13 SWP days and very long cycles. It was evident that in this fast-paced industry we could no longer afford this reality. We needed something that would take us beyond reducing the development lifecycles, CI and Agile methodologies. We made a fundamental difference, even changed our culture...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
@DevOpsSummit taking place June 6-8, 2017 at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @DevOpsSummit at Cloud Expo New York Call for Papers is now open.
SYS-CON Events announced today that Dataloop.IO, an innovator in cloud IT-monitoring whose products help organizations save time and money, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Dataloop.IO is an emerging software company on the cutting edge of major IT-infrastructure trends including cloud computing and microservices. The company, founded in the UK but now based in San Fran...
Synthetic monitoring is hardly a new technology. It’s been around almost as long as the commercial World Wide Web has. But the importance of monitoring the performance and availability of a web application by simulating users’ interactions with that application, from around the globe, has never been more important. We’ve seen prominent vendors in the broad APM space add this technology with new development or partnerships just in the last 18 months.
Updating DevOps to the latest production data slows down your development cycle. Probably it is due to slow, inefficient conventional storage and associated copy data management practices. In his session at @DevOpsSummit at 20th Cloud Expo, Dhiraj Sehgal, in Product and Solution at Tintri, will talk about DevOps and cloud-focused storage to update hundreds of child VMs (different flavors) with updates from a master VM in minutes, saving hours or even days in each development cycle. He will also...
SYS-CON Events announced today that Linux Academy, the foremost online Linux and cloud training platform and community, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Linux Academy was founded on the belief that providing high-quality, in-depth training should be available at an affordable price. Industry leaders in quality training, provided services, and student certification passes, its goal is to c...
The unique combination of Amazon Web Services and Cloud Raxak, a Gartner Cool Vendor in IT Automation, provides a seamless and cost-effective way of securely moving on-premise IT workloads to Amazon Web Services. Any enterprise can now leverage the cloud, manage risk, and maintain continuous security compliance. Forrester's analysis shows that enterprises need automated security to lower security risk and decrease IT operational costs. Through the seamless integration into Amazon Web Services, ...
A lot of time, resources and energy has been invested over the past few years on de-siloing development and operations. And with good reason. DevOps is enabling organizations to more aggressively increase their digital agility, while at the same time reducing digital costs and risks. But as 2017 approaches, the hottest trends in DevOps aren’t specifically about dev or ops. They’re about testing, security, and metrics.
"We provide DevOps solutions. We also partner with some key players in the DevOps space and we use the technology that we partner with to engineer custom solutions for different organizations," stated Himanshu Chhetri, CTO of Addteq, in this SYS-CON.tv interview at DevOps at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
2016 has been an amazing year for Docker and the container industry. We had 3 major releases of Docker engine this year , and tremendous increase in usage. The community has been following along and contributing amazing Docker resources to help you learn and get hands-on experience. Here’s some of the top read and viewed content for the year. Of course releases are always really popular, particularly when they fit requests we had from the community.
Hardware virtualization and cloud computing allowed us to increase resource utilization and increase our flexibility to respond to business demand. Docker Containers are the next quantum leap - Are they?! Databases always represented an additional set of challenges unique to running workloads requiring a maximum of I/O, network, CPU resources combined with data locality.
"We got started as search consultants. On the services side of the business we have help organizations save time and save money when they hit issues that everyone more or less hits when their data grows," noted Otis Gospodnetić, Founder of Sematext, in this SYS-CON.tv interview at @DevOpsSummit, held June 9-11, 2015, at the Javits Center in New York City.
An overall theme of Cloud computing and the specific practices within it is fundamentally one of automation. The core value of technology is to continually automate low level procedures to free up people to work on more value add activities, ultimately leading to the utopian goal of full Autonomic Computing. For example a great way to define your plan for DevOps tool chain adoption is through this lens. In this TechTarget article they outline a simple maturity model for planning this.
The emerging Internet of Everything creates tremendous new opportunities for customer engagement and business model innovation. However, enterprises must overcome a number of critical challenges to bring these new solutions to market. In his session at @ThingsExpo, Michael Martin, CTO/CIO at nfrastructure, outlined these key challenges and recommended approaches for overcoming them to achieve speed and agility in the design, development and implementation of Internet of Everything solutions with...
You often hear the two titles of "DevOps" and "Immutable Infrastructure" used independently. In his session at DevOps Summit, John Willis, Technical Evangelist for Docker, covered the union between the two topics and why this is important. He provided an overview of Immutable Infrastructure then showed how an Immutable Continuous Delivery pipeline can be applied as a best practice for "DevOps." He ended the session with some interesting case study examples.
DevOps tends to focus on the relationship between Dev and Ops, putting an emphasis on the ops and application infrastructure. But that’s changing with microservices architectures. In her session at DevOps Summit, Lori MacVittie, Evangelist for F5 Networks, will focus on how microservices are changing the underlying architectures needed to scale, secure and deliver applications based on highly distributed (micro) services and why that means an expansion into “the network” for DevOps.