Welcome!

Microservices Expo Authors: SmartBear Blog, Flint Brenton, Anders Wallgren, Jenn Steele, XebiaLabs Blog

Related Topics: Java IoT, Microservices Expo, Recurring Revenue

Java IoT: Book Excerpt

Book Excerpt: Java Application Profiling Tips and Tricks

Part 1 - Common types of performance issues

In the book Java Performance, Chapter 5, "Java Application Profiling," presented the basic concepts of using a modern Java profiler such as the Oracle Solaris Studio Performance Analyzer and NetBeans Profiler. It did not, however, show any specific tips and tricks in using the tools to identify performance issues and approaches of how to resolve them. This is the purpose of this excerpt. Its intention is to show how to use the tools to identify performance issues and take corrective actions to resolve them. This excerpt looks at several of the more common types of performance issues the authors have observed through many years of working as Java performance engineers.

Performance Opportunities
Most Java performance opportunities fall into one or more of the following categories:

•  Using a more efficient algorithm. The largest gains in the performance of an application come from the use of a more efficient algorithm. The use of a more efficient algorithm allows an application to execute with fewer CPU instructions, also known as a shorter path length. An application that executes with a shorter path length generally executes faster. Many different changes can lead to a shorter path length. At the highest level of the application, using a different data structure or modifying its implementation can lead to a shorter path length. Many applications that suffer application performance issues often use inappropriate data structures. There is no substitute for choosing the proper data structure and algorithm. As profiles are analyzed, take notice of the data structures and the algorithms used. Optimal performance can be realized when the best data structures and algorithms are utilized.

•  Reduce lock contention. Contending for access to a shared resource inhibits an application's capability to scale to a large number of software threads and across a large number of CPUs. Changes to an application that allow for less frequent lock contention and less duration of locking allow an application to scale better.

•  Generate more efficient code for a given algorithm. Clocks per CPU instruction, usually referred to as CPI, for an application is a ratio of the number of CPU clock ticks used per CPU instruction. CPI is a measure of the efficiency of generated code that is produced by a compiler. A change in the application, JVM, or operating system that reduces the CPI for an application will realize an improvement in its performance since it takes advantage of better and more optimized generated code.

There is a subtle difference between path length, which is closely tied to the algorithm choice, and cycles per instruction, CPI, which is the notion of generating more efficient code. In the former, the objective is to produce the shortest sequence of CPU instructions based on the algorithm choice. The latter's objective is to reduce the number of CPU clocks consumed per CPU instruction, that is, produce the most efficient code from a compiler. To illustrate with an example, suppose a CPU instruction results in a CPU cache miss, such as a load instruction. It may take several hundred CPU clock cycles for that load instruction to complete as a result of the CPU cache miss having to fetch data from memory rather than finding it in a CPU cache. However, if a prefetch instruction was inserted upstream in the sequence of instructions generated by a compiler to prefetch from memory the data being loaded by the load instruction, it is likely the number of clock cycles required to load the data will be less with the additional prefetch instruction since the prefetch can be done in parallel with other CPU instructions ahead of the load instruction. When the load instruction occurs, it can then find the data to be loaded in a CPU cache. However, the path length, the number of CPU instructions executed is longer as a result of the additional prefetch instruction. Therefore, it is possible to increase path length, yet make better use of available CPU cycles.

The following sections present several strategies to consider when analyzing a profile and looking for optimization opportunities. Generally, optimization opportunities for most applications fall into one of the general categories just described.

System or Kernel CPU Usage
Chapter 2, "Operating System Performance Monitoring," suggests one of the statistics to monitor is system or kernel CPU utilization. If CPU clock cycles are spent executing operating system or kernel code, those are CPU clock cycles that cannot be used to execute your application. Hence, a strategy to improve the performance of an application is to reduce the amount of time it spends consuming system or kernel CPU clock cycles. However, this strategy is not applicable in applications that spend little time executing system or kernel code. Monitoring the operating system for system or kernel CPU utilization provides the data as to whether it makes sense to employ this strategy.

The Oracle Solaris Performance Analyzer collects system or kernel CPU statistics as part of an application profile. This is done by selecting the View > Set Data Presentation menu in Performance Analyzer, choosing the Metrics tab, and setting the options to present system CPU utilization statistics, both inclusive or exclusive. Recall that inclusive metrics include not only the time spent in a given method, but also the time spent in methods it calls. In contrast, exclusive metrics report only the amount of time spent in a given method.

Tip
It can be useful to include both inclusive and exclusive metrics when first analyzing a profile. Looking at the inclusive metrics provides a sense of the path the application executes. Looking at the general path an application takes you may identify an opportunity for an alternative algorithm or approach that may offer better performance.

Figure 1 shows the Performance Analyzer's Set Data Presentation form with options selected to present both inclusive and exclusive System CPU metrics. Also notice the options selected report both the raw time value and the percentage of System CPU time.

Figure 1: Set system CPU data presentation

 

Figure 2: Exclusive system CPU

After clicking on the OK button, the Performance Analyzer displays the profile's System CPU inclusive and exclusive metrics in descending order. The arrow in the metric column header indicates how the data is presented and sorted. In Figure 2, the System CPU data is ordered by the exclusive metric (notice the arrow in the exclusive metric header and the icon indicating an exclusive metric).

Figure 2 shows a profile from an application that exhibits high system or kernel CPU utilization. You can see this application consumed about 33.5 seconds of System CPU in the java.io.FileOutputStream.write(int) method and about 11.6 seconds in a method called __write(), or about 65% and 22.5%, respectively. You can also get a sense of how significant the improvement can be realized by reducing the System CPU utilization of this application. The ideal situation for an application is to have 0% System CPU utilization. But for some applications that goal is difficult to achieve, especially if there is I/O involved, since I/O operations require a system call. In applications that require I/O, the goal is to reduce the frequency of making a system call. One approach to reduce the call frequency of an I/O system call is buffer the data so that larger chunks of data are read or written during I/O operations.

In the example shown in Figure 2, you can see the file write (output) operations are consuming a large amount of time as illustrated by the java.io.FileOutputStream.write(int) and __write() entries. To identify whether the write operations are buffered, you can use the Callers-Callees tab to walk up the call stack to see what methods are calling the FileOutputStream.write(int) method and the __write method. You walk up the call stack by selecting one of the callees from the upper panel and clicking the Set Center button. Figure 3 shows the Callers-Callees of the FileOutputStream.write(int) method.

The callers of FileOutputStream.write(int) are ExtOutputStream.write(int) and OutImpl.outc(int). 85.18% of the System CPU attributed to FileOutputStream.write(int) comes from its use in ExtOutputStream.write(int) and 14.82% of it from OutImpl.outc(int). A look at the implementation of ExtOutputStream.write(int) shows:

Figure 3: FileOutputStream.write(int) callers and callees

public void write(int b) throws IOException {
super.write(b);
writer.write((byte)b);
}
A look at the implementation of super.write(b) shows it is not a call to FileOutputStream.write(int):
public void write(int b) throws IOException {
crc = crc * 33 + b;
}

But the writer field in ExtOutputStream is declared as a FileOutputStream:

private FileOutputStream writer;

And it is initialized without any type of buffering:

writer = new FileOutputStream(currentFileName);
currentFileName is a field declared as a String:
private String currentFileName;

Hence, an optimization to be applied here is to buffer the data being written to FileOutputStream in ExtOutputStream using a BufferedOutputStream. This is done rather quickly and easily by chaining or wrapping the FileOutputStream in a BufferedOutputStream in an ExtOutputStream. Here is a quick listing of the changes required:

// Change FileOutputStream writer to a BufferedOutputStream
// private FileOutputStream writer;
private BufferedOutputStream writer;

Then chain a BufferedOutputStream and FileOutputStream at initialization time:

// Initialize BufferedOutputStream
// writer = new FileOutputStream(currentFileName);
writer = new BufferedOutputStream(
new FileOutputStream(currentFileName));

Writing to the BufferedOutputStream, instead of the FileOutputStream, in ExtOutputStream.write(int b) does not require any update since BufferOutputStream has a write() method that buffers bytes written to it. This ExtOutputStream.write(int b) method is shown here:

public void write(int b) throws IOException {
super.write(b);
// No update required here,
// automatically uses BufferedOutputStream.write()
writer.write((byte)b);

}

The other uses of the writer field must be inspected to ensure the use of BufferedOutputStream operates as expected. In ExtStreamOutput, there are two additional uses of the writer field, one in a method called reset() and another in checkResult(). These two methods are as follows:

public void reset() {
super.reset();
try {
if (diffOutputStream != null) {
diffOutputStream.flush();
diffOutputStream.close();
diffOutputStream = null;
}
if (writer != null) {
writer.close();

}
} catch (IOException e) {
e.printStackTrace();
}
}
public void checkResult(int loopNumber) {
try {
writer.flush();
writer.close();

} catch (IOException e) {
e.printStackTrace();
}
check(validiationProperties.getProperty(propertyName));
outProperties.put(propertyName, "" + getCRC());
reset();
}

The uses of writer as a BufferedOutputStream works as expected. It should be noted that the API specification for BufferedOutputStream.close() indicates it calls the BufferedOutputStream.flush() method and then calls the close() method of its underlying output stream, in this case the FileOutputStream.close() method. As a result, the FileOutputStream is not required to be explicitly closed, nor is the flush() method in ExtOutputStream.checkResult(int) required. A couple of additional enhancements worth consideration are:

1. A BufferedOutputStream can also be allocated with an optional buffered size. The default buffer size, as of Java 6, is 8192. If the application you are profiling is writing a large number of bytes, you might consider specifying an explicit size larger than 8192. If you specify an explicit size, consider a size that is a multiple of the operating systems page size since operating systems efficiently fetch memory that are multiples of the operating system page size. On Oracle Solaris, the pagesize command with no arguments reports the default page size. On Linux, the default page size can be obtained using the getconf PAGESIZE command. Windows on x86 and x64 platforms default to a 4K (4096) page size.

2. Change the ExtOutputStream.writer field from an explicit BufferedOutputStream type to an OutputStream type, that is, OutputStream writer = new BufferedOutputStream(), instead of BufferedOutputStream writer = new BufferedOutputStream(). This allows for additional flexibility in type of OutputStream, for example, ByteArrayOutputStream, DataOutputStream, FilterOutputStream, FileOutputStream, or BufferedOutputStream.

Looking back at Figure 3, a second method calls FileOutputStream.write(int) called org.w3c.tidy.OutImpl.outc(int), which is a method from a third-party library used in the profiled application. To reduce the amount of system CPU utilization used in a third-party supplied method, the best approach is to file a bug or enhancement request with the third-party library provider and include the information from the profile. If the source is accessible via an open source license and has acceptable license terms, you may consider further investigating and including additional information in the bug or enhancement request report.

After applying the changes identified in ExtOutputStream, using the BufferedOutputStream and its default constructor (not including the two additional enhancements just mentioned), and collecting a profile, the amount of system CPU utilization drops substantially.  Comparing the profiles in Figure 6-4 to those in Figure 6-2, you can see the amount of inclusive system CPU time spent in java.io.FileOutputStream has dropped from 45.182 seconds to 6.655 seconds (exclusive system CPU time is the second column).

Executing this application workload outside the profiler in a performance testing environment prior to making the modifications reports it took this application 427 seconds to run to completion.  In constrast, the modified version of the application workload that uses the BufferOutputStream in the same performance testing environment reports it runs to completion in 383 seconds. In other words, this application realized about a 10% improvement in its run to completion execution.

In addition, looking at the Callers-Callees tab for java.io.FileOutputStream.write(int), only the call to org.w3c.tidy.OutImpl.outc(int) remains as a significant consumer of the FileOutputStream.write(int) method. The Callers-Callees of FileOutputStream.write(int) are shown in Figure 6-5.

Figure 4: Reduced system CPU utilization

Figure 5: Callers-Callees after changes

Comparing the Callers-Callees in Figure 5, after the changes to ExtStream Output, with the Callers-Callees in Figure 3, prior to the changes, you can see the amount of attributable time spent in org.w3c.tidy.OutImpl.outc(int) stays close to the same. This should not be a surprise since the changes made to ExtStreamOutput now use BufferedOutputStream. But recall that the BufferedOutputStream invokes a FileOutputStream method when any of the underlying buffer in the BufferedOutputStream becomes full, the BufferedOutputStream.flush() method is called, or when the BufferedOutputSteam.close() method is called. If you look back at Figure 4 you see a FileOutputStream.writeBytes(byte[], int, int) method. This is the method that the BufferedOutputStream calls from ExtStreamOutput. Figure 6 shows the Callers-Callees tab for the FileOutputStream.writeBytes(byte[], int, int).

Selecting java.io.FileOutputStream.write(byte[], int, int) method from the upper Callee panel and clicking the Set Center button illustrates that BufferedOutputStream.flushBuffer() is its callee; see Figures 6 and 7.

Figure 6: Callers-Callees of FileOutputStream.writeBytes(byte[],int,int)

Figure 7: Callers-Callees of FileOutputStream.writeBytes(byte[], int, int)

Figure 8: Callers-Callees of BufferedOutputStream.flushBuffer()

Figure 9: Callers-Callees of BufferedOutputStream.write(int)

Selecting the BufferedOutputStream.flushBuffer() method in the upper Callee panel and clicking the Set Center button shows the callee of java.io.BufferedOutputStream.flushBuffer() is BufferedOutputStream.write(int). The Callers-Callees of BufferedOutputStream.flushBuffer() are shown in Figure 8.

Selecting the BufferedOutputStream.write(int) method in the upper Callee panel and clicking the Set Center button shows the callee of java.io.BufferedOutputStream.write(int) is ExtOutputStream.write(int), the method that has been modified. The Callers-Callees of BufferedOutputStream.write(int) are shown in Figure 9.

As mentioned earlier, the next step in reducing System CPU utilization for this application requires a modification to a third-party library, a library that holds the implementation of org.w3c.tidy.OutImpl.outc(int). It may be possible for the maintainers of the third-party library to implement a similar modification to OutImpl.outc(int) as just described and implemented for ExtOutputStream.write(int). However, the performance improvement realized will likely not be as significant since the profile suggests there is more System CPU utilization attributed to the call path of ExtOutputStream.write(int) than to OutImpl.outc(int); refer to Figure 6-3 for attributable System CPU utilization on callers of FileInputStream.write(int). In addition, looking at the amount of System CPU utilization consumed in OutImpl.outc(int), about 6.6 seconds, compared to the total application runtime of 383 seconds is rather small, about 1.5%. Hence, a modification to reduce the amount of System CPU utilization spent in OutImpl.outc(int) would likely not yield more than 1% to 2% improvement.

Tip
Applications that perform network I/O can employ a similar, general approach to reduce system CPU utilization as that just described in this section. That is, buffer both the data in the input and output stream used to write and read the data.

An additional strategy to reduce system CPU utilization for applications performing large amounts of network I/O is utilizing Java NIO nonblocking data structures. Java NIO was introduced in Java 1.4.2 with many runtime performance improvements added in Java 5 and Java 6. Java NIO nonblocking data structures allow for the ability to read or write as much data as possible in a single call to a network I/O (read or write) operation. Remember that every network I/O call eventually results in the invocation of an operating system's system call, which consumes system CPU utilization. The challenge with using Java NIO nonblocking data structures is it is more difficult to program than using blocking Java NIO or the older, more traditional Java SE blocking data structures such as java.net.Socket. In a Java NIO nonblocking output operation, you can write as many bytes as the operating system allows to be written. But you have to check the return value of the output operation to determine whether all the bytes you asked to be written have indeed been written. In a Java NIO nonblocking input operation, where you read as many bytes as are available, you have to check how many bytes have been read. You also have to implement some complex programming logic to deal with partially read protocol data units, or multiple protocol data units. That is, you may not be able to read enough bytes in a single read operation to construct a meaningful protocol data unit or message. In the case of blocking I/O, you simply wait until you generally read the specified number of bytes that constitute a full protocol data unit or message. Whether to migrate an application to utilize nonblocking network I/O operations should be decided upon by the application's performance needs. If you want to take advantage of the additional performance promised by using nonblocking Java NIO, you should consider using a general Java NIO framework to make the migration easier. Several popular Java NIO frameworks are available such as Project Grizzly and Apache MINA.

Another area where high System CPU utilization may show up is in applications experiencing heavy lock contention. Identifying lock contention in a profile and approaches to reduce lock contention are discussed in Part 2.

•   •   •

This excerpt is from the book, ‘Java Performance' by Charlie Hunt and Binu John, published in the Addison-Wesley Professional Java Series, ISBN 0137142528, Sept. 2011, Copyright 2012 Oracle America, Inc. For more info please visit: www.informit.com/title/0137142528

More Stories By Charlie Hunt

Charlie Hunt is the JVM performance lead engineer at Oracle. He is responsible for improving the performance of the HotSpot JVM and Java SE class libraries. He has also been involved in improving the performance of the Oracle GlassFish and Oracle WebLogic Server. A regular JavaOne speaker on Java performance, he also coauthored NetBeans™ IDE Field Guide (Prentice Hall, 2005).

More Stories By Binu John

Binu John is a senior performance engineer at Ning, Inc., where he focuses on improving the performance and scalability of the Ning platform to support millions of page views per month. Before that, he spent more than a decade working on Java-related performance issues at Sun Microsystems, where he served on Sun’s Enterprise Java Performance team. John has contributed to developing industry standard benchmarks such as SPECjms2007 and SPECJEnterprise2010; published several performance whitepapers; and contributed to java.net's XMLTest and WSTest benchmark projects.

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@MicroservicesExpo Stories
As the software delivery industry continues to evolve and mature, the challenge of managing the growing list of the tools and processes becomes more daunting every day. Today, Application Lifecycle Management (ALM) platforms are proving most valuable by providing the governance, management and coordination for every stage of development, deployment and release. Recently, I spoke with Madison Moore at SD Times about the changing market and where ALM is headed.
If there is anything we have learned by now, is that every business paves their own unique path for releasing software- every pipeline, implementation and practices are a bit different, and DevOps comes in all shapes and sizes. Software delivery practices are often comprised of set of several complementing (or even competing) methodologies – such as leveraging Agile, DevOps and even a mix of ITIL, to create the combination that’s most suitable for your organization and that maximize your busines...
These days I mostly make my living as a consultant. Consultants in general are probably not the best loved group in the world. It is common to think of consultants wafting-in to your organization, telling you things that you already know and advising you to “change your culture”, whatever that means. Subsequently they depart, no-doubt with a fat fee, and leave you as you were before with the same problems and no progress made.
Struggling to keep up with increasing application demand? Learn how Platform as a Service (PaaS) can streamline application development processes and make resource management easy.
In the rush to compete in the digital age, a successful digital transformation is essential, but many organizations are setting themselves up for failure. There’s a common misconception that the process is just about technology, but it’s not. It’s about your business. It shouldn’t be treated as an isolated IT project; it should be driven by business needs with the committed involvement of a range of stakeholders.
New Relic, Inc. has announced a set of new features across the New Relic Software Analytics Cloud that offer IT operations teams increased visibility, and the ability to diagnose and resolve performance problems quickly. The new features further IT operations teams’ ability to leverage data and analytics, as well as drive collaboration and a common, shared understanding between teams. Software teams are under pressure to resolve performance issues quickly and improve availability, as the comple...
The goal of any tech business worth its salt is to provide the best product or service to its clients in the most efficient and cost-effective way possible. This is just as true in the development of software products as it is in other product design services. Microservices, an app architecture style that leans mostly on independent, self-contained programs, are quickly becoming the new norm, so to speak. With this change comes a declining reliance on older SOAs like COBRA, a push toward more s...
Join IBM June 8 at 18th Cloud Expo at the Javits Center in New York City, NY, and learn how to innovate like a startup and scale for the enterprise. You need to deliver quality applications faster and cheaper, attract and retain customers with an engaging experience across devices, and seamlessly integrate your enterprise systems. And you can't take 12 months to do it.
This is not a small hotel event. It is also not a big vendor party where politicians and entertainers are more important than real content. This is Cloud Expo, the world's longest-running conference and exhibition focused on Cloud Computing and all that it entails. If you want serious presentations and valuable insight about Cloud Computing for three straight days, then register now for Cloud Expo.
The proper isolation of resources is essential for multi-tenant environments. The traditional approach to isolate resources is, however, rather heavyweight. In his session at 18th Cloud Expo, Igor Drobiazko, co-founder of elastic.io, will draw upon their own experience with operating a Docker container-based infrastructure on a large scale and present a lightweight solution for resource isolation using microservices. He will also discuss the implementation of microservices in data and applicat...
Digital means customer preferences and behavior are driving enterprise technology decisions to be sure, but let’s not forget our employees. After all, when we say customer, we mean customer writ large, including partners, supply chain participants, and yes, those salaried denizens whose daily labor forms the cornerstone of the enterprise. While your customers bask in the warm rays of your digital efforts, are your employees toiling away in the dark recesses of your enterprise, pecking data into...
SYS-CON Events announced today that Stratoscale, the software company developing the next generation data center operating system, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Stratoscale is revolutionizing the data center with a zero-to-cloud-in-minutes solution. With Stratoscale’s hardware-agnostic, Software Defined Data Center (SDDC) solution to store everything, run anything and scale everywhere...
SYS-CON Events announced today that Men & Mice, the leading global provider of DNS, DHCP and IP address management overlay solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. The Men & Mice Suite overlay solution is already known for its powerful application in heterogeneous operating environments, enabling enterprises to scale without fuss. Building on a solid range of diverse platform support,...
You deployed your app with the Bluemix PaaS and it's gaining some serious traction, so it's time to make some tweaks. Did you design your application in a way that it can scale in the cloud? Were you even thinking about the cloud when you built the app? If not, chances are your app is going to break. Check out this webcast to learn various techniques for designing applications that will scale successfully in Bluemix, for the confidence you need to take your apps to the next level and beyond.
I had the opportunity to catch up with Chris Corriere - DevOps Engineer at AutoTrader - to talk about his experiences in the realm of Rugged DevOps. We discussed automation, culture and collaboration, and which thought leaders he is following. Chris Corriere: Hey, I'm Chris Corriere. I'm a DevOps Engineer AutoTrader. Derek Weeks: Today we're going to talk about Rugged DevOps. It's a subject that's gaining a lot of traction in the community but not a lot of people are really familiar with wh...
SYS-CON Events announced today that DatacenterDynamics has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY. DatacenterDynamics is a brand of DCD Group, a global B2B media and publishing company that develops products to help senior professionals in the world's most ICT dependent organizations make risk-based infrastructure and capacity decisions.
With DevOps becoming more well-known and established practice in nearly every industry that delivers software, it is important to continually reassess its efficacy. This week’s top 10 includes a discussion on how the quick uptake of DevOps adoption in the enterprise has posed some serious challenges. Additionally, organizations who have taken the DevOps plunge must find ways to find, hire and keep their DevOps talent in order to keep the machine running smoothly.
Call it DevOps or not, if you are concerned about releasing more code faster and at a higher quality, the resulting software delivery chain and process will look and smell like DevOps. But for existing development teams, no matter what the velocity objective is, getting from here to there is not something that can be done without a plan. Moving your release cadence from months to weeks is not just about learning Agile practices and getting some automation tools. It involves people, tooling and ...
Between the mockups and specs produced by analysts, and resulting applications built by developers, there exists a gulf where projects fail, costs spiral, and applications disappoint. Methodologies like Agile attempt to address this with intensified communication, with partial success but many limitations. In his session at 18th Cloud Expo, Charles Kendrick, CTO & Chief Architect at Isomorphic Software, will present a revolutionary model enabled by new technologies. Learn how business and devel...
The notion of customer journeys, of course, are central to the digital marketer’s playbook. Clearly, enterprises should focus their digital efforts on such journeys, as they represent customer interactions over time. But making customer journeys the centerpiece of the enterprise architecture, however, leaves more questions than answers. The challenge arises when EAs consider the context of the customer journey in the overall architecture as well as the architectural elements that make up each...