Multicore Systems

Ready or not, here they come

Another type of resource starvation problem stems from an application's inability to take on more work. For example, we encountered an Interactive Voice Response application with only a single thread performing accepts on a ServerSocket. The entire application was therefore bounded by the single-threaded accept rate. On the eight-way box, the single accept thread was sufficient to saturate the machine because the work arrival rate could exceed the service rate. With so many processor cores available to process transactions on Azul, however, the queuing paradigm was inverted and the arrival rate was much lower than the service rate. Adding more load therefore increased the response time even though there were plenty of idle cores. The solution was to modify the application to launch multiple accept threads on different ports. Once this was accomplished, many more sockets could be handled, which meant more work came into the system and more processors were loaded up with work. The lesson here is to look not only for contention bottlenecks resulting from locks in your application, but also for bottlenecks in the paths that move work into or out of the system.
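As a rough illustration of the fix described above, the following sketch starts one accept thread per listening port. The class name MultiPortAcceptor, the use of OS-assigned loopback ports, and the demo helper are all assumptions made for this example; the original application's code is not shown in this article, and a real server would hand each accepted socket to a worker pool rather than closing it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: one accept thread per listening port. */
final class MultiPortAcceptor implements AutoCloseable {
    private final List<ServerSocket> listeners = new ArrayList<>();
    private final AtomicInteger accepted = new AtomicInteger();

    /** Opens portCount loopback listeners and starts one accept thread for each. */
    MultiPortAcceptor(int portCount) {
        try {
            for (int i = 0; i < portCount; i++) {
                // Port 0 asks the OS for any free port (an assumption made for the demo).
                ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
                listeners.add(listener);
                Thread acceptThread = new Thread(() -> acceptLoop(listener), "accept-" + i);
                acceptThread.setDaemon(true);
                acceptThread.start();
            }
        } catch (IOException e) {
            close();
            throw new UncheckedIOException(e);
        }
    }

    private void acceptLoop(ServerSocket listener) {
        while (!listener.isClosed()) {
            try (Socket client = listener.accept()) {
                // Count the connection; the socket is closed when this block exits.
                accepted.incrementAndGet();
            } catch (IOException e) {
                // accept() fails once the listener is closed; the loop condition then ends the thread.
            }
        }
    }

    List<Integer> ports() {
        List<Integer> result = new ArrayList<>();
        for (ServerSocket listener : listeners) {
            result.add(listener.getLocalPort());
        }
        return result;
    }

    int acceptedCount() {
        return accepted.get();
    }

    @Override
    public void close() {
        for (ServerSocket listener : listeners) {
            try {
                listener.close();
            } catch (IOException ignored) {
                // Nothing useful to do while shutting down.
            }
        }
    }

    /** Connects once to every port and returns how many connections were accepted. */
    static int demo(int portCount) {
        try (MultiPortAcceptor server = new MultiPortAcceptor(portCount)) {
            for (int port : server.ports()) {
                try (Socket client = new Socket(InetAddress.getLoopbackAddress(), port)) {
                    client.setSoTimeout(5000);
                    // Blocks until the server has accepted, counted, and closed the connection.
                    client.getInputStream().read();
                }
            }
            return server.acceptedCount();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("accepted connections: " + demo(4));
    }
}
```

Because each listener has its own thread, a slow accept on one port does not delay the others, which is the property the fix relied on.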

Once you have provided the application with all of the processor and memory resources it needs, removed as much contention as you can, and made sure that enough work can get into and out of the system, it's time to start looking at improving application algorithms. The theoretical basis for this step is known as parallelization, and a taxonomy has been developed to describe various levels of it. The most parallel algorithms are known as "embarrassingly parallel" (EP), meaning that no particular effort is needed to segment the problem into a very large number of parallel tasks. Probably the best-known EP problem is SETI@Home (and its cousins), but Monte Carlo simulation is frequently encountered in commercial applications to assess risk and predict returns of various financial instruments. Certain sorting and searching algorithms are parallelizable, so look for opportunities to split such problems up into smaller pieces (see Sorting and Searching in www-cs-faculty.stanford.edu/~knuth/taocp.html).
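To make the idea of an embarrassingly parallel problem concrete, here is a minimal sketch of a Monte Carlo run split into independent tasks. It estimates pi rather than pricing a financial instrument, purely as a stand-in; the class name MonteCarloPi, the per-task seeds, and the task counts are assumptions chosen for the example. The point is that the tasks share nothing, so the only coordination is adding up their results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: independent Monte Carlo tasks with no shared state. */
final class MonteCarloPi {

    /** Counts random points in the unit square that fall inside the quarter circle. */
    static long hits(long samples, long seed) {
        SplittableRandom rng = new SplittableRandom(seed);
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return inside;
    }

    /** Runs the given number of independent tasks and combines their hit counts. */
    static double estimate(int tasks, long samplesPerTask) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                final long seed = t; // a fixed seed per task keeps the run repeatable
                results.add(pool.submit(() -> hits(samplesPerTask, seed)));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return 4.0 * total / ((double) tasks * samplesPerTask);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("pi is roughly " + estimate(8, 200_000));
    }
}
```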

For non-J2EE applications, there are a number of clever technologies for parallelizing algorithms, not the least of which is Doug Lea's Fork/Join Framework, which has been shown to be much more efficient than java.lang.Thread and includes techniques for both work stealing and efficient push and pop operations (http://gee.cs.oswego.edu/dl/papers/fj.pdf). Because many large Java applications have dynamic workloads, divide-and-conquer algorithms must be careful to use only free resources rather than taking over the entire server. Indeed, Azul provides the ability to set minima and maxima for the processor count guaranteed to each VM, so on that platform it would not be possible to starve other applications. Nevertheless, within a single JVM, it would be possible to starve other threads. java.lang.Runtime.availableProcessors() gives a static view of the number of processors available to the JVM, but we are working on extensions that will give an instantaneous view of free resources available to the JVM.
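The following sketch shows the divide-and-conquer style in question, using the java.util.concurrent.ForkJoinPool and RecursiveTask classes found in later JDKs (Java 7 onward), which postdate this article; treat them as a stand-in for the framework described in the paper. The class name ForkJoinSum, the THRESHOLD cutoff, and the maxThreads cap are assumptions for illustration. The cap is one simple way to avoid taking over every processor in the JVM.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical sketch: a divide-and-conquer sum with a bounded number of worker threads. */
final class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // assumed cutoff below which we sum sequentially

    private final long[] data;
    private final int from;
    private final int to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                      // make the left half available to idle workers
        long rightSum = right.compute();  // work on the right half in this thread
        return left.join() + rightSum;
    }

    /** Sums the array using at most maxThreads workers, never more than the JVM reports. */
    static long sum(long[] data, int maxThreads) {
        int reported = Runtime.getRuntime().availableProcessors(); // static view only
        int parallelism = Math.max(1, Math.min(maxThreads, reported));
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ForkJoinSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data, 4));
    }
}
```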

For J2EE applications, it's not possible to spawn threads inside, for example, a servlet or an EJB, and the JSR for an App Server Work Manager now seems to be dead (http://jcp.org/en/jsr/detail?id=237). Developers will have to resort to something like MDBs to hand work off to other threads. J2EE apps often have transaction types that can be decomposed into smaller pieces that can be performed simultaneously, such as searching multiple databases, prewarming a database or object cache, or processing a large batch of work. By decomposing these transactions into smaller pieces and then generating messages for an MDB, you can gain additional parallelization. This mechanism could be expensive, however, so to avoid thrashing you should make sure that the additional cost of generating the messages and running the MDBs is worth it compared with performing the operations in-line.

One of the difficulties sysadmins experience with J2EE is the amount of time it takes an application to start up inside a container. It should be possible for the application server vendors to take advantage of SMP systems by multithreading their startup sequences. One caveat for application developers, however, is that the current startup is likely single threaded, and there may be unexplored dependencies in the application code that will behave improperly when multithreaded. Indeed, this is a more generic problem: anytime you write a component that is not explicitly single threaded (such as an EJB or a servlet implementing SingleThreadModel), be careful to make sure your code is multithread safe!

Within the Java class libraries, there are numerous algorithms that could also benefit from parallelization, specifically those associated with searching and sorting of collections. For example, the current implementation of Collections.sort() is single-threaded (http://java.sun.com/docs/books/tutorial/collections/algorithms/index.html). Imagine, however, if the sort algorithm were sensitive to the size of the problem and the available system resources, so that it could divide the sort up into smaller chunks and merge the results. If the complexity of the sort algorithm is O(f(n)), then dividing the problem into p pieces running on u threads results in a wall-time execution of O(f(n/p)/u + C(p)), where C(p) is the cost of dividing the problem into p pieces and merging the results. It is easy to see that as n grows, the benefits of divide and conquer increase so long as no other part of the system is inadvertently starved. Indeed, the cglib package already provides alternative parallel sorting algorithms that could serve as the basis for an improved implementation of Collections.sort() (http://cglib.sourceforge.net/apidocs/net/sf/cglib/util/ParallelSorter). Hopefully, the Java Community Process (JCP) will take on the task of parallelizing the Java class libraries.
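A minimal sketch of the divide, sort, and merge scheme described above might look like the following. It is not how Collections.sort() or cglib is implemented; the class name ChunkedSort, the use of int arrays, and the heap-based merge are assumptions made to keep the example short. The copy into chunks plus the final merge corresponds to the C(p) term in the cost estimate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: sort p chunks on u threads, then merge the sorted chunks. */
final class ChunkedSort {

    static int[] sort(int[] input, int pieces, int threads) {
        int n = input.length;
        int p = Math.max(1, Math.min(pieces, Math.max(1, n)));
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Future<int[]>> pending = new ArrayList<>();
            for (int i = 0; i < p; i++) {
                int from = (int) ((long) n * i / p);
                int to = (int) ((long) n * (i + 1) / p);
                pending.add(pool.submit(() -> {
                    int[] chunk = Arrays.copyOfRange(input, from, to);
                    Arrays.sort(chunk); // each chunk is sorted independently
                    return chunk;
                }));
            }
            List<int[]> chunks = new ArrayList<>();
            for (Future<int[]> f : pending) {
                chunks.add(f.get());
            }
            return merge(chunks, n);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** K-way merge; each heap entry is {value, chunk index, position within that chunk}. */
    static int[] merge(List<int[]> chunks, int total) {
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int c = 0; c < chunks.size(); c++) {
            if (chunks.get(c).length > 0) {
                heap.add(new int[] {chunks.get(c)[0], c, 0});
            }
        }
        int[] out = new int[total];
        int k = 0;
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            out[k++] = head[0];
            int[] chunk = chunks.get(head[1]);
            int next = head[2] + 1;
            if (next < chunk.length) {
                heap.add(new int[] {chunk[next], head[1], next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sorted = sort(new int[] {5, 3, 9, 1, 7, 2, 8}, 3, 2);
        System.out.println(Arrays.toString(sorted));
    }
}
```

For small inputs the copying and merging will likely outweigh any gain, which is why a production version would need to be sensitive to problem size, as the text suggests.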

In Java 5, we have the addition of Core XML Services for managing XML documents. Although parsing XML is not necessarily parallelizable, it would be possible to create an XML parsing and XML transformation pipeline consisting of a set of threads connected by queues. The threads in the pipeline would be a lexer, a SAX parser, and a DOM filter. In the parser, the last element in the pipeline is an application-specific processing module that would make callbacks to the application. In the transformer, the last element in the pipeline would be an XSLT engine. By running each pipeline thread on a separate core, large XML documents could be parsed and transformed in much shorter wall-time, which is important for Web services-based systems.
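The shape of such a pipeline, threads connected by bounded queues, can be sketched as follows. This is not a real XML parser: the "lexer" stage only trims whitespace and the "parser" stage only labels pre-split tokens, both stand-ins chosen for brevity. The class name QueuePipeline, the EOF sentinel, and the queue capacity of 4 are likewise assumptions. The calling thread plays the role of the application-specific module at the end of the pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hypothetical sketch: pipeline stages running on separate threads, joined by queues. */
final class QueuePipeline {
    // End-of-stream marker, compared by identity so it cannot collide with real input.
    private static final String EOF = new String("EOF");

    /** Starts a thread that transforms each item from 'in' and forwards it to 'out'. */
    static Thread stage(String name, BlockingQueue<String> in, BlockingQueue<String> out,
                        Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                for (String item = in.take(); item != EOF; item = in.take()) {
                    out.put(work.apply(item));
                }
                out.put(EOF); // pass the end-of-stream marker downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Stand-in for a parser: labels a token as a start tag, an end tag, or text. */
    static String classify(String token) {
        String name = token.replaceAll("[</>]", "");
        if (token.startsWith("</")) return "END " + name;
        if (token.startsWith("<")) return "START " + name;
        return "TEXT " + token;
    }

    /** Feeds non-null tokens through reader -> lexer -> parser and collects the events. */
    static List<String> run(List<String> rawTokens) {
        BlockingQueue<String> source = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> lexed = new ArrayBlockingQueue<>(4);
        BlockingQueue<String> events = new ArrayBlockingQueue<>(4);

        Thread reader = new Thread(() -> {
            try {
                for (String token : rawTokens) {
                    source.put(token); // blocks when the lexer falls behind
                }
                source.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "reader");
        reader.setDaemon(true);
        reader.start();

        stage("lexer", source, lexed, String::trim);
        stage("parser", lexed, events, QueuePipeline::classify);

        List<String> delivered = new ArrayList<>();
        try {
            for (String event = events.take(); event != EOF; event = events.take()) {
                delivered.add(event); // where application callbacks would fire
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the pipeline", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("  <a>", "hello ", "</a>")));
    }
}
```

Each stage is a single thread reading a FIFO queue, so event order is preserved while the stages overlap in time.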

Last, let's take a look at the internals of the JVM itself. The Hotspot VM already runs the JIT in a separate thread, which means that the VM will continue running bytecodes at the same time as it is generating native object code. However, since every method is separately compiled, JIT performance will improve when multiple threads are used by the JIT. Azul's JIT (derived from Hotspot) already takes advantage of this, and as JIT compilation gets faster, the penalty for throwing out and regenerating code will be reduced, so we will see JIT heuristics improve as well. To the extent that a JVM has to share internal data structures such as class data structures across threads, we will see JVM vendors improve overall JVM performance by further internal tuning for increased concurrency.

As we have seen, multicore systems will eventually transform the landscape of Java development to one that is significantly more scalable than what is currently available. The impact will be felt at development time, by requiring developers to be more cognizant of concurrency and starvation issues and to look for ways to increase parallelism in their applications; during load testing and tuning, by requiring developers, performance gurus, and sysadmins to be savvy about how resources are used in the application or app server; and even by the app server vendors and JVM writers, who are responsible for making sure that their technologies take as much advantage of multicore systems as possible. The great benefit of multicore systems is that they will be much more scalable than their one- to four-CPU brethren and require a smaller number of physical boxes to satisfy the same load. This will save space, energy, and management time, and match today's CIOs' goals of improving overall data center consolidation.

More Stories By Bob Pasker

Bob Pasker is the deputy CTO of Azul Systems. He has been designing and developing networking, communications, transaction processing, and database products for 25 years. As one of the founders of WebLogic, the first independent Java company (acquired by BEA Systems in 1998), he was the chief architect of the WebLogic Application Server, which today still dominates the market. Bob has provided technical leadership and management for numerous award-winning technologies, including the TribeLink series of routers and remote access devices, and the TMX transaction processing system.

Most Recent Comments
SYS-CON Belgium News Desk 01/31/06 05:17:36 PM EST

Every major chip manufacturer has delivered or announced a roadmap for multicore chips that have multiple CPUs on the same piece of silicon. Systems developers are now designing these chips into their entire product line. For Java platform developers, Symmetric Multiprocessing Systems (SMP) should be hidden well below the hardware abstraction layer, but not all applications will get equal benefits from SMP without understanding what's going on under the hood.
