When Exceptions Are the Rule

Every now and then, an IT glitch makes national news. Just a few weeks ago, I read in the paper about an airline that mistakenly sold thousands of roundtrip tickets online at a fare of just a few dollars each. The airline lost hundreds of thousands of dollars from the mistake, though that's really just the tip of the iceberg. The National Institute of Standards and Technology (NIST) estimates that application errors cost the U.S. economy $59.5 billion per year. Because nearly 80 percent of such errors are discovered after applications have been put into production, exceptions also have a significant impact on the productivity and effectiveness of your IT staff and production support teams. And that's to say nothing of forgone revenue due to poor customer service.

We call these unexpected conditions "exceptions," though they happen all the time. They are as unavoidable as they are harmful.

Unlike my highly publicized airline example, the people who know about exceptions are usually limited to customers, IT teams, and line-of-business managers - and they typically find out about exceptions in that order. It's the customer or the end user of an SOA-based business application who is usually the first to witness the consequences of an exception. Common symptoms include opaque messages (such as "Sorry, unable to process request at this time") on Web sites. Such seemingly mild errors eventually translate into more disruptive business exceptions such as delayed orders, lost packages, rejected insurance claims, and so on.

Due to their distributed and heterogeneous nature, services-based systems are inherently vulnerable to exceptions. Exceptions in SOA environments can be broadly categorized into three classes:

SOA: A Haven for Exceptions
Any developer will tell you that 80 percent of the time required to diagnose an exception is spent simply trying to replicate the scenario. That's due to all the effort of searching through log files and iteratively recoding to add more information to the log. With distributed SOA, possibly stretched across geographies, this task is even more challenging.

Managing exceptions has traditionally been an expensive, highly manual effort, often performed by dedicated application maintenance teams. Because the clues to an exception reside in multiple messages that span different services in the business application, exceptions in SOA systems are even harder to detect and diagnose. To begin with, the applications themselves are seldom instrumented to proactively alert on specific exceptions. At best, an application surfaces exceptions as incoherent entries in an error log, such as "Error 00021C: Transaction rejected." These exceptions might be uncovered during routine maintenance of the application. However, as noted earlier, it's the phone calls from vexed customers that usually make IT staff aware of exceptions. Business operations teams seldom come to know about exceptions until it is too late to respond.

So what are you to do about it? You could just shrug your shoulders, accept that exceptions are going to happen, and hope you're not next week's headline news. Or you could look for ways to detect, diagnose, and remedy exceptions before they bring your business to a standstill. Let's explore ways to go about the latter option.

Managing Exceptions in Services-Based Systems
To understand how to manage exceptions in SOA-based environments, start by considering the requisite capabilities. Exceptions must be detected as they occur. To do so, IT and business operations must be able to specify the criteria for spotting exceptions in live business transactions. Typically, you'd look for message patterns that indicate unusual business activity. These may include incongruent reference data, discrepancies in data fields, error messages, and error codes. Sometimes, criteria can be crafted for detecting very specific conditions - for example "raise an exception if a premier customer's order is rejected due to mainframe error code D234200." Other times it's the absence of a message or pattern that's the symptom of an exception.
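To make the idea of detection criteria concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular product: the `Message` fields, the `Rule` type, and the `detect` and `overdue` functions are all hypothetical names invented for this example. The first rule encodes the premier-customer criterion quoted above; the `overdue` function covers the case where the symptom is a message that never arrives.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    """A simplified business message; every field name here is hypothetical."""
    transaction_id: str
    customer_tier: str
    status: str
    error_code: Optional[str] = None


@dataclass
class Rule:
    """A named detection criterion expressed as a predicate over a message."""
    name: str
    matches: Callable[[Message], bool]


# The example criterion from the text: a premier customer's order is
# rejected with mainframe error code D234200.
PREMIER_ORDER_REJECTED = Rule(
    name="premier-order-rejected-D234200",
    matches=lambda m: (
        m.customer_tier == "premier"
        and m.status == "rejected"
        and m.error_code == "D234200"
    ),
)


def detect(message: Message, rules: List[Rule]) -> List[str]:
    """Return the names of all rules that the message satisfies."""
    return [rule.name for rule in rules if rule.matches(message)]


def overdue(pending: Dict[str, float], now: float, timeout_s: float) -> List[str]:
    """Detect absence: transactions whose expected follow-up never arrived.

    `pending` maps a transaction id to the time (in seconds) its first
    message was seen; ids are assumed to be removed elsewhere once the
    follow-up message arrives.
    """
    return sorted(tid for tid, seen in pending.items() if now - seen > timeout_s)
```

Keeping the criteria as data (a list of rules) rather than as logic buried inside each service is what would let operations teams add or change a criterion without touching the applications themselves.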

Since it's impossible to anticipate all patterns, operations teams need to cast as wide a net as possible across their business systems to trap the maximum number of exceptions. As a fallback, they must be able to trace and record all distributed transactions and diagnose this data for the root causes of exceptions.
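As a rough sketch of this fallback, the Python below records every observed message against a transaction identifier so the full path of a transaction can be replayed later. It assumes each message carries a correlation key; the `Observation` and `TraceRecorder` names, and the in-memory storage, are hypothetical simplifications rather than a recommended design.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Observation:
    """One message seen in flight; field names are hypothetical."""
    transaction_id: str  # assumed correlation key carried by every message
    timestamp: float     # seconds, as recorded by the observer
    service: str         # the service that sent or received the message
    summary: str         # short description of the message content


class TraceRecorder:
    """Keeps every observation, grouped by the transaction it belongs to."""

    def __init__(self) -> None:
        self._by_transaction: Dict[str, List[Observation]] = defaultdict(list)

    def record(self, obs: Observation) -> None:
        self._by_transaction[obs.transaction_id].append(obs)

    def trace(self, transaction_id: str) -> List[Observation]:
        """Return the recorded steps of one transaction in time order."""
        steps = self._by_transaction.get(transaction_id, [])
        return sorted(steps, key=lambda o: o.timestamp)
```

A production system would need durable storage and would have to cope with clock skew between servers; the point here is only that grouping by transaction turns scattered messages into a single story that can be examined for a root cause.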

IT and business teams must know about exceptions immediately. It might be important to alert one or more individuals across different teams based on the nature of the exception. For example, the error code is of interest to the IT staff, while the rejected purchase order and the customer details are important to business types.
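One way to picture this is a routing table that gives each audience its own view of the same exception. The Python sketch below is illustrative only; the audience names, field names, and the `build_alerts` function are assumptions made for this example.

```python
from typing import Dict, Tuple

# Hypothetical mapping from each audience to the fields it cares about.
AUDIENCE_FIELDS: Dict[str, Tuple[str, ...]] = {
    "it_operations": ("error_code", "service"),
    "business_operations": ("purchase_order", "customer"),
}


def build_alerts(exception: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Split one exception record into a tailored alert per audience.

    An audience is alerted only if at least one of its fields is present.
    """
    alerts: Dict[str, Dict[str, str]] = {}
    for audience, fields in AUDIENCE_FIELDS.items():
        view = {name: exception[name] for name in fields if name in exception}
        if view:
            alerts[audience] = view
    return alerts
```

Given an exception that carries an error code, a service name, a purchase order, and a customer, IT operations would see only the first two and business operations only the last two.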

The notified personnel must then be able to quickly analyze the situation, understand its cause, and implement a cure. To accomplish this, they need to know not only the exception message pattern but also the context of the business transaction in which it occurred. IT operations must be able to diagnose and resolve the exception in seconds or minutes instead of hours or days. Similarly, business operations must be able to learn about exceptions in real time in order to formulate a resolution before customer service is affected.

For some exceptions, the resolution is clear. In such cases it's important to resolve the exception in-flight by applying automated exception-handling actions.
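A simple way to think about automated handling is a lookup from a known kind of exception to a corrective action, with escalation to people as the default. The following Python sketch assumes exactly that; the exception kinds, the `resubmit_to_backup` action, and the `resolve` function are hypothetical, and the action is a placeholder that only reports what a real handler would do.

```python
from typing import Callable, Dict

Handler = Callable[[Dict[str, str]], str]


def resubmit_to_backup(exc: Dict[str, str]) -> str:
    """Placeholder action; a real handler would call a backup service."""
    return f"resubmitted {exc['transaction_id']} to backup"


# Hypothetical registry: only exceptions with a clear resolution appear here.
HANDLERS: Dict[str, Handler] = {
    "primary-unavailable": resubmit_to_backup,
}


def resolve(exc: Dict[str, str], handlers: Dict[str, Handler] = HANDLERS) -> str:
    """Apply the registered action, or escalate when none is known."""
    handler = handlers.get(exc["kind"])
    if handler is None:
        return f"escalated {exc['transaction_id']} to operations"
    return handler(exc)
```

Exceptions without a registered action fall through to escalation, so automation never silently swallows a condition nobody anticipated.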

Why Traditional Approaches Don't Work for SOA
Programmatic exception-handling models have been the mainstay of exception management in business applications. The compilation stage detects and eliminates syntactic errors. Anticipated anomalous business conditions are detected and handled via embedded logic, either in the application source code or in the business process driving the application. Business Process Management (BPM) systems often handle exceptions in process definitions by hardwiring the process definition with corrective actions for a well-defined set of exceptions that might occur while executing the process. Unanticipated conditions and process exits are handled by writing the condition to a log file.

Debugging and testing practices aim to isolate and eliminate logical errors. Quality assurance teams spend countless hours putting the software through scripted production simulations. Then it's up to the consumers of the production systems to report any exceptions to the technical support organization. IT operations staff members depend on applications and system logs to diagnose problems reported by customers. Patches are applied to applications if problems are deemed severe. Additionally, Network Systems Management (NSM) software is used to isolate runtime failures in the hardware or in elements of the physical layer and to trigger alerts.

It's important to note, however, that these approaches are based on the following assumptions:

The problem is, SOA-based applications are distributed, heterogeneous, and federated. Specifically:

Traditional exception-handling approaches rely on the developer's ability to anticipate all possible exceptional conditions and embed custom instrumentation (as code) into the application to detect and handle them appropriately. Yet it's impossible for anyone to imagine all of the types of interactions an end user might have with the services of an SOA-based application, let alone write code to prevent every possible runtime glitch.

Coding-based approaches are inapplicable to services you don't own or don't control. Such hardwiring also means that any change or update to the exception-handling capabilities already fabricated within an application requires programmatic changes to the application. IT is unable to respond in a timely fashion to the vagaries of a real-time business environment. Business requirements remain unmet, while IT maintenance overheads rise.

Distributed systems introduce a new order of complexity in detecting exceptional conditions. Since the custom instrumentation is isolated to the service into which it has been programmed, it has no context for the actual business transaction in which it is participating. Exceptional conditions that are related to business transactions spanning one or more Web services across different business applications cannot be detected by exception handling localized within one of the participants. This is particularly true of exceptions that materialize as business events.

What's more, traditional approaches rely on the developer, leading to resource-intensive IT fire drills. To the IT staff, exceptions appear as entries in an error log that simply state something along the lines of "Error 00021C: Transaction rejected." The amount of information available to diagnose the problem is limited, and is often solely dependent on whether or not the developer had been meticulous about exporting all of the available information to an application "log." There is no way to know how a Web service was consumed, when the exception occurred, and in what context. Also, there's no easy mechanism for the IT staff to capture the flow of business transactions end-to-end within the system, as they occur.

In a distributed environment, log entries are likely to be collocated with the Web service in question. Thus, diagnostic information is fragmented across one or more logs, potentially at geographically different locations, or at least on different servers within the same organization. Typically, IT operations teams spend hours sifting through different logs, trying to manually assemble the pieces of the puzzle.

Exception Management Befitting SOA
For dynamic, service-oriented business environments to work, runtime exceptions must be detected and intercepted in-flight and localized resolutions must be executed to either eliminate these exceptions or to mitigate their effect on the rest of the system. The question becomes how to accomplish this. What's needed is a dedicated piece of infrastructure (whether you build it yourself or purchase it as a dedicated solution) that allows you to detect, diagnose, and resolve exceptions. This is the Exception Manager.

The Exception Manager spans a network of services, and each service serves multiple applications within the business environment. It is chartered to:

The Exception Manager has the ability to detect both software errors (system or application errors) and business events in services-based systems and then act upon these conditions in real time.

Abstracting exception detection and handling away from the application itself creates a rapidly adaptive business system. It also allows you to deal with exceptions in system components that are beyond your direct control, such as a partner's Web service.

Now let's take a look at how an Exception Management System might be implemented.

SOA offers a standards-rich environment that enables "intermediaries" to gain runtime insight into the behavior of an application at both the system level (such as message sequences and formats) and the business level (such as order number or order amount), via machine-readable XML message content. These intermediaries can also participate in services interactions in real time. A combination of these capabilities can allow "intermediaries" to facilitate intelligent management of exceptions.

Intermediaries can take many forms. An autonomous software agent is probably the best suited to the role of an intermediary. It is able to intercept the live messages, understand their content and context, and initiate appropriate management activities upon detecting exceptions. A collection of such intermediaries can record interactions as well as detect and act upon exceptions in applications spanning multiple services, thus laying the foundation for a comprehensive Exception Management System.
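The Python sketch below shows, under heavy simplification, what such an intermediary might look like. It wraps a call to a downstream service, passes the request and response through untouched, and inspects the XML on the way. The element names (`orderNumber`, `orderAmount`, `errorCode`), the `Intermediary` class, and the idea of a plain function standing in for the service are all assumptions made for illustration; real messages would typically be SOAP envelopes with namespaces.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional


class Intermediary:
    """Sits between a consumer and a service and observes the XML traffic."""

    def __init__(self, forward: Callable[[str], str]) -> None:
        self._forward = forward  # stands in for the real downstream service
        self.detected: List[Dict[str, Optional[str]]] = []

    def handle(self, request_xml: str) -> str:
        """Forward the request unchanged and inspect the exchange."""
        response_xml = self._forward(request_xml)
        self._inspect(request_xml, response_xml)
        return response_xml

    def _inspect(self, request_xml: str, response_xml: str) -> None:
        try:
            request = ET.fromstring(request_xml)
            response = ET.fromstring(response_xml)
        except ET.ParseError:
            return  # never disrupt the live interaction over unreadable content
        error_code = response.findtext("errorCode")
        if error_code:
            # Pair the system-level symptom with its business-level context.
            self.detected.append({
                "order_number": request.findtext("orderNumber"),
                "order_amount": request.findtext("orderAmount"),
                "error_code": error_code,
            })
```

Note that neither the consumer nor the service changes: the intermediary is inserted into the path of the message, which is what makes this approach usable with services you do not own.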

Such an intermediary-based system offers many advantages over traditional approaches for managing exceptions. Unlike a proprietary application execution server or message bus, the agents do not require control of the environment. Furthermore, this type of "on-demand" participation can be administered into this environment without writing new code or altering existing services and applications. Finally, the intermediaries are able to listen for system errors and business events generated by incumbent IT systems while allowing the delegation of appropriate system-level tasks or business activities to these systems via SOA standards.

An effective Exception Management System quickly repays the time and expense of implementing it. The maintenance overhead saved when detecting, diagnosing, and resolving the multitude of exceptions that can occur is significant. Business systems become more dynamic, thereby enabling your organization to more quickly and reliably adapt to narrow windows of market opportunity. Customers are better served, and your systems are less likely to lose your company revenue or gain your department coverage on CNN due to an unanticipated glitch.

© 2008 SYS-CON Media Inc.