SOA Data Strategy

Vital to a successful SOA transformation

Resolving Data Conflicts
Besides these scoping and profiling exercises to manage data quality, it's also imperative to resolve value-level conflicts that exist in the data. These conflicts can be categorized into three major types (C.H. Goh, "Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems," Sloan School of Management, Massachusetts Institute of Technology, 16-22, January 1997):

  • Structural and formatting conflicts: Conflicts in the formats of the data values and in the schemas used for structuring and organizing the data. One example is a type conflict, in which different data types are used to represent the same element: customer ID is stored as a double in one system and as a string in another. Another example is a labeling conflict, in which similar concepts are labeled differently, such as "supplier" versus "vendor."
  • Semantic conflicts: Conflicts in how the meanings of certain data values are interpreted. One example is a naming conflict, in which the same concept is expressed with different values. This is similar to the labeling conflict but occurs in the data value, whereas with labeling the conflict is in the label on the data structure (metadata). The difference matters because a semantic naming conflict can be harder to detect and resolve, and the detection and resolution mechanism has to be applied repeatedly across the entire set of values.
  • Intensional conflicts: Conflicts arising when consumer assumptions and expectations of data content differ from those of data producers. These conflicts are prevalent when structural representations are identical but the data domains that are encapsulated in these structures vary with the data producers. Intensional conflicts often arise when varying producers have fundamentally different conceptions of integrity constraints between related entities: cardinality, nillability, or uniqueness.
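    As a minimal sketch of the first two conflict types, the following Python normalizes records from two systems into one representation. The field names, the alias table, and the synonym table are hypothetical; in practice they would come out of the profiling exercises rather than being hard-coded.

```python
# Hypothetical mapping tables; real ones would be derived from data profiling.
FIELD_ALIASES = {"vendor": "supplier"}  # labeling conflict (in the metadata)
VALUE_SYNONYMS = {"country": {"USA": "US", "United States": "US"}}  # naming conflict (in the values)


def normalize_customer_id(raw):
    """Type conflict: return the ID as a string whether it arrives as a float or a str."""
    if isinstance(raw, float):
        if not raw.is_integer():
            raise ValueError(f"non-integral customer ID: {raw!r}")
        return str(int(raw))
    return str(raw).strip()


def normalize_record(record):
    """Apply label, type, and value normalization to one record."""
    out = {}
    for field, value in record.items():
        field = FIELD_ALIASES.get(field, field)  # resolved once per field name
        if field == "customer_id":
            value = normalize_customer_id(value)
        if isinstance(value, str):
            # Naming conflicts must be checked on every value, not once per field.
            value = VALUE_SYNONYMS.get(field, {}).get(value, value)
        out[field] = value
    return out


system_a = {"customer_id": 1042.0, "vendor": "Acme", "country": "United States"}
system_b = {"customer_id": "1042", "supplier": "Acme", "country": "USA"}
print(normalize_record(system_a) == normalize_record(system_b))
```

    The labeling fix is a single lookup per field name, while the naming fix runs against every value, which is why the semantic case is the more expensive one. Intensional conflicts are not covered here because they concern constraints between entities rather than individual values.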

    These data conflicts can often be addressed by using commercial data management tools and methodologies, as well as enterprise data modeling software. Another emerging possibility is semantics-centric modeling environments. Instead of hard-coding data cleansing routines, these tools use a semantic description of the enterprise - the business concepts and relationships between those concepts, as well as any business rules governing the relationships - and provide a mechanism to describe how legacy systems support the semantics of the enterprise. This useful abstraction lets the enterprise deterministically identify how each enterprise data asset supports the enterprise business functions, as well as any gaps between the enterprise semantic model and the underlying data representation schemes. This modeling approach can then be used to determine where physical data conflicts or duplications may exist, as well as forward engineer data consolidation and cleansing scripts.

    Data Access Controls
    In traditional application architectures, data access security is typically governed by application-specific mechanisms. In this environment, each source has its own set of users, roles, and access control policies, which means that user profiles, roles, and access control policies lack consistency across the enterprise. An SOA environment magnifies this problem by making data sources visible across the organization. It therefore becomes increasingly important to move away from individual application-specific and data source-specific mechanisms in favor of enterprise-level SOA identity management and access control mechanisms.

    This means that when creating the central data services layer, the data sources must rely on central provisioning of some security functions so they can be managed centrally. The challenge is in finding the right balance between the security functions that should be managed centrally and what should be managed as part of the data sources. There are several options in implementing such a scheme, including a centrally managed data security layer, or using layered authorization through multiple policy decision points (PDP).

    With the central management option, the data sources relinquish security and rely solely on the data services to protect the access to their data. Within each data source, a single user profile is created for the data service that has full access to the data. Any request to the data through this service is authorized through this user profile. So there's no longer a concern about whether the principal's identity from the overarching security domain exists or means anything in the data source. However, this option pushes security checks into the data service layer and reduces the granularity of accountability. As a consequence, any access control policies from the data source along with the associated roles and privileges should now be re-created and maintained at the central enterprise points.

    In contrast, layering the use of multiple policy decision points encourages the reuse of existing authorization capabilities, user profiles, and access control policies of the underlying data sources. This approach allows some of the more fine-grained access control decisions to be made at the data sources rather than elevating them into the enterprise layer. Although many variations exist for this design, the premise is that different layers of authorization with multiple PDPs are making the decisions. The basic flow of this approach is as follows: Authentication still occurs at the edge using enterprise authentication services. Requests for data originate at different security domains in the enterprise. A PDP in each of these domains evaluates requests for resources in that domain. When a data service is invoked it calls the enterprise policy decision point to authorize access to the data service as well as the specific operation requested. The data service then delegates the decision to each data source so they can authorize access to their specific data object(s). Thus, coarse-grained decisions are made at the enterprise level while finer-grained decisions use data source-specific profiles and policies that aren't exposed to the enterprise.
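    The layered flow above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a reference design: the class names, the role-based policy tables, and the `authorize` function are all hypothetical, authentication is assumed to have already happened at the edge, and a real deployment would map the enterprise identity to source-specific profiles instead of reusing the same role names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    roles: frozenset      # roles of the already-authenticated principal
    service: str
    operation: str
    data_object: str


class EnterprisePDP:
    """Coarse-grained decision: may these roles invoke this service operation?"""

    def __init__(self, policy):
        self.policy = policy  # {(service, operation): set of permitted roles}

    def permits(self, req):
        allowed = self.policy.get((req.service, req.operation), set())
        return bool(allowed & req.roles)


class DataSourcePDP:
    """Fine-grained decision made with policies local to one data source."""

    def __init__(self, policy):
        self.policy = policy  # {data_object: set of permitted roles}

    def permits(self, req):
        return bool(self.policy.get(req.data_object, set()) & req.roles)


def authorize(req, enterprise_pdp, source_pdps):
    """Enterprise check first, then delegate to every affected data source."""
    if not enterprise_pdp.permits(req):
        return False
    return all(pdp.permits(req) for pdp in source_pdps)


enterprise = EnterprisePDP({("CustomerService", "read"): {"analyst"}})
crm = DataSourcePDP({"customer.name": {"analyst"}, "customer.ssn": {"auditor"}})

analyst = frozenset({"analyst"})
print(authorize(Request(analyst, "CustomerService", "read", "customer.name"), enterprise, [crm]))
print(authorize(Request(analyst, "CustomerService", "read", "customer.ssn"), enterprise, [crm]))
```

    The enterprise PDP never sees the object-level policy held by the data source, which matches the premise that finer-grained policies are not exposed to the enterprise.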

    Data Services Architecture
    From an architectural perspective, the heart of this solution is an enterprise layer that logically centralizes access to the data spread across the enterprise. This set of logically centralized data services provides several architectural advantages. First, the enterprise can assert greater control over the governance and implementation of data access mechanisms. Second, clients use a consistent mechanism to access data. Third, the enterprise can design and implement a solution in a holistic fashion instead of the typical one-off models that are the norm in data integration. Finally, besides the basic Create, Read, Update, and Delete (CRUD) operations, the underlying architecture must also support data aggregation, inter-service transactions, and multiple access and usage patterns, all while ensuring acceptable levels of quality of service.

    Data Aggregation Scenarios
    This data services layer acts as a façade over the enterprise assets - it logically provides access to enterprise data assets in a singular manner, while physically dispatching requests and aggregations across relevant co-located assets. Three main scenarios should be considered for data aggregation:

  • The unified view of a data entity is defined by combining attributes from multiple sources. The actual data of that view is also obtained by combining data from multiple sources. The main difficulty with this aggregation scenario is linking related data from multiple systems that may not share unique identifiers. This often requires the creation of a cross-reference table to link related records.
  • The unified view of an entity is derived from the model of a single source. However, the actual data is obtained from multiple sources with different models. The main difficulty here is an understanding of de-duplication - tapping multiple systems to get a complete set of instance data can result in multiple instance records about the same thing. In this case, once duplicates are identified, which one survives to become the "golden copy"? In this model, identification and use of authoritative sources becomes important.
  • The unified view of an entity is partitioned across multiple instances of a single model. Data distribution can be the result of planned partitioning or just the ad hoc use of the same source system across multiple departments resulting in multiple instances. In case of planned partitioning, the partitioning schema can be used to optimize the performance of the data access layer, while in the case of ad hoc distribution duplicates are a problem and should be addressed through the use of authoritative data sources.
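    The first two scenarios can be illustrated with a small Python sketch. It assumes a hypothetical cross-reference table that links one enterprise ID to the local keys used by two made-up sources ("crm" and "billing"), plus a hypothetical table naming the authoritative source for an attribute that both systems hold.

```python
# Hypothetical cross-reference table: enterprise ID -> local key in each source.
XREF = {"E-1": {"crm": "C-77", "billing": 9001}}

# Two made-up sources that do not share a unique identifier.
CRM = {"C-77": {"name": "Acme Corp", "phone": "555-0100"}}
BILLING = {9001: {"name": "ACME", "balance": 250.0}}
SOURCES = {"crm": CRM, "billing": BILLING}

# Assumed survivorship rule: which source is authoritative for an attribute.
AUTHORITATIVE = {"name": "crm"}


def unified_view(enterprise_id):
    """Combine attributes from every linked source into a single view."""
    view = {"id": enterprise_id}
    for source_name, local_key in XREF[enterprise_id].items():
        record = SOURCES[source_name].get(local_key, {})
        for attr, value in record.items():
            if attr in view and AUTHORITATIVE.get(attr) != source_name:
                continue  # keep the value already chosen for this attribute
            view[attr] = value
    return view


print(unified_view("E-1"))
```

    When two sources supply the same attribute, the authoritative source wins regardless of the order in which sources are read; attributes with no declared authority keep the first value seen, which is a simplification a real survivorship policy would need to replace.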

    Some of these aggregation capabilities can be supported through Enterprise Information Integration (EII) technology, which provides SOA-centric capabilities for accessing and querying co-located data in real-time. EII products provide adapters to legacy data sources and expose their underlying data in a service-oriented fashion. EII is best used in discrete query-based mechanisms where data volumes are moderate. EII isn't meant to be a replacement for traditional ETL (extract, transform, load), EAI (enterprise application integration), or MDM (master data management) technologies. For example, some of the aggregation scenarios requiring de-duplication capabilities can require the use of MDM technologies.

    The data services layer allows creates and updates to be requested once by a client and then decomposed by the supporting architecture into individual write commands to targeted data sources. Therefore, the architecture must support transactionality - ensuring that writes are consistent so that underlying data across all affected data sources are left in a consistent state. This isn't significantly different from current data integration pains. However, most systems today requiring multi-write transaction capabilities leverage the XA standards. Similar standards for the Web Services environment are only starting to emerge. OASIS has recently formed a Web Services Transaction Technical Committee (WS-TX TC) responsible for stewarding WS-AtomicTransaction, WS-Coordination, and WS-BusinessActivity specifications through the standardization process. None of these standards have been ratified yet. Because these specifications are still being developed, most SOA-related transaction support is being custom-developed, typically through the use of homegrown compensation mechanisms - effectively an "undoing" of a previously executed service invocation. Instead of providing true rollback semantics, compensation is an additional service invocation that rewrites data to its original state. While it may be beneficial to take a wait-and-see approach to building transactionality, solutions aligned with the three specifications seeding WS-TX deliberations will likely provide the path of least resistance to standards compliance.
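    A homegrown compensation mechanism of the kind described above might look like the following Python sketch. The helper names (`make_write`, `run_with_compensation`) and the in-memory dictionaries standing in for data sources are hypothetical; the sketch does not implement WS-AtomicTransaction, WS-Coordination, or WS-BusinessActivity.

```python
def make_write(store, key, field, value):
    """Build a (do, undo) pair for one write against one data source."""
    state = {}

    def do():
        state["previous"] = store[key][field]  # remember the value for compensation
        store[key][field] = value

    def undo():
        store[key][field] = state["previous"]  # a second write, not a true rollback

    return do, undo


def run_with_compensation(steps):
    """Run each step; on failure, compensate the completed steps in reverse order."""
    compensations = []
    try:
        for do, undo in steps:
            do()
            compensations.append(undo)
    except Exception:
        for undo in reversed(compensations):
            undo()
        raise


crm = {"C-77": {"phone": "555-0100"}}


def failing_billing_write():
    raise RuntimeError("billing source unavailable")


try:
    run_with_compensation([
        make_write(crm, "C-77", "phone", "555-0199"),
        (failing_billing_write, lambda: None),
    ])
except RuntimeError:
    pass

print(crm["C-77"]["phone"])
```

    Because each compensation is just another write, other clients can observe the intermediate value before it is undone, and a compensation can itself fail. Both are reasons this approach falls short of true rollback semantics.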

    Quality of Service
    With all the data access operations going through this data services layer, a major concern is the potential bottleneck at this layer that may limit scalability. The obvious way to resolve this problem is to create a clustered environment with multiple instances of this data services layer.

    Clustering introduces complexities that depend on whether the enterprise is using a purely federated approach or has some level of data replication. With a purely federated approach, it can be simple to run a cluster with multiple instances. However, the architecture must still address the issue of affinity for a particular instance - especially in the case of inter-service transactions. The architecture must address questions such as: Are all operations that are part of a transaction forced to go to the same data service instance? Can different operations that use different data service instances still be part of the transaction?

    A simple solution is to require all operations in a single transaction to interact with a single service instance. However, this solution isn't without its disadvantages since it can affect how well the load is distributed across the cluster. With some replication, clustering becomes more difficult. In addition to the server affinity issue, the architecture must include a partitioning strategy. This strategy answers questions such as: Do all instances of the data services allow access to all the data? Or are data services partitioned so that only certain instances allow access to certain data?
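    The single-instance rule can be sketched as a small router in Python. The `AffinityRouter` class and the instance names are hypothetical, and round-robin is only one possible policy for spreading unpinned requests; the sketch does not address the partitioning questions raised above.

```python
import itertools


class AffinityRouter:
    """Pin every operation in a transaction to one data service instance."""

    def __init__(self, instances):
        self._round_robin = itertools.cycle(list(instances))
        self._pinned = {}  # transaction ID -> instance

    def route(self, transaction_id=None):
        if transaction_id is None:
            return next(self._round_robin)  # non-transactional: spread freely
        if transaction_id not in self._pinned:
            self._pinned[transaction_id] = next(self._round_robin)
        return self._pinned[transaction_id]

    def end_transaction(self, transaction_id):
        """Release the pin once the transaction commits or is compensated."""
        self._pinned.pop(transaction_id, None)


router = AffinityRouter(["ds-1", "ds-2", "ds-3"])
print(router.route("tx-a"), router.route(), router.route("tx-a"))
```

    Long-running transactions keep their instance pinned no matter how busy it becomes, which is the load distribution disadvantage noted above.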

    Data Access and Usage Patterns
    It's important to note that different applications have different data access and usage patterns. Some applications produce many transactions but access only a small amount of data in each transaction. For other applications, the transaction throughput can be small but the volume of data accessed is very large. The way to tune data source performance for these patterns is very different. When using a data services solution to provide centralized access to enterprise data sources, the enterprise must accommodate all the various access and usage patterns of the applications that will be integrated with this solution. Tuning the infrastructure to support a single application's performance requirements is complicated; tuning it to adequately support multiple patterns of use and access will be even more difficult. Often, there will be conflicting configurations - something that optimizes the performance of one application will degrade the performance of another. The enterprise should analyze and model the access and use patterns of the applications that will be using the data services and ensure that well-defined performance criteria for each scenario have been developed. Additionally, enough time should be planned for testing the performance of a particular solution with simulations that reflect the access and usage patterns that are common to the enterprise environment.

    Summary
    Harmonizing data assets has always been a challenging problem; the problems and urgency are further exacerbated when migrating to an SOA. Developing a strategy for handling this kind of transition is essential to properly enabling data access in an enterprise SOA environment. By developing appropriate requirements and use cases and by analyzing data assets and data usage, organizations can better understand the breadth and depth of their data integration issues and begin to take steps to address them. Ultimately, every organization must develop a strategy tailored to its specific needs, but the overall approach described in this article provides guidance in understanding what types of questions should be asked and how to leverage possible technology solutions to address the resulting issues that are identified. This guidance will enable organizations to fully leverage and exploit their most important strategic asset: their data.

    More Stories By Tieu Luu

    Tieu Luu works at Booz Allen Hamilton where he helps the U.S. government create and implement strategies and architectures that apply innovative technologies and approaches in IT. You can read more of Tieu’s writing at his blog at http://tieuluu.com/blog.

    More Stories By Sandeep Maripuri

    Sandeep Maripuri is an associate with Booz Allen Hamilton where he designs and implements data sharing architectures that apply service-oriented concepts. Prior to joining Booz Allen Hamilton, Sandeep held architecture and engineering positions in both large consulting firms and a commercial software startup, where he was an architect and lead engineer of one of the first commercially-available semantic data interoperability platforms.

    More Stories By Riad Assir

    Riad Assir is a senior technologist with Booz Allen Hamilton where he designs enterprise systems for commercial and government clients. Prior to Booz Allen Hamilton, Riad held senior technology positions at companies such as Thomson Financial, B2eMarkets and Manugistics, where he worked on large supply chain systems development.
