Translating XML-Based Documents

xml:tm is an XML namespace-based technology designed to substantially reduce the cost of translating XML documents while also providing the architecture for building consistent authoring systems. It is an open standard maintained by XML-Intl for the benefit of those involved in the translation of XML documents.

The advent of text in electronic format posed a number of problems for translators. These problems were:
1. How to manage the differing encoding standards and their corresponding font support and availability
2. How to present the text to translators without having to purchase additional copies of the original creation program
3. How to translate the text while preserving the formatting
4. How to build translation memories for these documents to reduce the cost of translation and improve consistency

The problem was exacerbated by the veritable "Tower of Babel" of differing authoring and composition environments, from Interleaf to PageMaker. The typical approach was to write filters that would "lift" the text to be translated from its proprietary embedded environment and to present it to translators in a uniform but equally proprietary translation environment. After translation the text would then be merged with the original document, replacing the source language text.

ISO 8879:1986 SGML
A serious attempt to tackle the plethora of competing formats and their embedded nature was made in 1986 with the advent of ISO 8879 Standard Generalized Markup Language (SGML). The aim of ISO 8879 was to separate the content of documents from their form. SGML arose at a time of great and rapid change in the IT industry. The architects attempted to make the standard as flexible and open to change as possible. This laudable aim unfortunately produced something that was very difficult and expensive to implement. In addition, SGML tackled only the aspect of content. Form was tackled by ISO/IEC 10179:1996 Document Style Semantics and Specification Language (DSSSL), but this proved equally difficult to implement.

HTML
The efforts of the ISO 8879 committee were not in vain. SGML allowed for the creation of HTML, which enabled the World Wide Web to catapult the Internet from a vehicle used by academics and computer scientists to what we know today. HTML was initially based on strict adherence to the SGML standard, but soon diverged as the limitations of ISO 8879 became apparent.

XML
By 1996 the W3C had begun to look for something better than HTML: a format that would allow for the semantic exchange of information and propel the Web from displaying only static pages toward a semantic Web built on the exchange of data. The result was XML 1.0, which addressed many of the architectural limitations of SGML and made documents easier to parse and manipulate. Among many other good features, the architects of XML introduced a powerful new concept called the "namespace," defined in the companion Namespaces in XML recommendation. XML namespaces allow more than one vocabulary, and therefore more than one representation of meaning, to be mapped onto a given document. This feature is now used extensively in supporting standards such as XSL, XSLT, and XML Schema, and in tools such as FOP.

The Future
The success of XML has been phenomenal, although much of it has yet to become visible to end users. XML has spawned much feverish activity in the developer community and has created some excellent open source tools and libraries such as those provided by the Apache Foundation and SourceForge. Even strongly proprietary companies have had to accept the importance of XML. Much excellent work is also being conducted by standards organizations, such as OASIS and the W3C, on XML-based standards like XLIFF for the translation of documents. XML is driving the future of the World Wide Web. It's providing the foundation for important future Web standards such as XML Web services, electronic data exchange, and so on.

Our premise is that the case for XML is so compelling that all leading vendors of word processing and composition systems will have to support it in the near future. In terms of translation, the arguments are even more convincing. It can be up to five times more expensive to translate and correct the layout of documents written in proprietary systems than in XML. Sun Microsystems and the OpenOffice organization already supply an excellent XML-based alternative to proprietary systems, which can also read proprietary formats such as Word and convert them to XML. Microsoft has also announced support for XML in the next version of Office.

With this view of the future in mind we've concentrated our efforts on how best to exploit the very rich syntax and capabilities of XML.

xml:tm
xml:tm radically changes the approach to the translation of XML-based documents. It is an open standard created and maintained by XML-Intl, for the benefit of those involved in the translation of XML documents.

At the core of xml:tm is the concept of "text memory." Text memory is made up of two components:

1. Author memory
2. Translation memory

Author memory
XML namespace is used to map a text memory view onto a document. This process is called segmentation. The text memory view works at the sentence level of granularity - the text unit. Each individual xml:tm text unit is allocated a unique identifier. This unique identifier is immutable for the life of the document. As a document goes through its life cycle, the unique identifiers are maintained and new ones are allocated as required. This aspect of text memory is called author memory. It can be used to build author memory systems that can be used to simplify and improve the consistency of authoring.

Figure 1 shows how the tm namespace maps onto an existing XML document.

In Figure 1 "te" stands for "text element" (an XML element that contains text) and "tu" stands for "text unit" (a single sentence or stand-alone piece of text).

Listing 1 is an example of part of an xml:tm document. The xml:tm elements are in bold type to show how xml:tm maps onto an existing XML document (listings can be found at www.sys-con.com/xml/sourcec.cfm).
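
For readers without access to the listings, the segmentation step can be sketched in a few lines of Python. This is only an illustration, not XML-Intl's implementation: the namespace URI urn:example:tm is a placeholder, the id attribute name is an assumption, and the sentence splitter is deliberately naive. Only the te and tu element names come from the description above.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder namespace URI for illustration only; the real xml:tm URI is
# defined by XML-Intl and is not reproduced here.
TM_NS = "urn:example:tm"

# Naive sentence splitter (assumption): break after ., ! or ? plus whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment(root, first_id=1):
    """Wrap the text of each leaf element in tm:te / tm:tu elements.

    Each text unit receives a sequential identifier (u1, u2, ...).
    Elements with child elements are skipped to keep the sketch short.
    """
    next_id = first_id
    # Snapshot the element list so newly added tm elements are not revisited.
    for elem in list(root.iter()):
        text = (elem.text or "").strip()
        if not text or len(elem) > 0:
            continue
        elem.text = None
        te = ET.SubElement(elem, f"{{{TM_NS}}}te")
        for sentence in SENTENCE_END.split(text):
            tu = ET.SubElement(te, f"{{{TM_NS}}}tu", {"id": f"u{next_id}"})
            tu.text = sentence
            next_id += 1
    return root
```

Calling segment() on a parsed document leaves the original elements in place and adds the tm view inside them, which is the sense in which the namespace is "mapped onto" the document.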

Translation memory
When an xml:tm namespace document is ready for translation, the namespace specifies the text that is to be translated. The tm namespace can be used to create an XLIFF document for translation.

XLIFF
XLIFF (XML Localization Interchange File Format) is an OASIS standard. XLIFF is another XML format that's optimized for translation. Using XLIFF you can protect the original document syntax from accidental corruption during the translation process. In addition, you can supply other relevant information to the translator such as translation memory and preferred terminology.

Listing 2 is an example of an XLIFF document based on the previous example. The magenta colored text signifies where the translated text will replace the source language text.
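
As a rough illustration of the extraction step, the sketch below builds a minimal XLIFF 1.2 tree from a list of text units. The function name build_xliff and its parameters are hypothetical, and the assumption that each trans-unit id simply reuses the text unit identifier is ours; the target is initialized with the source text, which the translator then overwrites.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS XLIFF 1.2 specification.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def _q(tag):
    """Return the namespace-qualified form of an XLIFF tag name."""
    return f"{{{XLIFF_NS}}}{tag}"


def build_xliff(units, original, source_lang, target_lang):
    """Build an XLIFF 1.2 element tree from (unit_id, source_text) pairs."""
    xliff = ET.Element(_q("xliff"), {"version": "1.2"})
    file_el = ET.SubElement(xliff, _q("file"), {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, _q("body"))
    for unit_id, text in units:
        unit = ET.SubElement(body, _q("trans-unit"), {"id": unit_id})
        ET.SubElement(unit, _q("source")).text = text
        # The target starts as a copy of the source text.
        ET.SubElement(unit, _q("target")).text = text
    return xliff
```

Because only the text units travel in the XLIFF file, the structure of the original document never reaches the translator and cannot be corrupted during translation.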

When the translation has been completed, the target language text can be merged with the original document to create a new target language version of that document. The net result is a perfectly aligned source and target language document.
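
The merge step might look something like the following sketch. It assumes the same placeholder namespace URI (urn:example:tm) and id attribute used for illustration elsewhere, and a hypothetical dictionary of translations keyed by text unit identifier, as could be read back from the translated XLIFF file.

```python
import copy
import xml.etree.ElementTree as ET

# Placeholder namespace URI for illustration only.
TM_NS = "urn:example:tm"


def merge_translations(source_root, translations):
    """Return a target-language copy of a segmented document.

    translations maps text unit identifiers to translated text.
    Units without a translation keep their source text.
    """
    target_root = copy.deepcopy(source_root)  # leave the source untouched
    for tu in target_root.iter(f"{{{TM_NS}}}tu"):
        unit_id = tu.get("id")
        if unit_id in translations:
            tu.text = translations[unit_id]
    return target_root
```

Since the target document is a structural copy of the source, every text unit in one has a counterpart with the same identifier in the other, which is what makes the pair aligned.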

Listing 3 is an example of a translated xml:tm document. Figure 2 is an example of the text for the source.

The source and target text are linked at the sentence level by the unique xml:tm identifiers. When the document is revised, new identifiers are allocated to modified or new text units. When extracting text for translation of the updated source document, the text units that have not changed can be automatically replaced with the target language text. The resultant XLIFF file will look like Listing 4.
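
The revision cycle can be sketched as two small functions. This is a simplification under stated assumptions: unchanged text units are recognized by exact text comparison (a real author memory system would presumably track edits as they happen), identifiers take the form u1, u2, and so on, and the function names are hypothetical.

```python
def update_author_memory(previous, revised_texts, next_id):
    """Carry identifiers forward from one document version to the next.

    previous      -- list of (unit_id, text) from the earlier version
    revised_texts -- list of text unit strings in the revised version
    next_id       -- first unused identifier number
    Returns (units, next_id), where units is a list of (unit_id, text).
    """
    available = {}
    for unit_id, text in previous:
        available.setdefault(text, []).append(unit_id)
    units = []
    for text in revised_texts:
        ids = available.get(text)
        if ids:
            units.append((ids.pop(0), text))  # unchanged: keep the identifier
        else:
            units.append((f"u{next_id}", text))  # new or modified: new identifier
            next_id += 1
    return units, next_id


def prefill_perfect_matches(units, translations):
    """Split units into perfect matches and units that need translating."""
    matched, todo = {}, []
    for unit_id, text in units:
        if unit_id in translations:
            matched[unit_id] = translations[unit_id]
        else:
            todo.append((unit_id, text))
    return matched, todo
```

Only the units returned in todo need to go to the translator; the matched units already carry their target language text.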

Different Types of Matching
The matching described in the previous section is called "perfect" matching. xml:tm offers unique translation memory matching possibilities to reduce the quantity of text for translation and provide the human translator with suggested alternative translations. Figure 3 shows how perfect matching is achieved.

The following types of matching are available:
1. Perfect matching: Author memory provides exact details of any changes to a document. Where text units have not been changed for a previously translated document we can say that we have a "perfect match." The concept of perfect matching is an important one. With traditional translation memory systems a translator still has to proof each match, as there is no way to ascertain the appropriateness of the match. Proofing has to be paid for - typically at 60% of the standard translation cost. With perfect matching there is no need to proofread, thereby saving on the cost of translation.
2. Leveraged matching: When an xml:tm document is translated, the translation process provides perfectly aligned source and target language text units. These can be used to create traditional translation memories, but in a consistent and automatic fashion.
3. In-document leveraged matching: xml:tm can also be used to find in-document leveraged matches, which will be more appropriate to a given document than normal translation memory leveraged matches.
4. In-document fuzzy matching: During the maintenance of author memory a note can be made of text units that have changed only slightly. If a corresponding translation exists for the previous version of the source text unit, then the previous source and target versions can be offered to the translator as a type of close fuzzy match.
5. Non-translatable text: In technical documents you can often find a large number of text units that are made up solely of numeric, alphanumeric, punctuation, or measurement items. With xml:tm these can be identified during authoring and flagged as non-translatable, thus reducing the word counts. For numeric and measurement-only text units it is also possible to automatically convert the decimal and thousands designators as required by the target language.
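
In-document leveraged and fuzzy matching could be approximated along the following lines. The sketch uses Python's standard difflib similarity ratio as a stand-in for whatever comparison an actual xml:tm implementation uses, and the 0.75 threshold and the function name are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def suggest_matches(text, previous_units, translations, threshold=0.75):
    """Suggest earlier translations from the same document for a text unit.

    previous_units -- list of (unit_id, source_text) from the prior version
    translations   -- dict mapping unit_id to its existing translation
    Returns a list of (score, kind, previous_source, previous_target),
    best match first.
    """
    suggestions = []
    for unit_id, prev_text in previous_units:
        target = translations.get(unit_id)
        if target is None:
            continue  # nothing to offer without an existing translation
        score = SequenceMatcher(None, prev_text, text).ratio()
        if score == 1.0:
            kind = "in-document leveraged"
        elif score >= threshold:
            kind = "in-document fuzzy"
        else:
            continue
        suggestions.append((score, kind, prev_text, target))
    suggestions.sort(key=lambda s: s[0], reverse=True)
    return suggestions
```

A translator working on "Close the lid firmly." would then be shown the earlier source "Close the lid." together with its translation as a starting point.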

Listing 5 is an example of non-translatable text in xml:tm. An example of the composed text is shown in Figure 4.
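
A simple way to detect such text units, and to convert number formats for the target language, is sketched below. The token patterns, the list of unit symbols, and the assumption that the source uses English-style separators (comma for thousands, period for decimals) are all illustrative choices, not part of xml:tm itself.

```python
import re

# Assumed list of measurement symbols; a real system would need a fuller one.
UNITS = {"mm", "cm", "m", "kg", "g", "V", "A", "W", "Hz"}

NUMBER = re.compile(r"^[+-]?\d[\d.,]*$")              # 12, 12.5, 1,250.75
PUNCT = re.compile(r"^[^\w\s]+$")                      # ---, ..., /
ALNUM_CODE = re.compile(r"^(?=.*\d)[A-Za-z0-9_/-]+$")  # AB-1234, 10mm


def is_non_translatable(text):
    """True if every token is a number, punctuation, unit, or code."""
    tokens = text.split()
    if not tokens:
        return False
    return all(
        NUMBER.match(t) or PUNCT.match(t) or t in UNITS or ALNUM_CODE.match(t)
        for t in tokens
    )


def localize_numbers(text, decimal=",", thousands="."):
    """Swap decimal and thousands designators in purely numeric tokens.

    Assumes the source uses "." for decimals and "," for thousands.
    Whitespace between tokens is normalized to single spaces.
    """
    table = str.maketrans({".": decimal, ",": thousands})
    return " ".join(
        t.translate(table) if NUMBER.match(t) else t for t in text.split()
    )
```

With the default arguments, localize_numbers("1,250.75 kg") yields "1.250,75 kg", the convention used in German, for example.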

Word Counts
The output from the text extraction process can be used by the customer to generate word and match counts automatically. This puts the customer in control of the word counts.
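
A customer-side count could be produced with something as simple as the sketch below. Words are counted by splitting on whitespace, which is a simplification, and the weighting table is hypothetical apart from the 60% proofing rate for traditional leveraged matches quoted earlier.

```python
from collections import Counter

# Percentage of the full per-word rate charged for each match type.
# Only the 60% proofing figure comes from the article; the rest are assumptions.
RATE_PERCENT = {"perfect": 0, "leveraged": 60, "new": 100}


def word_counts(units):
    """Count words per match type from (match_type, source_text) pairs."""
    counts = Counter()
    for match_type, text in units:
        counts[match_type] += len(text.split())
    return counts


def weighted_words(counts, rate_percent=RATE_PERCENT):
    """Return the number of full-rate words the counts are equivalent to."""
    return sum(counts[kind] * rate_percent[kind] for kind in counts) / 100
```

Multiplying the weighted figure by the agreed per-word rate gives an estimate the customer can check against a supplier's quote.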

XLIFF and Online Translation
XLIFF is an OASIS standard for the interchange of translatable text in XML format. xml:tm translatable files can be created in XLIFF format, and the XLIFF files can then be used to create dynamic Web pages for translation. A translator can access these pages via a browser and carry out the whole translation process over the Internet. This has many potential benefits. The problems associated with running filters and sending data out for translation, such as delays, inadvertent corruption of character encoding or document syntax, and simple human workflow errors, can be avoided. Using XML technology it's now possible to reduce and control the cost of translation, shorten the time it takes, and improve reliability.

An example of a Web-based translator environment can be seen at www.xml-intl.com/demo/trans.html.

Benefits of Using xml:tm
The following is a list of the main benefits of using the xml:tm approach to authoring and translation:
1. The ability to build consistent authoring systems
2. Automatic production of authoring statistics
3. Automatic alignment of source and target text
4. Perfect translation matching for unchanged text units
5. In-document leveraged and modified text unit matching
6. Automatic production of word count statistics
7. Automatic generation of perfect, leveraged, previously modified, or fuzzy matching
8. Automatic generation of XLIFF files
9. Protection of the original document structure
10. The ability to provide online access for translators
11. Transparent support for relay translation

Figure 5 shows a traditional translation scenario, and Figure 6 shows an xml:tm translation scenario.

Summary
xml:tm is an open standard created and maintained by XML-Intl based on XML and XLIFF. Full details of the xml:tm definitions (XML Document Type Definition and XML Schema) are available from the XML-Intl Web site (www.xml-intl.com). XML-Intl also supplies an implementation of xml:tm using Java and Oracle.

xml:tm is best suited for enterprise-level implementation for corporations with a large annual translation requirement and a content management system. During the implementation process xml:tm is integrated with the customer's content management system. xml:tm reduces translation costs in the following ways:
1. Translation memory is held by the customer within the documents.
2. Perfect matching reduces translation costs by eliminating the need for translators to proof these matches.
3. Translation memory matching is much more focused than is the case with traditional TM systems, providing better results.
4. It allows for relay translation memory processing via an intermediate language.
5. All TM, extraction, and merge processing is automatic; there is no need for manual intervention.
6. Translation can take place directly via the customer's Web site.
7. All word counts are controlled by the customer.
8. The original XML documents are protected from accidental damage.
9. The system is totally integrated into the XML framework, making maximum use of the capabilities of XML to address authoring and translation.

References

  • The Apache Foundation: http://xml.apache.org
  • SourceForge: www.sourceforge.net
  • OASIS: www.oasis-open.org
  • W3C: www.w3c.org
  • Sun Microsystems: www.sun.com
  • OpenOffice: www.openoffice.org

About the Author

Andrzej Zydron is a Member of the British Computer Society. He is also technical and research director of XML-Intl Ltd. and sits on the OASIS technical committee for Translation Web Services.
