i-Technology Viewpoint: The Performance Woe of Binary XML

XML Overhead – The Scapegoat's Story

[This article was our best-read XML-related article in 2006.]

Since its inception, XML has been criticized for the overhead it introduces into the enterprise infrastructure. Business data encoded in XML takes five to 10 times more bandwidth to transmit over the network and proportionally more disk space to store. While most agree that verbosity is inherent to XML's way of encoding information (e.g., extensive use of tags and pointy brackets), the explanation of XML's perceived performance issue remains inconclusive. A popular belief is that since XML is human-readable text, it has to be slow and inefficient. By the same token, proponents of binary XML seem to suggest that a compact encoding format, most notably binary XML, would automatically lead to better processing performance.

Does it make sense for doctors to prescribe medicine without a diagnosis? Whether those perceptions and beliefs have a grain of truth or not, one thing is certain: without a solid understanding of XML's performance issue, it will be difficult, if not impossible, to devise meaningful solutions. So in this article, I'll attempt to dissect XML's performance issue by focusing on three key questions: (1) Does XML have a performance issue? (2) What is the real culprit behind XML's slow performance? (3) Can binary XML fundamentally solve the problem?

Performance Is Not an XML Issue Per Se
In networking system design, the OSI (Open Systems Interconnection) stack is the standard model that divides the functions of the network into seven layers. Each layer only uses the layer below and only exports functionality to the layer above. Compared with monolithic approaches, the advantages of OSI's layered approach are the robustness, resilience, and extensibility of the networking system in the face of rapid technology evolution. For example, any Voice over IP application will work without knowing the physical layer of the network (e.g., copper, fiber cable, or Wi-Fi) or the data link layer (e.g., Ethernet, Frame Relay, or ATM).

Likewise, we can take a similar layered approach to modeling XML-based applications. Figure 1 is a simplified view of this "XML protocol stack" consisting of three layers: the XML data layer, the XML parsing layer, and the application layer. The application layer only uses functions exported by the XML parsing layer, which translates the data from its physical representation (XML) into its logical representation (the infoset).

Several observations can be made concerning the perceived performance issue of XML. First and foremost, because an XML application can only go as fast as the XML parser can process XML messages, performance is actually not an issue of XML per se, but an issue of the XML parsing layer. If an XML routing application (assuming minimal overhead at the application layer) can't keep pace with incoming XML messages at a gigabit per second, it's most likely because the throughput of XML parsing is less than that of the network. Consequently, the correct way to boost application performance is to optimize the XML parsing layer. Just as when tuning any other software, the best way to do that is to discover the overhead in XML parsers and then find ways to reduce or eliminate it.

Object Allocation - the Real Culprit
To get a feel for the performance offered by current XML parsing technologies, I benchmarked the parsing throughput of two widely used XML parsers - Xerces DOM and Xerces SAX. The benchmark applications are quite simple. They first read an XML document into memory, then invoke the parser routines to parse the document for a large number of iterations; the parsing throughput is calculated by dividing the file size by the average latency of each parsing iteration. For SAX parsing, the content handler is set to NULL. Several XML documents of varying sizes, tagginess, and structural complexity were chosen for the benchmark. The results, produced on a two-year-old 1.7GHz Pentium M laptop, are summarized in Figure 2. The complete report, which includes the test setup, methodology, and code, is available online at http://vtd-xml.sf.net/benchmark.html.
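For readers who want to reproduce the idea, here is a minimal sketch of such a benchmark loop written against the standard JAXP factories (which Xerces plugs into). It is not the harness behind Figure 2: the file name, iteration count, and the no-op DefaultHandler (standing in for a null content handler) are illustrative assumptions, and a serious measurement would also add warm-up iterations.

import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseBenchmark {
    public static void main(String[] args) throws Exception {
        byte[] doc = Files.readAllBytes(Paths.get("test.xml")); // document read into memory once
        int iterations = 1000;                                  // placeholder iteration count

        // SAX: a do-nothing handler, so the loop measures raw tokenization throughput
        SAXParserFactory spf = SAXParserFactory.newInstance();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            spf.newSAXParser().parse(new ByteArrayInputStream(doc), new DefaultHandler());
        }
        report("SAX", doc.length, iterations, System.nanoTime() - start);

        // DOM: the same loop, but the parser also builds the in-memory tree
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            dbf.newDocumentBuilder().parse(new ByteArrayInputStream(doc));
        }
        report("DOM", doc.length, iterations, System.nanoTime() - start);
    }

    // throughput = file size / average latency of one parsing iteration
    static void report(String label, long bytes, int iterations, long totalNanos) {
        double avgSeconds = (totalNanos / 1e9) / iterations;
        double mbPerSec = (bytes / (1024.0 * 1024.0)) / avgSeconds;
        System.out.printf("%s: %.1f MB/sec%n", label, mbPerSec);
    }
}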

The benchmark results are quite consistent with the well-known performance characteristics of DOM and SAX. First of all, the raw parsing throughput of SAX, at 20MB/sec to 30MB/sec, is actually quite respectable. However, by not exposing the inherent structural information, SAX essentially treats XML as CSV with "pointy brackets," often making it prohibitively difficult to use and unsuitable as a general-purpose XML parser. DOM, on the other hand, lets developers navigate an in-memory tree structure. But the benchmark results also show that, except for very small files, DOM is typically three to five times slower than SAX. Because DOM parsers usually use SAX internally to tokenize XML, the performance gap makes it clear that building the in-memory tree structure is where the bottleneck is. In other words, allocating all the objects and connecting them together dramatically slows everything down. As the next step, I ran JProfiler (from ej-technologies) to identify where DOM and SAX parsing spend their CPU cycles. The results confirmed my earlier suspicion: the overhead of object allocation overwhelmingly bottlenecks DOM parsing and, to a lesser (but still significant) degree, SAX parsing as well.
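To make the object-allocation point concrete, the following sketch (the file name is a placeholder) walks a freshly parsed DOM tree and counts the Node objects it holds. Every element, attribute, text node, and comment is a separate object that had to be allocated and wired into the tree during parsing.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class NodeCount {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("test.xml"); // placeholder file name
        System.out.println("DOM nodes allocated: " + count(doc));
    }

    // Every element, attribute, text node, comment, etc. is a distinct object in the tree
    static long count(Node node) {
        long total = 1;
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
            total += attrs.getLength();
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            total += count(child);
        }
        return total;
    }
}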

Some readers may object that DOM is only an API specification and, as such, doesn't preclude efficient, less object-intensive implementations. Not so. The DOM spec is in fact based entirely on the assumption that the hierarchical structure consists of objects exposing the Node interface. The most any DOM implementation can do is alter the implementation of the object sitting behind the Node interface; it's impossible to rip away the objects altogether. So, if object creation is the main culprit, the DOM spec itself is the accomplice that makes any performance-oriented optimization prohibitively difficult. And this is why, over the past eight years and despite countless efforts by all the major IT companies, every DOM implementation has seen only marginal performance improvement.

Binary XML Solves the Wrong Problem
Can binary XML fundamentally solve the performance issue of XML? Well, since the performance issue belongs to XML parsers, a better question to ask is whether binary XML can help make parsing more efficient. The answer has two parts - one for SAX, the other for DOM.

Binary XML can improve raw SAX parsing speed. There are a lot of XML-specific syntax features that binary XML can choose not to inherit. For example, binary XML can replace end tags with something more efficient, encode attributes in a way that avoids uniqueness checking during SAX parsing, or find other ways to represent the document structure. There are many ways to trim CPU cycles off SAX's tokenization cost. And proponents of binary XML often cite up to a 10x speedup for the binary XML version of SAX over text XML.

But this ignores the simple fact that SAX has serious usability issues. The awkward forward-only nature of SAX parsing not only requires extra implementation effort, but also incurs performance penalties when the document structure becomes even slightly complex. If developers choose not to scan the document multiple times, they have to buffer the document or build custom object models. In addition, SAX doesn't work well with XPath and, in general, can't drive XSLT processing (binary XML has to be transformed as well, right?). So pointing to SAX's raw performance for binary XML as proof of binary XML's merit is both unfair and misleading. The bottom line: for an XML processing model to be broadly useful, the API must expose the inherent structure of XML.
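The contrast is easy to see in code. In the sketch below, the order.xml file, element names, and XPath expression are made-up examples: extracting a single value with SAX means hand-writing a small state machine, while an API that exposes the structure reduces the same task to one XPath call.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class UsabilityContrast {
    public static void main(String[] args) throws Exception {
        // SAX: forward-only, so even a simple lookup needs explicit state tracking
        class PriceHandler extends DefaultHandler {
            boolean inPrice;
            StringBuilder price = new StringBuilder();
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("price")) inPrice = true;   // hypothetical element name
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (inPrice) price.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (qName.equals("price")) inPrice = false;
            }
        }
        PriceHandler handler = new PriceHandler();
        SAXParserFactory.newInstance().newSAXParser().parse("order.xml", handler);
        System.out.println("SAX:   " + handler.price);

        // DOM + XPath: the hierarchical structure is exposed, so the query is one line
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse("order.xml");
        String price = XPathFactory.newInstance().newXPath().evaluate("/order/item/price", doc);
        System.out.println("XPath: " + price);
    }
}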

Unfortunately, binary XML won't make much difference in DOM parsing for the simple reason that DOM parsing generally spends most of its CPU cycles building the in-memory tree structure, not on tokenization. So the speedup of DOM due to faster SAX parsing is quite limited. In other words, DOM parsing for binary XML will be slow as well. Going one step further, an object-graph-based parser will have the same kind of performance issue for virtually any data format, be it DCOM, RMI, or CORBA. XML is merely the scapegoat.

Reduce Object Creation - the Correct Approach
From my benchmark and profiling results, it's quite easy to see that the best way to remove the performance bottleneck in XML parsing is to reduce the object-creation cost of building the in-memory hierarchical structure. The possibilities are, in fact, endless, limited only by the imagination. Good solutions can and will emerge. Among them is VTD-XML (http://vtd-xml.sf.net). To achieve high performance, VTD-XML approaches XML parsing through two object-less steps: (1) non-extractive tokenization and (2) hierarchical-directory-based random access. The result: VTD-XML drastically reduces the number of objects while still exporting the hierarchical structure to application developers, and it significantly outperforms SAX. (See Figure 3)
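As a rough illustration of what this looks like to an application developer, here is a sketch based on VTD-XML's published API (VTDGen, VTDNav, and AutoPilot); the document name and XPath expression are placeholders.

import java.nio.file.Files;
import java.nio.file.Paths;
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        byte[] doc = Files.readAllBytes(Paths.get("test.xml")); // placeholder document

        VTDGen vg = new VTDGen();
        vg.setDoc(doc);
        vg.parse(true);                 // non-extractive tokenization: records token offsets, not objects
        VTDNav vn = vg.getNav();        // cursor-style navigation over the token records

        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/order/item/price");    // hypothetical XPath
        int i;
        while ((i = ap.evalXPath()) != -1) {
            System.out.println(vn.toString(i)); // text of the matched token, decoded on demand
        }
    }
}

Because parsing records offsets into the original byte array instead of extracting strings into per-node objects, navigation and XPath evaluation work over a handful of long-lived buffers rather than a tree of Node objects.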

Summary
To summarize, the right way forward is to find better parsing techniques, beyond DOM and SAX, that significantly reduce the object-creation cost of building XML's tree structure. Binary XML won't fundamentally solve XML's performance issue because the problem belongs to XML parsers, not to XML itself.

More Stories By Jimmy Zhang

Jimmy Zhang is a cofounder of XimpleWare, a provider of high-performance XML processing solutions. He has worked in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and an MS from the EECS department of U.C. Berkeley.



Most Recent Comments
RTL 09/08/06 10:44:16 AM EDT

Sure the network may be "fast enough" from a raw mbps perspective. The problem is, in a real-world scenario your application may not be the only user of that bandwidth. And if everyone is stuffing the network full of data bloated with excessive overhead (such as what XML specifies), you can choke the network. It's not rocket science -- cut the data-bloat on the network in half and you double the capacity/performance. Even if the network is not running at full capacity, cutting the traffic in half (via more efficient data transport protocols) makes everything run more smoothly and doubles the lifetime of the existing infrastructure (i.e., it will take twice as long to fill up), ultimately reducing cost.

jzhang 09/07/06 09:46:36 PM EDT

This article is entirely about parsing performance, not the size of XML... the problem is that parsing can't keep up with the speed of the network...
XML is not slow, *XML parsers* are slow

RTL 09/07/06 03:32:13 PM EDT

This article seems to miss the point of the performance criticism of XML. The problem is not so much one of parsing (although that is an issue), but network bandwidth. From a bandwidth perspective, XML is just about the world's most inefficient protocol one could devise for transmitting data. If binary XML could cut the file size even just in half, that doubles an application's network performance.

There is no reason why you could not perform the same job as text-XML with binary-XML. You would gain significant performance benefits with the only downside being that you lose immediate human readability. You know... sometimes it seems that XML was embraced and championed by a lot of young, wet-behind-the-ears HTML hackers who didn't know how to read hex. :)

alucinor 08/31/06 05:32:15 PM EDT

> queZZtion commented on the 31 Aug 2006:
> MSFT submitted OpenXML to ECMA, anyone
> know if they plan to submit it to ISO too?

Microsoft's Open XML is just a delay tactic -- their old strategy of vaporware vaporware vaporware ... that sometimes materializes at the last second, never as grand as promised, but having accomplished its goal of causing everyone to say "Let's wait and see what Microsoft will do first!"

And MOX is Latin for "soon". Coincidence?!

queZZtion 08/31/06 05:30:54 PM EDT

MSFT submitted OpenXML to ECMA, anyone know if they plan to submit it to ISO too?

An0n 08/31/06 05:29:34 PM EDT

HR-XML, anyone? http://www.hr-xml.org/

Jimmy Zhang 08/02/06 09:57:28 PM EDT

Responding to your comment #1

It doesn't seem that we are disagreeing, because if the CPU is devoting many of its cycles to application logic, then there is less incentive to go to binary XML in the hope of speeding up overall app performance

Concerning your comment #2, built-in indexing is not just for binary XML, it can be done for XML as well, so this argument is quite weak... or do I misinterpret anything??

Henrik 07/31/06 04:56:45 AM EDT

I don't quite buy your argument.

1) You are only looking at a single process assuming that the CPU has nothing better to do than parsing and processing. On a server any performance improvement will help server throughput.

2) The main benefit of a binary format would be built-in indexing. If done properly, DOM wouldn't have to build much of a structure at all but rather work directly on the binary image and extract nodes on request.

3) I don't really see a SAX parser getting a drastic improvement though.

