i-Technology Viewpoint: The Performance Woe of Binary XML

XML Overhead – The Scapegoat's Story

[This article was our best-read XML-related article in 2006.]

Since its inception, XML has been criticized for the overhead it introduces into the enterprise infrastructure. Business data encoded in XML takes five to 10 times more bandwidth to transmit over the network and proportionally more disk space to store. While most agree that verbosity is inherent to XML's way of encoding information (e.g., extensive use of tags and pointy brackets), the explanation for XML's perceived performance issue remains inconclusive. A popular belief is that since XML is human-readable text, it has to be slow and inefficient. And by the same token, proponents of binary XML seem to suggest that a compact encoding format, most notably binary XML, would automatically lead to better processing performance.

Does it make sense for doctors to prescribe medicine without a diagnosis? Whether those perceptions and beliefs have a grain of truth or not, one thing is certain: without a solid understanding of XML's performance issue, it will be difficult, if not impossible, to devise meaningful solutions. So in this article, I'll attempt to dissect XML's performance issue by focusing on three key questions: (1) Does XML have a performance issue? (2) What is the real culprit behind XML's slow performance? (3) Can binary XML fundamentally solve the problem?

Performance Is Not an XML Issue Per Se
In networking system design, the OSI (Open Systems Interconnection) stack is the standard model that divides the functions of the network into seven layers. Each layer only uses the layer below and only exports functionality to the layer above. Compared with monolithic approaches, the advantages of OSI's layered approach are the robustness, resilience, and extensibility of the networking system in the face of rapid technology evolution. For example, any Voice over IP application will work without knowing the physical layer of the network (e.g., copper, fiber cable, or Wi-Fi) or the data link layer (e.g., Ethernet, Frame Relay, or ATM).

Likewise, we can take a similar layered approach to modeling XML-based applications. Figure 1 is a simplified view of this "XML protocol stack" consisting of three layers: the XML data layer, the XML parsing layer, and the application layer. The application layer only uses functions exported by the XML parsing layer, which translates the data from its physical representation (XML) into its logical representation (the infoset).

Several observations can be made concerning the perceived performance issue of XML. First and foremost, because the XML application can only go as fast as the XML parser can process XML messages, performance is actually not an issue of XML per se, but rather an issue of the XML parsing layer. If an XML routing application (assuming minimum overhead at the application layer) can't keep pace with incoming XML messages at a gigabit per second, it's most likely because the throughput of XML parsing is less than that of the network. Consequently, the correct way to boost application performance is to optimize the XML parsing layer. Just as with tuning any software application, the best way to do that is to discover, and then reduce or eliminate, the overhead in the XML parsers.
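
To put a rough number on that gap (a back-of-the-envelope illustration, not a figure from the benchmark below): a fully utilized gigabit link delivers on the order of 125MB of payload per second, so any parsing layer whose throughput is well below that - rather than the network itself - becomes the ceiling on what the application above it can do.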

Object Allocation - the Real Culprit
To get a feel for the performance offered by current XML parsing technologies, I benchmarked the parsing throughput of two widely used XML parsers - Xerces DOM and Xerces SAX. The benchmark applications are quite simple: they first read an XML document into memory, then invoke parser routines to parse the document for a large number of iterations, and the parsing throughput is calculated by dividing the file size by the average latency of each parsing iteration. For SAX parsing, the content handler is set to NULL. Several XML documents of varying sizes, tagginess, and structural complexity were chosen for the benchmark. The results, produced on a two-year-old 1.7GHz Pentium M laptop, are summarized in Figure 2. The complete report, which includes the test setup, methodology, and code, is available online at http://vtd-xml.sf.net/benchmark.html.
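
The benchmark code itself isn't reproduced in this article, but a minimal sketch of the same methodology, written against the standard JAXP factories (which Xerces plugs into), might look like the following. The file name, iteration count, and lack of warm-up runs are illustrative placeholders; the published report at http://vtd-xml.sf.net/benchmark.html remains the authoritative version.

import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseThroughput {
    public static void main(String[] args) throws Exception {
        // Read the whole document into memory first, so disk I/O is excluded.
        byte[] doc = Files.readAllBytes(Paths.get("test.xml"));
        int iterations = 1000;

        // SAX: a do-nothing handler approximates setting the content handler to NULL.
        SAXParser sax = SAXParserFactory.newInstance().newSAXParser();
        DefaultHandler noop = new DefaultHandler();
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sax.parse(new ByteArrayInputStream(doc), noop);
        }
        report("SAX", doc.length, iterations, System.nanoTime() - start);

        // DOM: same document, building the full in-memory tree on every pass.
        DocumentBuilder dom = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            dom.parse(new ByteArrayInputStream(doc));
        }
        report("DOM", doc.length, iterations, System.nanoTime() - start);
    }

    // Throughput = file size divided by the average latency of one parsing pass.
    static void report(String label, int bytes, int iters, long elapsedNanos) {
        double avgSeconds = (elapsedNanos / 1e9) / iters;
        double mbPerSec = (bytes / (1024.0 * 1024.0)) / avgSeconds;
        System.out.printf("%s: %.1f MB/sec%n", label, mbPerSec);
    }
}

Because the document is pre-loaded into a byte array and the SAX handler does nothing, what remains to be measured is essentially tokenization cost for SAX, and tokenization plus tree construction for DOM.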

The benchmark results are quite consistent with the well-known performance characteristics of DOM and SAX. First of all, the raw parsing throughput of SAX, at 20MB/sec to 30MB/sec, is actually quite respectable. However, by not exposing the inherent structural information, SAX essentially treats XML as CSV with "pointy brackets," often making it prohibitively difficult to use and unsuitable as a general-purpose XML parser. DOM, on the other hand, lets developers navigate an in-memory tree structure. But the benchmark results also show that, except for very small files, DOM is typically three to five times slower than SAX. Because DOM parsers usually use SAX internally to tokenize XML, the performance difference makes it clear that building the in-memory tree structure is where the bottleneck is. In other words, allocating all the objects and connecting them together dramatically slows everything down. As the next step, I ran JProfiler (from ej-technologies) to identify where DOM and SAX parsing spend all the CPU cycles. The results confirmed my earlier suspicion that the overhead of object allocation overwhelmingly bottlenecks DOM parsing and, to a lesser (but still significant) degree, SAX parsing as well.

Some readers may object that DOM, being only an API specification, doesn't preclude efficient, less object-intensive implementations. Not so. The DOM spec is in fact based entirely on the assumption that the hierarchical structure consists of objects exposing the Node interface. The most any DOM implementation can do is alter the implementation of the object sitting behind the Node interface; it's impossible to rip away the objects altogether. So if object creation is the main culprit, the DOM spec itself is the accomplice that makes any performance-oriented optimization prohibitively difficult. And this is why, after the past eight years and countless efforts by all major IT companies, every implementation of DOM has seen only marginal performance improvement.
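
To make that constraint concrete, here is a small illustration of my own (not code from the article): any application that consumes a DOM tree does so by walking org.w3c.dom.Node references, and every element and text node the walk visits is a separate heap object the parser had to allocate and link up before the application saw its first byte of data.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class CountNodes {
    public static void main(String[] args) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("test.xml");
        System.out.println("node objects reachable from the root: " + count(d));
    }

    // Every Node reached here is a distinct object allocated during parsing.
    static int count(Node n) {
        int total = 1;
        for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
            total += count(c);
        }
        return total;
    }
}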

Binary XML Solves the Wrong Problem
Can binary XML fundamentally solve the performance issue of XML? Well, since the performance issue belongs to XML parsers, a better question to ask is whether binary XML can help make parsing more efficient. The answer has two parts - one for SAX, the other for DOM.

Binary XML can improve raw SAX parsing speed. There are many XML-specific syntax features that binary XML can choose not to inherit. For example, binary XML can replace the ending tags with something more efficient, avoid attributes entirely so that SAX parsing no longer has to do uniqueness checking, or find other ways to represent the document structure. There are many ways to trim CPU cycles off SAX's tokenization cost, and proponents of binary XML often cite up to a 10x speedup for the binary XML version of SAX over text XML.

But this argument ignores the simple fact that SAX has serious usability issues. The awkward, forward-only nature of SAX parsing not only requires extra implementation effort, but also incurs performance penalties when the document structure becomes even slightly complex. If developers choose not to scan the document multiple times, they'll have to buffer the document or build custom object models. In addition, SAX doesn't work well with XPath and in general can't drive XSLT processing (binary XML has to be transformed as well, right?). So pointing to SAX's raw performance on binary XML as proof of binary XML's merit is both unfair and misleading. The bottom line: for an XML processing model to be broadly useful, the API must expose the inherent structure of XML.
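
As a rough illustration of that usability cost (again a sketch of my own, with element names invented for the example): extracting even a single field such as /order/customer/name with SAX forces the handler to track where it is in the tree and buffer character data by hand - bookkeeping that a tree-based API gives you for free.

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Extracts /order/customer/name; even this trivial query needs hand-rolled state.
public class NameHandler extends DefaultHandler {
    private boolean inCustomer;          // somewhere under <customer>?
    private boolean inName;              // inside <name> under <customer>?
    private final StringBuilder buf = new StringBuilder();

    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("customer".equals(qName)) inCustomer = true;
        else if (inCustomer && "name".equals(qName)) inName = true;
    }

    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so it has to be buffered.
        if (inName) buf.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        if (inName && "name".equals(qName)) {
            System.out.println("customer name: " + buf);
            inName = false;
        } else if ("customer".equals(qName)) {
            inCustomer = false;
        }
    }
}

Passed to SAXParser.parse() as the handler, this works for one field in one pass; pull several unrelated fields, or fields whose meaning depends on ancestors further up, and the hand-rolled state quickly grows into exactly the custom object model described above.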

Unfortunately, binary XML won't make much of a difference in improving DOM parsing, for the simple reason that DOM parsing generally spends most of its CPU cycles building the in-memory tree structure, not on tokenization. So the speedup of DOM due to faster SAX-style tokenization is quite limited. In other words, DOM parsing of binary XML will be slow as well. Going one step further, an object-graph-based parser will have the same kind of performance issue for virtually any data format - DCOM, RMI, or CORBA alike. XML is merely the scapegoat.

Reduce Object Creation - the Correct Approach
From my benchmark and profiling results, it's quite easy to see that the best way to remove the performance bottleneck in XML parsing is to reduce the object-creation cost of building the in-memory hierarchical structure. The possibilities are, in fact, limited only by the imagination. Good solutions can and will emerge, and among them is VTD-XML (http://vtd-xml.sf.net). To achieve high performance, VTD-XML approaches XML parsing through two object-less steps: (1) non-extractive tokenization and (2) hierarchical-directory-based random access. The result: VTD-XML drastically reduces the number of objects while still exporting the hierarchical structure to application developers, and it significantly outperforms SAX (see Figure 3).
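
For readers who want to see what the object-less model looks like from the application side, here is a minimal usage sketch; the class and method names (VTDGen, VTDNav, AutoPilot) follow the library's published examples as best I recall them, and the file name and XPath are invented for the example, so treat the project site as the authoritative reference.

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdExample {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // Parsing fills a token table (VTD records) instead of building a tree of objects.
        if (vg.parseFile("test.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/order/customer/name");
            // Hierarchical, random access over the token table via XPath.
            while (ap.evalXPath() != -1) {
                int i = vn.getText();
                if (i != -1) System.out.println(vn.toString(i));
            }
        }
    }
}

The point of the sketch is what it doesn't do: no node object is allocated per element, and navigation and XPath evaluation work over integer token indexes, which is where the reduced object-creation cost comes from.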

Summary
To summarize, the right way forward is to find better parsing techniques, beyond DOM and SAX, that significantly reduce the object-creation cost of building XML's tree structure. Binary XML won't fundamentally solve XML's performance issue because the problem belongs to XML parsers, not to XML itself.

More Stories By Jimmy Zhang

Jimmy Zhang is a co-founder of XimpleWare, a provider of high-performance XML processing solutions. He has worked in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and an MS in EECS from U.C. Berkeley.

Most Recent Comments
RTL 09/08/06 10:44:16 AM EDT

Sure the network may be "fast enough" from a raw mbps perspective. The problem is, in a real-world scenario your application may not be the only user of that bandwidth. And if everyone is stuffing the network full of data bloated with excessive overhead (such as what XML specifies), you can choke the network. It's not rocket science -- cut the data-bloat on the network in half and you double the capacity/performance. Even if the network is not running at full capacity, cutting the traffic in half (via more efficient data transport protocols) makes everything run more smoothly and doubles the lifetime of the existing infrastructure (i.e., it will take twice as long to fill up), ultimately reducing cost.

jzhang 09/07/06 09:46:36 PM EDT

This article is entirely about parsing performance, not the size of XML... the problem is that parsing can't keep up with the speed of the network...
XML is not slow, *XML parsers* are slow

RTL 09/07/06 03:32:13 PM EDT

This article seems to miss the point of the performance criticism of XML. The problem is not so much one of parsing (although that is an issue), but network bandwidth. From a bandwidth perspective, XML is just about the world's most inefficient protocol one could devise for transmitting data. If binary XML could cut the file size even just in half, that would double an application's network performance.

There is no reason why you could not perform the same job as text-XML with binary-XML. You would gain significant performance benefits with the only downside being that you lose immediate human readability. You know... sometimes it seems that XML was embraced and championed by a lot of young, wet-behind-the-ears HTML hackers who didn't know how to read hex. :)

alucinor 08/31/06 05:32:15 PM EDT

> queZZtion commented on the 31 Aug 2006:
> MSFT submitted OpenXML to ECMA, anyone
> know if they plan to submit it to ISO too?

Microsoft's Open XML is just a delay tactic -- their old strategy of vaporware vaporware vaporware ... that sometimes materializes at the last second, never as grand as promised, but having accomplished its goal of causing everyone to say "Let's wait and see what Microsoft will do first!"

And MOX is Latin for "soon". Coincidence?!

queZZtion 08/31/06 05:30:54 PM EDT

MSFT submitted OpenXML to ECMA, anyone know if they plan to submit it to ISO too?

An0n 08/31/06 05:29:34 PM EDT

HR-XML Anyone? View link: http://www.hr-xml.org/

Jimmy Zhang 08/02/06 09:57:28 PM EDT

Responding to your comment #1

It doesn't seem that we are disagreeing, because if the CPU is devoting many cycles to application logic, then there is less incentive to go to binary XML in the hope of speeding up overall app performance.

Concerning your comment #2, built-in indexing is not just for binary XML; it can be done for XML as well, so this argument is quite weak... or am I misinterpreting anything?

Henrik 07/31/06 04:56:45 AM EDT

I don't quite buy your argument.

1) You are only looking at a single process assuming that the CPU has nothing better to do than parsing and processing. On a server any performance improvement will help server throughput.

2) The main benefit of a binary format would be built-in indexing. If done properly, DOM wouldn't have to build much of a structure at all but could instead work directly on the binary image and extract nodes on request.

3) I don't really see a SAX parser getting a drastic improvement though.
