Click here to close now.




















Welcome!

Microservices Expo Authors: Carmen Gonzalez, Elizabeth White, Yeshim Deniz, Pat Romanski, Liz McMillan

Related Topics: SYS-CON BRASIL, Industrial IoT, Microservices Expo

SYS-CON BRASIL: Article

i-Technology Viewpoint: The Performance Woe of Binary XML

XML Overhead – The Scapegoat's Story

[This article was our best-read XML-related article in 2006.]

Since its inception, XML has been criticized for the overhead it introduces into the enterprise infrastructure. Business data encoded in XML takes five to 10 times more bandwidth to transmit in the network and proportionally more disk space to store. While most agree that verbosity is inherent to XML's way of encoding information (e.g., extensive use of tags and pointy brackets), the explanation of XML's perceived performance issue remains inconclusive. A popular belief is that since XML is human-readable text, it has to be slow and inefficient. And by the same token, proponents of binary XML seem to suggest that a compact encoding format, most noticeably the binary XML, would automatically lead to better processing performance.

Does it make sense for doctors to prescribe medicine without a diagnosis? Whether those perceptions and beliefs have a grain of truth or not, one thing is certain: Without a solid understanding of XML's performance issue, it will be difficult, if not impossible, to devise meaningful solutions. So in this article, I'll attempt to dissect XML's performance issue by focusing on three key questions: (1) Does XML have a performance issue? (2) What is the real culprit behind XML's slow performance? (3) Can binary XML fundamentally solve the problem.

Performance Is Not an XML Issue Per Se
In networking system design, the OSI (Open System Interconnect) stack is the standard model that divides the functions of the network into seven layers. Each layer only uses the layer below and only exports functionalities to the layer above. Compared with monolithic approaches, the advantages of OSI's layered approach are the robustness, resilience, and extensibility of the networking system in the face of rapid technology evolution. For example, any Voice over IP applications will work without knowing the physical layer of the networks (e.g., using copper, fiber cable, or Wi-Fi), or the data link layer (e.g., Ethernet, Frame Relay, or ATM).

Likewise, we can take a similar layered approach to modeling XML-based applications. Figure 1 is a simplified view of this "XML protocol stack" consisting of three layers: the XML data layer, the XML parsing layer, and the application layer. The application layer only uses functions exported by the XML parsing layer, which translates the data from its physical representation (XML) into its logical representation (the infoset).

Several observations can be made concerning the perceived performance issue of XML. First and foremost, because the XML application can only go as fast as the XML parsers can process XML messages, performance is actually not an issue of XML per se, but instead an issue of the XML parsing layer. If an XML routing application (assuming minimum overhead at the application layer) can't keep pace with incoming XML messages at a gigabit per second, it's most likely because the throughput of XML parsing is less than that of the network. Consequently, the correct way to boost the application performance is to optimize the XML parsing layer. Just like tuning any software applications, the best way to do it is to discover then find ways to reduce or eliminate the overhead in XML parsers.

Object Allocation - the Real Culprit To get a feel of the performance offered by current XML parsing technologies, I benchmarked the parsing throughput of the two types of widely used XML parsers - Xerces DOM and Xerces SAX. The benchmark applications are quite simple. They first read an XML document into memory then invoke parser routines to parse the document for a large number of iterations and the parsing throughput is calculated by dividing the file size by the average latency of each parsing iteration. For SAX parsing, the content handler is set to NULL. Several XML document in varying sizes, tagginess, and structural complexity were chosen for the benchmark. The results, produced by a two-year old 1.7GHz Pentium M laptop, are summarized in Figure 2. The complete report, which includes test setup, methodology, and code, is available online at http://vtd-xml.sf.net/benchmark.html.

The benchmark results are quite consistent with the well-known performance characteristics of DOM and SAX. First of all, the raw parsing throughput of SAX, at between 20MB/sec~30MB/sec, is actually quite respectable. However, by not exposing the inherent structural information, SAX essentially treats XML as CSV with "pointy brackets," often making it prohibitively difficult to use and unsuitable as a general-purpose XML parser. DOM, on the other hand, lets developers navigate in-memory tree structure. But the benchmark results also show that, except for very small files, DOM is typically three to five times slower than SAX. Because DOM parsers internally usually use SAX to tokenize XML, by comparing the performance differences, it's clear that building the in-memory tree structure is where the bottleneck is. In other words, allocating all the objects and connecting them together dramatically slow everything down. As the next step, I ran JProfiler (from EJ-technology) to identify where DOM and SAX parsing spend all the CPU cycles. The results confirmed my early suspicion that the overhead of object allocation overwhelmingly bottlenecks DOM parsing and, to a lesser (but still significant) degree, SAX parsing as well.

Some readers may question that DOM - only an API specification - doesn't preclude efficient, less object-intensive, implementations. Not so. The DOM spec is in fact based entirely on the assumption that the hierarchical structure consists entirely of objects exposing the Node interface. The most any DOM implementation can do is alter the implementation of the object sitting behind the Node interface, and it's impossible to rip away the objects altogether. So, if the object creation is the main culprit, the DOM spec itself is the accomplice that makes any performance-oriented optimizations prohibitively difficult. And this is why, after the past eight years and countless efforts by all major IT companies, every implementation of DOM has only seen marginal performance improvement.

Binary XML Solves the Wrong Problem
Can binary XML fundamentally solve the performance issue of XML? Well, since the performance issue belongs to XML parsers, a better question to ask is whether binary XML can help make parsing more efficient. The answer has two parts - one for SAX, the other for DOM.

Binary XML can improve the raw SAX parsing speed. There are a lot of XML-specific syntax features that binary XML can choose not to inherit. For example, binary XML can replace the ending tags with something more efficient, entirely avoiding attributes so SAX parsing no longer does uniqueness checking, or find other ways to represent the document structure. There are many ways to trim CPU cycles off SAX's tokenization cost. And proponents of binary XML often cites up to a 10x speedup for the binary XML version of SAX over text XML.

But they ignored the simple fact that SAX has serious usability issues. The awkward forward-only nature of SAX parsing not only requires extra implementation effort, but also incurs performance penalties when the document structure becomes only slightly complex. If developers choose not to scan the document multiple times, they'll have to buffer the document or build custom object models. In addition, SAX doesn't work well with XPath, and in general can't drive XSLT processing (binary XML has to be transformed as well, right?). So pointing to SAX's raw performance for binary XML as a proof of binary XML's merit is both unfair and misleading. The bottom line: for a XML processing model to be broadly useful, the API must expose the inherent structure of XML.

Unfortunately, binary XML won't make much difference improving DOM parsing for the simple reason that DOM parsing generally spends most CPU cycles building in-memory tree structure, not on tokenization. So the speedup of DOM due to faster SAX parsing is quite limited. In other words, DOM parsing for binary XML will be slow as well. And going one step further, object-graph based parser will have the same kind of performance issue for virtually any data format such as DCOM, RMI, or CORBA. XML is merely the scapegoat.

Reduce Object Creation - the Correct Approach
From my benchmark and profiling results, it's quite easy to see that the best way to remove the performance bottleneck in XML parsing is to reduce the object-creation cost of building the in-memory hierarchical structure. And the possibilities are, in fact, endless, and only limited by the imagination. Good solutions can and will emerge. And among them is VTD-XML (http://vtd-xml.sf.net). To achieve high performance, VTD-XML approaches XML parsing through the two object-less steps: (1) non-extractive tokenization and (2) hierarchical-directory-based random access. The result: VTD-XML drastically reduces the number of objects while still exporting the hierarchical structure to application developers and significantly outperforming SAX. (See Figure 3)

Summary
To summarize, the right way is to find better parsing techniques beyond DOM and SAX that significantly reduce the object-creation cost of building XML's tree structure. Binary XML won't fundamentally solve XML's performance issue because the problem belongs to XML parsers, not XML.

More Stories By Jimmy Zhang

Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

Comments (10) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
RTL 09/08/06 10:44:16 AM EDT

Sure the network may be "fast enough" from a raw mbps perspective. The problem is, in a real-world scenario your application may not be the only user of that bandwidth. And if everyone is stuffing the network full of data bloated with excessive overhead (such as what XML specifies), you can choke the network. It's not rocket science -- cut the data-bloat on the network in half and you double the capacity/performance. Even if the network is not running at full capacity, cutting the traffic in half (via more efficient data transport protocols) makes everything run more smoothly and doubles the lifetime of the existing infrastructure (i.e., it will take twice as long to fill up), ultimately reducing cost.

jzhang 09/07/06 09:46:36 PM EDT

This article is entirely about parsing performance, not the size of XML... the problem is that parsing can't keep up with the speed of the network...
XmL is not slow, *xml parsers* are slow

RTL 09/07/06 03:32:13 PM EDT

This article seems to miss the point of the performance critism of XML. The problem is not so much one of parsing (although that is an issue), but network bandwidth. From a bandwith perspective, XML is just about the world's most inefficient protocol one could devise for transmitting data. If binary XML could cut the file size even just in half, that doubles an applications network performance.

There is no reason why you could not perform the same job as text-XML with binary-XML. You would gain significant performance benefits with the only downside being that you lose immediate human readability. You know... sometimes it seems that XML was embraced and championed by a lot of young, wet-behind-the-ears HTML hackers who didn't know how to read hex. :)

alucinor 08/31/06 05:32:15 PM EDT

> queZZtion commented on the 31 Aug 2006:
> MSFT submitted OpenXML to ECMA, anyone
> know if they plan to submit it to ISO too?

Microsoft's Open XML is just a delay tactic -- their old strategy of vaporware vaporware vaporware ... that sometimes materializes at the last second, never as grand as promised, but having accomplished its goal of causing everyone to say "Let's wait and see what Microsoft will do first!"

And MOX is Latin for "soon". Coincidence?!

queZZtion 08/31/06 05:30:54 PM EDT

MSFT submitted OpenXML to ECMA, anyone know if they plan to submit it to ISO too?

An0n 08/31/06 05:29:34 PM EDT

HR-XML Anyone? View link: http://www.hr-xml.org/

Jimmy Zhang 08/02/06 09:57:28 PM EDT

Responding to your comment #1

It doesn't see that we are disagreeing, because if the CPU is devoting much cycles on application logic, then there is less incentive going to binary XML with the hope of speeding up overall app performance

Concerning your comment #2, built-in indexing is not just for binary XML, it can be done for XML as well, so this argument is quite weak... or do I misintepret anything??

Henrik 07/31/06 04:56:45 AM EDT

I don't quite buy your argument.

1) You are only looking at a single process assuming that the CPU has nothing better to do than parsing and processing. On a server any performance improvement will help server throughput.

2) The main benefit of a binary format would be built in indexing. If done properly DOM wouldn't have to build much of a structure at all but rather work directly on the binary image and extract nodes on request.

3) I don't really see a SAX parser getting a drastic improvement though.

XML News Desk 07/30/06 11:37:30 AM EDT

Since its inception, XML has been criticized for the overhead it introduces into the enterprise infrastructure. Business data encoded in XML takes five to 10 times more bandwidth to transmit in the network and proportionally more disk space to store. While most agree that verbosity is inherent to XML's way of encoding information (e.g., extensive use of tags and pointy brackets), the explanation of XML's perceived performance issue remains inconclusive. A popular belief is that since XML is human-readable text, it has to be slow and inefficient. And by the same token, proponents of binary XML seem to suggest that a compact encoding format, most noticeably the binary XML, would automatically lead to better processing performance.

XML News Desk 07/30/06 08:54:25 AM EDT

Since its inception, XML has been criticized for the overhead it introduces into the enterprise infrastructure. Business data encoded in XML takes five to 10 times more bandwidth to transmit in the network and proportionally more disk space to store. While most agree that verbosity is inherent to XML's way of encoding information (e.g., extensive use of tags and pointy brackets), the explanation of XML's perceived performance issue remains inconclusive. A popular belief is that since XML is human-readable text, it has to be slow and inefficient. And by the same token, proponents of binary XML seem to suggest that a compact encoding format, most noticeably the binary XML, would automatically lead to better processing performance.

@MicroservicesExpo Stories
Containers are not new, but renewed commitments to performance, flexibility, and agility have propelled them to the top of the agenda today. By working without the need for virtualization and its overhead, containers are seen as the perfect way to deploy apps and services across multiple clouds. Containers can handle anything from file types to operating systems and services, including microservices. What are microservices? Unlike what the name implies, microservices are not necessarily small,...
Any Ops team trying to support a company in today’s cloud-connected world knows that a new way of thinking is required – one just as dramatic than the shift from Ops to DevOps. The diversity of modern operations requires teams to focus their impact on breadth vs. depth. In his session at DevOps Summit, Adam Serediuk, Director of Operations at xMatters, Inc., will discuss the strategic requirements of evolving from Ops to DevOps, and why modern Operations has begun leveraging the “NoOps” approa...
17th Cloud Expo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterprises ar...
ElasticBox, the agile application delivery manager, announced freely available public boxes for the DevOps community. ElasticBox works with enterprises to help them deploy any application to any cloud. Public boxes are curated reference boxes that represent some of the most popular applications and tools for orchestrating deployments at scale. Boxes are an adaptive way to represent reusable infrastructure as components of code. Boxes contain scripts, variables, and metadata to automate proces...
SYS-CON Events announced today that the "Second Containers & Microservices Expo" will take place November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Containers and microservices have become topics of intense interest throughout the cloud developer and enterprise IT communities.
DevOps Summit, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development...
Akana has announced the availability of the new Akana Healthcare Solution. The API-driven solution helps healthcare organizations accelerate their transition to being secure, digitally interoperable businesses. It leverages the Health Level Seven International Fast Healthcare Interoperability Resources (HL7 FHIR) standard to enable broader business use of medical data. Akana developed the Healthcare Solution in response to healthcare businesses that want to increase electronic, multi-device acce...
In his session at 17th Cloud Expo, Ernest Mueller, Product Manager at Idera, will explain the best practices and lessons learned for tracking and optimizing costs while delivering a cloud-hosted service. He will describe a DevOps approach where the applications and systems work together to track usage, model costs in a granular fashion, and make smart decisions at runtime to minimize costs. The trickier parts covered include triggering off the right metrics; balancing resilience and redundancy ...
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo in Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading in...
The 17th International Cloud Expo has announced that its Call for Papers is open. 17th International Cloud Expo, to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, APM, APIs, Microservices, Security, Big Data, Internet of Things, DevOps and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding bu...
The 5th International DevOps Summit, co-located with 17th International Cloud Expo – being held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the ...
Early in my DevOps Journey, I was introduced to a book of great significance circulating within the Web Operations industry titled The Phoenix Project. (You can read our review of Gene’s book, if interested.) Written as a novel and loosely based on many of the same principles explored in The Goal, this book has been read and referenced by many who have adopted DevOps into their continuous improvement and software delivery processes around the world. As I began planning my travel schedule last...
Skeuomorphism usually means retaining existing design cues in something new that doesn’t actually need them. However, the concept of skeuomorphism can be thought of as relating more broadly to applying existing patterns to new technologies that, in fact, cry out for new approaches. In his session at DevOps Summit, Gordon Haff, Senior Cloud Strategy Marketing and Evangelism Manager at Red Hat, discussed why containers should be paired with new architectural practices such as microservices rathe...
SYS-CON Events announced today the Containers & Microservices Bootcamp, being held November 3-4, 2015, in conjunction with 17th Cloud Expo, @ThingsExpo, and @DevOpsSummit at the Santa Clara Convention Center in Santa Clara, CA. This is your chance to get started with the latest technology in the industry. Combined with real-world scenarios and use cases, the Containers and Microservices Bootcamp, led by Janakiram MSV, a Microsoft Regional Director, will include presentations as well as hands-on...
SYS-CON Events announced today that Pythian, a global IT services company specializing in helping companies leverage disruptive technologies to optimize revenue-generating systems, has been named “Bronze Sponsor” of SYS-CON's 17th Cloud Expo, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Founded in 1997, Pythian is a global IT services company that helps companies compete by adopting disruptive technologies such as cloud, Big Data, advance...
DevOps has traditionally played important roles in development and IT operations, but the practice is quickly becoming core to other business functions such as customer success, business intelligence, and marketing analytics. Modern marketers today are driven by data and rely on many different analytics tools. They need DevOps engineers in general and server log data specifically to do their jobs well. Here’s why: Server log files contain the only data that is completely full and accurate in th...
Several years ago, I was a developer in a travel reservation aggregator. Our mission was to pull flight and hotel data from a bunch of cryptic reservation platforms, and provide it to other companies via an API library - for a fee. That was before companies like Expedia standardized such things. We started with simple methods like getFlightLeg() or addPassengerName(), each performing a small, well-understood function. But our customers wanted bigger, more encompassing services that would "do ...
Culture is the most important ingredient of DevOps. The challenge for most organizations is defining and communicating a vision of beneficial DevOps culture for their organizations, and then facilitating the changes needed to achieve that. Often this comes down to an ability to provide true leadership. As a CIO, are your direct reports IT managers or are they IT leaders? The hard truth is that many IT managers have risen through the ranks based on their technical skills, not their leadership ab...
Whether you like it or not, DevOps is on track for a remarkable alliance with security. The SEC didn’t approve the merger. And your boss hasn’t heard anything about it. Yet, this unruly triumvirate will soon dominate and deliver DevSecOps faster, cheaper, better, and on an unprecedented scale. In his session at DevOps Summit, Frank Bunger, VP of Customer Success at ScriptRock, will discuss how this cathartic moment will propel the DevOps movement from such stuff as dreams are made on to a prac...
SYS-CON Events announced today that HPM Networks will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. For 20 years, HPM Networks has been integrating technology solutions that solve complex business challenges. HPM Networks has designed solutions for both SMB and enterprise customers throughout the San Francisco Bay Area.