Microservices Journal: Article

Making Sense of Large and Growing Data Volumes

MapReduce won’t overtake the enterprise data warehouse industry anytime soon

Is MapReduce the Holy Grail for processing, analyzing and making sense of large and growing data volumes? It certainly has potential in this arena, but there is a distressing gap between the amount of hype this technology and its spinoffs have received and the number of professionals who actually know how to integrate it and make best use of it.

Industry watchers say it's just a matter of time before MapReduce sweeps through the enterprise data warehouse (EDW) market the same way open source technologies like Linux have done. In fact, in a recent blog post, Forrester's James Kobielus proclaimed that most EDW vendors will incorporate support for MapReduce's open source cousin Hadoop into the heart of their architectures to enable open, standards-based data analytics on massive amounts of data.

So, no more databases, just MapReduce? I'm not so sure. But don't misunderstand. It's not that MapReduce isn't an effective way to analyze data in some cases. The big names in Internet business are all using it - Facebook, Google, Amazon, eBay et al - so it must be good, right? But it's worth taking a more measured view based on both the technical and the practical business merits. I believe that the two technologies are not mutually exclusive: they will work hand-in-hand and, in some cases, MapReduce will be integrated into the relational database (RDBMS).

Google certainly has proven that MapReduce excels at making sense out of the exabytes of unstructured data on the web, which it should, given that MapReduce was designed from the outset for manipulating very large data sets. MapReduce in this sense provides a way to put structure around unstructured data. We humans prefer structure; it's in our DNA. Without structure, we have no real way of adding value to the data. Unstructured data analytics is something of an oxymoron for a pattern-seeking hominid.

MapReduce helps us put structure around the unstructured so we can then make sense of it. It creates an environment in which a data analyst can write two simple functions, a "mapper" and a "reducer," to perform the actual data manipulation. The result is at once an analysis of the data that has just been mapped and summarized and a structure for further analysis that can provide insight into the data. Whether that further analysis should be done in a MapReduce environment is the more pertinent question.
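To make the "mapper" and "reducer" idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop or Google's implementation; the function names (`mapper`, `reducer`, `run_job`) and the word-count task are assumptions chosen for illustration, and the grouping step that a real framework would distribute across machines is simulated with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of raw text."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Collapse all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_job(lines):
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # A real framework would run the reducers in parallel, one key group at a time.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


result = run_job(["big data big hype", "data warehouse"])
print(result)
```

The output is a small structured table (word, count) derived from unstructured text, which is the sense in which the result doubles as a structure for further analysis.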

From an infrastructure standpoint, MapReduce excels where performance and scalability are challenges. Applications written using the MapReduce framework are automatically parallelized, which makes the framework well suited to a large infrastructure of connected machines. As it scales applications across many servers, the framework also provides built-in query fault tolerance, so that if a hardware component fails, the query is completed by another machine. Further, MapReduce and its open source brethren can perform functions that are not possible in standard SQL, such as click-stream sessionization, nPath, and graph production of potentially unbounded length.
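Click-stream sessionization, mentioned above, is a useful example of why the map and reduce split parallelizes well: clicks are keyed by user in the map step, and each user's clicks can then be reduced into sessions independently of every other user. The sketch below runs in one process; the 30-minute inactivity cutoff, the record layout `(user_id, timestamp_in_seconds)`, and all function names are assumptions for illustration only.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity cutoff, not a standard value


def map_click(record):
    """Key each click by user so all of a user's clicks reach one reducer."""
    user_id, timestamp = record
    return user_id, timestamp


def reduce_sessions(timestamps, gap=SESSION_GAP_SECONDS):
    """Split one user's clicks into sessions wherever the gap exceeds the cutoff."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])  # start a new session
    return sessions


def sessionize(records):
    """Single-process stand-in for the shuffle: group by user, then reduce."""
    by_user = defaultdict(list)
    for record in records:
        user_id, ts = map_click(record)
        by_user[user_id].append(ts)
    # Each user's reduce is independent, so a framework could run them in parallel.
    return {user: reduce_sessions(stamps) for user, stamps in by_user.items()}


clicks = [("alice", 0), ("bob", 100), ("alice", 600), ("alice", 4000), ("bob", 200)]
sessions = sessionize(clicks)
print(sessions)
```

Because no reducer needs data belonging to another user, a failed reduce task can simply be rerun on another machine, which is the basis of the fault tolerance described above.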

What's not to love? At a basic level I believe the MapReduce framework is an inefficient way of analyzing data for the vast majority of businesses. The aforementioned capabilities of MapReduce are all well and good, provided you have a Google-like business replete with legions of programmers and vast amounts of server and memory capacity. Viewed from this perspective, it makes perfect sense that Google developed and used MapReduce: because it could. It had a huge and growing resource in its farms of custom-made servers, as well as armies of programmers constantly looking for new ways to take advantage of that seemingly infinite hardware (and the data collected on it), to do cool new things.

Similarly, the other high-profile adopters and advocates are also IT-savvy, IT-heavy companies and, like Google, have the means and ongoing incentive to get a MapReduce framework tailored to their particular needs and reap the benefits. Would a mid-size firm know how? It seems doubtful. While Google has claimed that MapReduce is easy to use, even for programmers without experience with distributed systems, I know from field experience with customers that it does, in fact, take some pretty experienced folks to make best use of it.

Projects like Hive, Google Sawzall and Yahoo Pig, and companies like Cloudera, all attempt in essence to make the MapReduce paradigm easier for non-experts to use and, in fact, to make it behave for the end user more like a parallel database. But this raises the question: Why? It seems to be a bit of re-inventing the wheel. IT-heavy is not how most businesses operate today, especially in these economic times. The dot-com bubble is long over. Hardware budgets are limited and few companies relish the idea of hiring teams of programming experts to maintain even a valuable IT asset such as their data warehouse. They'd rather buy an off-the-shelf tool designed from the ground up to do high-speed data analytics.

Like MapReduce, commercially available massively parallel processing (MPP) databases built specifically for rapid, high-volume data analytics provide immense data scale and query fault tolerance. They also have a proven track record of customer deployments and deliver equal if not better performance on Big Data problems. Perhaps as important, today's next-generation MPP analytic databases give businesses the flexibility to draw on a deep pool of IT labor skilled in established conventions such as SQL.
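For contrast with hand-written map and reduce functions, the sketch below shows the kind of declarative aggregation an analyst with SQL skills would write. It uses Python's built-in `sqlite3` module with an in-memory database purely so the snippet runs anywhere; SQLite is not an MPP database, and the `sales` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for an analytic database (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# One declarative statement expresses the grouping and the aggregation;
# how the work is parallelized is left to the database engine.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```

The analyst states what result is wanted and leaves execution planning to the engine, which is the skill set the paragraph above refers to.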

As mentioned earlier, unstructured data seems like a natural fit for MapReduce analysis. A rising tide of chatter is focused on the increasing problem, and importance, of unstructured data. There is more than a bit of truth to this. As the Internet of Everything becomes more and more a reality, data is generated everywhere; but our experience to date is that businesses are most interested in data derived from the transactional systems on which they have built their operations, where structure is a given.

Another difficulty faces companies even as MapReduce becomes more integrated into the overall enterprise data analysis strategy. MapReduce is a framework. As the hype and interest have grown, MapReduce solutions are being created by database vendors in entirely non-standard and incompatible ways. This will further limit the likelihood that it will become the centerpiece of an EDW. Business has demonstrated time and again that it prefers open standards and interoperability.

Finally, I believe a move toward a programmer-centric approach to data analysis is both inefficient and contrary to all other prevailing trends of technology use in the enterprise. From the mobile workforce to the rise of social enterprise computing, the momentum is away from hierarchy. I believe this trend is the only way the problem of making Big Data actionable will be effectively addressed. In his classic book on the virtues of open source programming, The Cathedral and the Bazaar, Eric S. Raymond put forth the idea that open source was an effective way to address the complexity and density of information inherent in developing good software code. His proposition, "given enough eyeballs, all bugs are shallow," could easily be restated for Big Data as, "given enough analysts, all trends are apparent." The trick is - and really always has been - to get more people looking at the data. You don't achieve that end by centering your data analytics efforts on a tool largely geared to the skills of technical wizards.

MapReduce-type solutions as they currently exist are most effective when used by programmer-led organizations focused on maximizing their growing IT assets. For most businesses seeking the most efficient way to quickly turn their most valuable data into revenue-generating insight, MPP databases will likely continue to hold sway, even as MapReduce-based solutions find a supporting role.

More Stories By Roger Gaskell

Roger Gaskell, CTO of Kognitio, has overall responsibility for all product development. He has been instrumental in all generations of the WX and WX2 database products to date, including their evolution from a database application running on proprietary hardware to a software-only analytical database built on industry-standard blade servers.

Prior to Kognitio, Roger was test and development manager at AB Electronics for five years. During this time his primary responsibility was for the famous BBC Micro Computer and the development and testing of the first mass production of personal computers for IBM.
