

Microservices Expo: Article

Making Sense of Large and Growing Data Volumes

MapReduce won’t overtake the enterprise data warehouse industry anytime soon

Is MapReduce the Holy Grail for the pressing problem of processing, analyzing and making sense of large and growing data volumes? It certainly has potential in this arena, but there is a distressing gap between the amount of hype this technology - and its spinoffs - has received and the number of professionals who actually know how to integrate it and make best use of it.

Industry watchers say it's just a matter of time before MapReduce sweeps through the enterprise data warehouse (EDW) market the same way open source technologies like Linux have done. In fact, in a recent blog post, Forrester's James Kobielus proclaimed that most EDW vendors will incorporate support for MapReduce's open source cousin Hadoop into the heart of their architectures to enable open, standards-based data analytics on massive amounts of data.

So, no more databases, just MapReduce? I'm not so sure. But don't misunderstand. It's not that MapReduce isn't an effective way to analyze data in some cases. The big names in Internet business are all using it - Facebook, Google, Amazon, eBay et al - so it must be good, right? But it's worth taking a more measured view based on both the technical and the practical business merits. I believe the two technologies are not mutually exclusive: they will work hand in hand and, in some cases, MapReduce will be integrated into the relational database (RDBMS).

Google certainly has proven that MapReduce excels at making sense out of the exabytes of unstructured data on the web, which it should, given that MapReduce was designed from the outset for manipulating very large data sets. MapReduce in this sense provides a way to put structure around unstructured data. We humans prefer structure; it's in our DNA. Without structure, we have no real way of adding value to the data. Unstructured data analytics is something of an oxymoron for a pattern-seeking hominid.

MapReduce helps us put structure around the unstructured so we can then make sense of it. It creates an environment in which a data analyst can write two simple functions, a "mapper" and a "reducer," to perform the actual data manipulation. The result is at once an analysis of the data that has just been mapped and summarized and a structure for the further analysis that will help provide insight into that data. Whether that further analysis should be done in a MapReduce environment might be the more appropriate question.
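To make the "two simple functions" concrete, here is a minimal single-process sketch of the pattern, using word counting as the task. The names `mapper`, `reducer` and `run_mapreduce` are illustrative assumptions and do not come from Hadoop or any other framework; a real deployment would spread the map, shuffle and reduce steps across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one line of raw text."""
    for word in line.lower().split():
        yield word, 1


def reducer(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Collapse all the counts emitted for one word into a single total."""
    return word, sum(counts)


def run_mapreduce(records: Iterable[str]) -> Dict[str, int]:
    """Hypothetical local driver: map, group by key, then reduce."""
    # Shuffle step: collect every emitted value under its key.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    # Reduce step: one reducer call per distinct key.
    return dict(reducer(key, values) for key, values in grouped.items())


lines = ["big data big hype", "big data small teams"]
word_counts = run_mapreduce(lines)
```

The output, a table of words and totals, is exactly the kind of structure the paragraph above describes: raw text goes in, and something an analyst can query further comes out.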

From an infrastructure standpoint, MapReduce excels where performance and scalability are challenges. Applications written using the MapReduce framework are automatically parallelized, which makes the framework well suited to a large infrastructure of connected machines. As it scales applications across clusters made up of many nodes, the MapReduce framework also provides built-in query fault tolerance, so that whatever hardware component might fail, a query will be completed by another machine. Further, MapReduce and its open source brethren can perform functions that are not possible in standard SQL, such as click-stream sessionization, nPath analysis and the production of graphs of potentially unbounded length.
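Click-stream sessionization, one of the functions mentioned above, means splitting each user's page hits into sessions wherever there is a long enough pause. The sketch below shows the logic in one process, under assumptions of my own: clicks arrive as `(user, timestamp_in_seconds)` pairs, and a 30-minute inactivity gap ends a session. In a MapReduce job, grouping by user would play the role of the map and shuffle steps, and `sessionize` would be the reducer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed inactivity timeout; the article does not specify one.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(timestamps: Iterable[int],
               gap: int = SESSION_GAP_SECONDS) -> List[List[int]]:
    """Split one user's click times into sessions on pauses longer than `gap`."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still inside the current session
        else:
            sessions.append([ts])     # pause too long: start a new session
    return sessions


def sessionize_clicks(clicks: Iterable[Tuple[str, int]],
                      gap: int = SESSION_GAP_SECONDS) -> Dict[str, List[List[int]]]:
    """Group clicks by user (the 'map/shuffle'), then sessionize each group."""
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user, ts in clicks:
        by_user[user].append(ts)
    return {user: sessionize(stamps, gap) for user, stamps in by_user.items()}


clicks = [("alice", 0), ("bob", 100), ("alice", 600), ("alice", 4000), ("bob", 200)]
sessions = sessionize_clicks(clicks)
```

Because each user's clicks can be processed independently of every other user's, this is the sort of work a framework can parallelize across machines without the analyst writing any distribution logic.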

What's not to love? At a basic level I believe the MapReduce framework is an inefficient way of analyzing data for the vast majority of businesses. The aforementioned capabilities of MapReduce are all well and good, provided you have a Google-like business replete with legions of programmers and vast amounts of server and memory capacity. Viewed from this perspective, it makes perfect sense that Google developed and used MapReduce: because it could. It had a huge and growing resource in its farms of custom-made servers, as well as armies of programmers constantly looking for new ways to take advantage of that seemingly infinite hardware (and the data collected on it), to do cool new things.

Similarly, the other high-profile adopters and advocates are also IT-savvy, IT-heavy companies and, like Google, have the means and ongoing incentive to get a MapReduce framework tailored to their particular needs and reap the benefits. Would a mid-size firm know how? It seems doubtful. While Google has claimed that MapReduce is easy to use, even for programmers without experience with distributed systems, I know from field experience with customers that it does, in fact, take some pretty experienced folks to make best use of it.

Projects like Hive, Google Sawzall and Yahoo Pig, and companies like Cloudera, all in essence attempt to make the MapReduce paradigm easier for less expert users and, in fact, to make it behave for the end user more like a parallel database. But this raises the question: Why? It seems to be a bit of re-inventing the wheel. IT-heavy is not how most businesses operate today, especially in these economic times. The dot-com bubble is long over. Hardware budgets are limited and few companies relish the idea of hiring teams of programming experts to maintain even a valuable IT asset such as their data warehouse. They'd rather buy an off-the-shelf tool designed from the ground up to do high-speed data analytics.

Like MapReduce, commercially available massively parallel processing databases specifically built for rapid, high volume data analytics will provide immense data scale and query fault tolerance. They also have a proven track record of customer deployments and deliver equal if not better performance on Big Data problems. Perhaps as important, today's next-generation MPP analytic databases give businesses the flexibility to draw on a deep pool of IT labor skilled in established conventions such as SQL.
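As a rough illustration of the SQL point, the snippet below counts words with a single declarative query. It uses Python's built-in `sqlite3` module purely as a small stand-in; SQLite is not an MPP analytic database, and the `words` table and its contents are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for an analytic database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w,) for w in "big data big hype big data small teams".split()],
)

# One GROUP BY expresses the whole aggregation; the engine decides how to run it.
rows = conn.execute(
    "SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY n DESC, word"
).fetchall()
conn.close()
```

The analyst states what result is wanted and leaves the execution plan to the database, which is the skill set the paragraph above describes as widely available.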

As mentioned earlier, unstructured data seems like a natural fit for MapReduce analysis. A rising tide of chatter is focused on the increasing problem - and importance - of unstructured data. There is more than a bit of truth to this. As the Internet of everything becomes more and more a reality, data is generated everywhere; but our experience to date is that businesses are most interested in data derived from the transactional systems on which they have built their businesses, where structure is a given.

Another difficulty faces companies even as MapReduce becomes more integrated into the overall enterprise data analysis strategy. MapReduce is a framework. As the hype and interest have grown, MapReduce solutions are being created by database vendors in entirely non-standard and incompatible ways. This will further limit the likelihood that it will become the centerpiece of an EDW. Business has demonstrated time and again that it prefers open standards and interoperability.

Finally, I believe a move toward a programmer-centric approach to data analysis is both inefficient and contrary to all other prevailing trends of technology use in the enterprise. From the mobile workforce to the rise of social enterprise computing, the momentum is away from hierarchy. I believe this trend is the only way the problem of making Big Data actionable will be effectively addressed. In his classic book on the virtues of open source programming, The Cathedral and the Bazaar, Eric S. Raymond put forth the idea that open source was an effective way to address the complexity and density of information inherent in developing good software code. His proposition, "given enough eyeballs, all bugs are shallow," could easily be restated for Big Data as, "given enough analysts, all trends are apparent." The trick is - and really always has been - to get more people looking at the data. You don't achieve that end by centering your data analytics efforts on a tool largely geared to the skills of technical wizards.

MapReduce-type solutions as they currently exist are most effective when utilized by programmer-led organizations focused on maximizing their growing IT assets. For most businesses seeking the most efficient way to quickly turn their most valuable data into revenue generating insight, MPP databases will likely continue to hold sway, even as MapReduce-based solutions find a supporting role.

More Stories By Roger Gaskell

Roger Gaskell, CTO of Kognitio, has overall responsibility for all product development. He has been instrumental in all generations of the WX and WX2 database products to date, including their evolution from a database application running on proprietary hardware to a software-only analytical database built on industry-standard blade servers.

Prior to Kognitio, Roger was test and development manager at AB Electronics for five years. During this time his primary responsibilities were the famous BBC Micro Computer and the development and testing of the first mass-produced personal computers for IBM.


