Welcome!

Microservices Expo Authors: Chris Schwarz, Todd Matters, Elizabeth White, Pat Romanski, Stackify Blog

Related Topics: Microservices Expo, @CloudExpo

Microservices Expo: Article

Deduplication: When, Where and How

Deduplication gives you the ability to do more with less

Nearly every enterprise can benefit from deduplication. Business data has been growing exponentially. Routine backups of that data have become too costly or simply ineffective. Deduplication can help by reducing the cost of primary and secondary storage. Essentially, limited resources are made much more effective and efficient.

What most organizations don't realize is how much deduplication technology has matured. Originally, deduplication was used as an alternative to tape for backup and disaster recovery. This user case continues today and has become one of the predominant solutions for data protection. As it has matured, it has begun to evolve from being a point solution at the end of a backup chain (the target) to a player in every step of the backup process: at the client side, at the network side, at the media server side, as well as at the target device. Backup and storage vendors are implementing this technology in all aspects of their solutions.

Storage vendors have also recognized the efficiencies available by deduping data. In addition to implementing space efficiency technologies in their storage arrays, they're offering deduplication as a way to both improve available capacity and optimize data transmission when replicating data.

Versatile Technology
With these advancements it's possible to leverage deduplication to solve a variety of storage problems. In the data protection space, IT departments face increasing pressure to offer faster backups, even faster restores, and to do them with fewer resources than in the past. Data protection solutions that offer deduplication can, at the very least, significantly reduce the cost of protection to disk - often by more than a 20x reduction.

However, and perhaps more important, recovering lost information from these solutions is typically a lot faster than legacy tape solutions. A properly designed data protection solution that leverages deduplication can often either completely eliminate tape, or relegate tape to an archival medium. In addition, many companies using such a solution are able to replicate all of their backup data from one site to another. This eliminates the need for third-party tape handling and greatly improves the recoverability of the enterprise's data.

For enterprises that employ a replication strategy, deduplication can offer significant efficiencies depending on the data being replicated. If the data has a great deal of repetition or commonality, dedupe can offer tremendous boosts in performance. However, if the data is not very repetitious, deduplication will not offer as great an improvement. For most replication types, enterprises can expect a 2x to 4x reduction in bandwidth requirements.

More and more, storage vendors are offering deduplication on primary storage. Primary storage dedupe is a good idea when the data that is being stored has a lot of commonality - in other words, similar data being stored in one location. A good example of this is virtual environments. In such a situation, virtual machines are being stored as big files. Each has a lot in common - the operating system, unused blank space and, in many case, the applications themselves. Disk devices that can do primary storage deduplication would be able to reduce all of this data to a single instance. Regardless of the hypervisor used - VMware, HyperV and so on - there is a huge amount of commonality between each of the virtual machine instances. In fact, it's common to be able to reduce storage requirements in virtual environments by over 80% through deduplication.

Other primary environments, however, don't present a lot of common data, and thus will not benefit from deduplication. What's more, the process of uncovering which blocks of data have been seen before is expensive in both compute resources and I/O bandwidth. Both of those are at a premium in storage array controllers. A knowledgeable designer will typically look at the application type, the data type, and the resources available on the storage array that's doing the dedupe. Once all of these variables are factored together, it's possible to decide if it makes sense to use deduplication on primary storage.

Technique Pros and Cons
While there's a lot to consider when designing a deduplication strategy, a lot of the decisions are fairly nuanced. For example, the two most common techniques for performing deduplication are hashing and delta differencing. Backup appliances use one or the other, or in some instances a hybrid of the two. Which is the preferred technique depends on who you're talking to.

At a high level, hashing and delta differencing are very similar. The net effect of both is that common patterns of data are reduced and you end up with a greatly reduced storage requirement. The difference is in how you determine if there is a common pattern of data. With hashing implementations, the vendors run small blocks of the data through a mathematical algorithm and compute whether they have seen the same data before. This computation theoretically does not offer 100% certainty whether or not a piece of data has been seen before. However, statistically it is almost a certainty - so much so that you'd be more likely to win the mega lottery - dozens of times in a row. The consensus is that this is good enough, and most vendors have used hashing to develop their solutions.

For reasons involving technical implementation, performance tradeoffs, and arguably higher reliability, some vendors have chosen to develop their solutions through delta differencing. With this technology, each small piece of data is actually compared, bit for bit, with everything that has been seen before. This guarantees that the data has or has not already been seen.

Regardless of the implementation used, the odds are more in favor of external failures s power outage, water damage, satellite falling on the data center - than on technologies that determine how data bits are identified as the same. In most deduplication designs it's more important to focus on the features and functionality of the overall solution, rather than this specific level of detail.

Another topic has to do with the timing of deduplication. Inline deduplication processes dedupe backup data in real-time, as it's received at the front end of the Virtual Tape Library (VTL) or Disk-to-Disk (D2D) device. Post-process methods, on the other hand, remove duplicate data after the backup has completed. Regardless of which method is used, the same amount of work is being done.

The question of whether it makes more sense to do inline or post-process deduplication can best be answered by, "it depends." Regardless of when you do it, deduplication is inherently an expensive thing to do in terms of CPU and I/O resources. Choosing between inline and post-process is essentially choosing between paying for the service upfront or after. With some vendors' technologies, you have no choice. You have to use either inline or post. With others you get the choice, although it's something of a black art to figure out when to best use one versus the other.

Typically it comes down to optimizing the speed of ingest (how fast you get the data into the device) with rehydration (how fast you get the data back), and striking a balance between the two. Your best recommendation is to work with someone who has earned the scar tissue from using both these technologies.

Achieving Maximum Efficiency
Now that deduplication is so prevalent, the challenge most of our customers face is identifying which one to use and when. This is particularly difficult since each vendor unequivocally states that their solution is better than everyone else's and is the "one true way." In reality, there are no simple black and white answers and each solution's merits must be weighed individually.

To develop the best possible deduplication solution, it's important to first determine the problem you're trying to solve. Conduct an internal analysis, and then approach a partner who has an unbiased approach to solving the issue at hand. The right partner can help you sort through the hype and identify solutions and best practices that will align with your business needs.

The benefits of deduplication are many. Capital expenses are greatly reduced; you need fewer disks, less tape, and less bandwidth to accomplish the same task. If used appropriately, deduplication will also improve your operational efficiencies, which you can then leverage to reduce your operational expenses.

Simply put, deduplication gives you the ability to do more with less. Whether in networking, primary storage, backup or for data archival protection, a well-designed deduplication solution can help you mitigate the challenges of big data - and keep your IT landscape lean, fast and efficient.

More Stories By Juan Orlandini

A practice manager for Datalink, Juan Orlandini is a 25+ year veteran of the open systems IT industry. Throughout his career, he has been involved in the design and deployment of many large and advanced storage, data protection, and high availability infrastructures.

Juan evaluates next-generation technologies for Datalink and also works with end users, assisting them with architecting and implementing strategic data center architectures. In his current role, he is developing managed services offerings designed to help companies optimize staff productivity and data center efficiency. He continues to evaluate industry solutions, customer needs, and blogs about it at blog.datalink.com

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@MicroservicesExpo Stories
Colocation is a central pillar of modern enterprise infrastructure planning because it provides greater control, insight, and performance than managed platforms. In spite of the inexorable rise of the cloud, most businesses with extensive IT hardware requirements choose to host their infrastructure in colocation data centers. According to a recent IDC survey, more than half of the businesses questioned use colocation services, and the number is even higher among established businesses and busine...
For most organizations, the move to hybrid cloud is now a question of when, not if. Fully 82% of enterprises plan to have a hybrid cloud strategy this year, according to Infoholic Research. The worldwide hybrid cloud computing market is expected to grow about 34% annually over the next five years, reaching $241.13 billion by 2022. Companies are embracing hybrid cloud because of the many advantages it offers compared to relying on a single provider for all of their cloud needs. Hybrid offers bala...
@DevOpsSummit at Cloud Expo taking place Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center, Santa Clara, CA, is co-located with the 21st International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is ...
"We are a monitoring company. We work with Salesforce, BBC, and quite a few other big logos. We basically provide monitoring for them, structure for their cloud services and we fit into the DevOps world" explained David Gildeh, Co-founder and CEO of Outlyer, in this SYS-CON.tv interview at DevOps Summit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
For organizations that have amassed large sums of software complexity, taking a microservices approach is the first step toward DevOps and continuous improvement / development. Integrating system-level analysis with microservices makes it easier to change and add functionality to applications at any time without the increase of risk. Before you start big transformation projects or a cloud migration, make sure these changes won’t take down your entire organization.
"When we talk about cloud without compromise what we're talking about is that when people think about 'I need the flexibility of the cloud' - it's the ability to create applications and run them in a cloud environment that's far more flexible,” explained Matthew Finnie, CTO of Interoute, in this SYS-CON.tv interview at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
Microservices are increasingly used in the development world as developers work to create larger, more complex applications that are better developed and managed as a combination of smaller services that work cohesively together for larger, application-wide functionality. Tools such as Service Fabric are rising to meet the need to think about and build apps using a piece-by-piece methodology that is, frankly, less mind-boggling than considering the whole of the application at once. Today, we'll ...
What's the role of an IT self-service portal when you get to continuous delivery and Infrastructure as Code? This general session showed how to create the continuous delivery culture and eight accelerators for leading the change. Don Demcsak is a DevOps and Cloud Native Modernization Principal for Dell EMC based out of New Jersey. He is a former, long time, Microsoft Most Valuable Professional, specializing in building and architecting Application Delivery Pipelines for hybrid legacy, and cloud ...
21st International Cloud Expo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Me...
In his session at Cloud Expo, Alan Winters, an entertainment executive/TV producer turned serial entrepreneur, presented a success story of an entrepreneur who has both suffered through and benefited from offshore development across multiple businesses: The smart choice, or how to select the right offshore development partner Warning signs, or how to minimize chances of making the wrong choice Collaboration, or how to establish the most effective work processes Budget control, or how to ma...
Cloud Expo, Inc. has announced today that Andi Mann and Aruna Ravichandran have been named Co-Chairs of @DevOpsSummit at Cloud Expo Silicon Valley which will take place Oct. 31-Nov. 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. "DevOps is at the intersection of technology and business-optimizing tools, organizations and processes to bring measurable improvements in productivity and profitability," said Aruna Ravichandran, vice president, DevOps product and solutions marketing...
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business - from apparel to energy - is being rewritten by software. From planning to development to management to security, CA creates software that fuels transformation for companies in the applic...
There's a lot to gain from cloud computing, but success requires a thoughtful and enterprise focused approach. Cloud computing decouples data and information from the infrastructure on which it lies. A process that is a LOT more involved than dragging some folders from your desktop to a shared drive. Cloud computing as a mission transformation activity, not a technological one. As an organization moves from local information hosting to the cloud, one of the most important challenges is addressi...
In the decade following his article, cloud computing further cemented Carr’s perspective. Compute, storage, and network resources have become simple utilities, available at the proverbial turn of the faucet. The value they provide is immense, but the cloud playing field is amazingly level. Carr’s quote above presaged the cloud to a T. Today, however, we’re in the digital era. Mark Andreesen’s ‘software is eating the world’ prognostication is coming to pass, as enterprises realize they must be...
Hybrid IT is today’s reality, and while its implementation may seem daunting at times, more and more organizations are migrating to the cloud. In fact, according to SolarWinds 2017 IT Trends Index: Portrait of a Hybrid IT Organization 95 percent of organizations have migrated crucial applications to the cloud in the past year. As such, it’s in every IT professional’s best interest to know what to expect.
A common misconception about the cloud is that one size fits all. Companies expecting to run all of their operations using one cloud solution or service must realize that doing so is akin to forcing the totality of their business functionality into a straightjacket. Unlocking the full potential of the cloud means embracing the multi-cloud future where businesses use their own cloud, and/or clouds from different vendors, to support separate functions or product groups. There is no single cloud so...
Both SaaS vendors and SaaS buyers are going “all-in” to hyperscale IaaS platforms such as AWS, which is disrupting the SaaS value proposition. Why should the enterprise SaaS consumer pay for the SaaS service if their data is resident in adjacent AWS S3 buckets? If both SaaS sellers and buyers are using the same cloud tools, automation and pay-per-transaction model offered by IaaS platforms, then why not host the “shrink-wrapped” software in the customers’ cloud? Further, serverless computing, cl...
The taxi industry never saw Uber coming. Startups are a threat to incumbents like never before, and a major enabler for startups is that they are instantly “cloud ready.” If innovation moves at the pace of IT, then your company is in trouble. Why? Because your data center will not keep up with frenetic pace AWS, Microsoft and Google are rolling out new capabilities. In his session at 20th Cloud Expo, Don Browning, VP of Cloud Architecture at Turner, posited that disruption is inevitable for comp...
Companies have always been concerned that traditional enterprise software is slow and complex to install, often disrupting critical and time-sensitive operations during roll-out. With the growing need to integrate new digital technologies into the enterprise to transform business processes, this concern has become even more pressing. A 2016 Panorama Consulting Solutions study revealed that enterprise resource planning (ERP) projects took an average of 21 months to install, with 57 percent of th...
In 2014, Amazon announced a new form of compute called Lambda. We didn't know it at the time, but this represented a fundamental shift in what we expect from cloud computing. Now, all of the major cloud computing vendors want to take part in this disruptive technology. In his session at 20th Cloud Expo, Doug Vanderweide, an instructor at Linux Academy, discussed why major players like AWS, Microsoft Azure, IBM Bluemix, and Google Cloud Platform are all trying to sidestep VMs and containers wit...