Microservices Expo: Article

Deduplication: When, Where and How

Deduplication gives you the ability to do more with less

Nearly every enterprise can benefit from deduplication. Business data has been growing exponentially. Routine backups of that data have become too costly or simply ineffective. Deduplication can help by reducing the cost of primary and secondary storage. Essentially, limited resources are made much more effective and efficient.

What most organizations don't realize is how much deduplication technology has matured. Originally, deduplication was used as an alternative to tape for backup and disaster recovery. This use case continues today and has become one of the predominant solutions for data protection. As it has matured, it has begun to evolve from being a point solution at the end of a backup chain (the target) to a player in every step of the backup process: at the client side, at the network side, at the media server side, as well as at the target device. Backup and storage vendors are implementing this technology in all aspects of their solutions.

Storage vendors have also recognized the efficiencies available by deduping data. In addition to implementing space efficiency technologies in their storage arrays, they're offering deduplication as a way to both improve available capacity and optimize data transmission when replicating data.

Versatile Technology
With these advancements it's possible to leverage deduplication to solve a variety of storage problems. In the data protection space, IT departments face increasing pressure to offer faster backups, even faster restores, and to do them with fewer resources than in the past. Data protection solutions that offer deduplication can, at the very least, significantly reduce the cost of protection to disk - often by more than 20x.

However, and perhaps more important, recovering lost information from these solutions is typically a lot faster than legacy tape solutions. A properly designed data protection solution that leverages deduplication can often either completely eliminate tape, or relegate tape to an archival medium. In addition, many companies using such a solution are able to replicate all of their backup data from one site to another. This eliminates the need for third-party tape handling and greatly improves the recoverability of the enterprise's data.

For enterprises that employ a replication strategy, deduplication can offer significant efficiencies depending on the data being replicated. If the data has a great deal of repetition or commonality, dedupe can offer tremendous boosts in performance. However, if the data is not very repetitious, deduplication will not offer as great an improvement. For most replication types, enterprises can expect a 2x to 4x reduction in bandwidth requirements.

More and more, storage vendors are offering deduplication on primary storage. Primary storage dedupe is a good idea when the data that is being stored has a lot of commonality - in other words, similar data being stored in one location. A good example of this is virtual environments. In such a situation, virtual machines are being stored as big files. Each has a lot in common - the operating system, unused blank space and, in many cases, the applications themselves. Disk devices that can do primary storage deduplication would be able to reduce all of this data to a single instance. Regardless of the hypervisor used - VMware, Hyper-V and so on - there is a huge amount of commonality among the virtual machine instances. In fact, it's common to be able to reduce storage requirements in virtual environments by over 80% through deduplication.
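The article quotes savings both as ratios (20x, 2x to 4x) and as percentages (over 80%). The short sketch below converts between the two so the figures can be compared. The function names are illustrative assumptions, not part of any vendor tool, and the arithmetic assumes a ratio means "original size divided by stored size."

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of space saved for a reduction ratio (e.g. 20 for 20x)."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Reduction ratio implied by a fractional saving (e.g. 0.80 for 80%)."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


if __name__ == "__main__":
    for ratio in (2, 4, 20):
        print(f"{ratio}x reduction -> {savings_from_ratio(ratio):.0%} saved")
    print(f"80% saved -> {ratio_from_savings(0.80):.1f}x reduction")
```

Under that assumption, a 2x to 4x bandwidth reduction corresponds to 50% to 75% less data on the wire, a 20x backup reduction to 95% less disk, and an 80% saving on virtual machine storage to a 5x ratio.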

Other primary environments, however, don't present a lot of common data, and thus will not benefit from deduplication. What's more, the process of uncovering which blocks of data have been seen before is expensive in both compute resources and I/O bandwidth. Both of those are at a premium in storage array controllers. A knowledgeable designer will typically look at the application type, the data type, and the resources available on the storage array that's doing the dedupe. Once all of these variables are factored together, it's possible to decide if it makes sense to use deduplication on primary storage.

Technique Pros and Cons
There's a lot to consider when designing a deduplication strategy, and many of the decisions are fairly nuanced. For example, the two most common techniques for performing deduplication are hashing and delta differencing. Backup appliances use one or the other, or in some instances a hybrid of the two. Which is the preferred technique depends on who you're talking to.

At a high level, hashing and delta differencing are very similar. The net effect of both is that common patterns of data are reduced and you end up with a greatly reduced storage requirement. The difference is in how you determine if there is a common pattern of data. With hashing implementations, the vendors run small blocks of the data through a mathematical algorithm (a hash function) and use the result to determine whether they have seen the same data before. In theory, this computation does not offer 100% certainty about whether a piece of data has been seen before. However, statistically it is almost a certainty - so much so that you'd be more likely to win the mega lottery - dozens of times in a row. The consensus is that this is good enough, and most vendors have used hashing to develop their solutions.
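As a rough illustration of the hashing approach described above, here is a minimal sketch. The fixed 4 KiB block size, the choice of SHA-256, and the `HashDedupStore` class are all assumptions made for this example; real products differ in block sizing, hash choice, and on-disk layout.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


class HashDedupStore:
    """Hypothetical in-memory store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def write(self, data: bytes) -> list:
        """Store data and return its recipe: the ordered list of block digests."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            # A digest already in the index is treated as "seen before",
            # so the block itself is not stored a second time.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read(self, recipe) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.blocks[digest] for digest in recipe)

    def stored_bytes(self) -> int:
        """Bytes actually held after deduplication."""
        return sum(len(block) for block in self.blocks.values())
```

Writing ten identical blocks followed by one different block would produce a recipe of eleven digests while the store holds only two blocks. Note that the sketch trusts the digest alone, which is exactly the "almost a certainty" tradeoff the paragraph describes.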

For reasons involving technical implementation, performance tradeoffs, and arguably higher reliability, some vendors have chosen to develop their solutions through delta differencing. With this technology, each small piece of data is actually compared, bit for bit, with everything that has been seen before. This establishes with certainty whether the data has already been seen.
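The sketch below illustrates only the bit-for-bit comparison step described above; it is not a model of any vendor's delta differencing engine. The `VerifyingDedupStore` class is hypothetical, and the CRC32 lookup is an assumption added purely to narrow which stored blocks get compared. The decision that two blocks match is always made by comparing the bytes themselves.

```python
import zlib


class VerifyingDedupStore:
    """Hypothetical store that confirms every match by comparing actual bytes."""

    def __init__(self):
        self.blocks = []      # unique blocks, addressed by list position
        self.candidates = {}  # crc32 -> positions of blocks sharing that checksum

    def add(self, block: bytes) -> int:
        """Return the position of the block, storing it only if it is new."""
        key = zlib.crc32(block)  # cheap filter only; never trusted on its own
        for index in self.candidates.get(key, []):
            if self.blocks[index] == block:  # full byte-for-byte comparison
                return index
        self.blocks.append(block)
        index = len(self.blocks) - 1
        self.candidates.setdefault(key, []).append(index)
        return index
```

Because a checksum match is always followed by a full comparison, two different blocks can never be mistaken for each other here, at the price of reading stored data back for every candidate match.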

Regardless of the implementation used, the odds are more in favor of external failures - power outage, water damage, a satellite falling on the data center - than of a failure in the technologies that determine how data bits are identified as the same. In most deduplication designs it's more important to focus on the features and functionality of the overall solution, rather than this specific level of detail.
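To put a rough number on the odds of a hash-based false match, the sketch below computes the standard birthday upper bound. It assumes an ideal hash whose outputs are uniformly random, and the block count and hash width in the usage note are illustrative figures, not taken from any product.

```python
from fractions import Fraction


def collision_upper_bound(num_blocks: int, hash_bits: int) -> Fraction:
    """Birthday (union) bound: P(any two blocks collide) <= C(n, 2) / 2**bits."""
    pairs = num_blocks * (num_blocks - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


if __name__ == "__main__":
    # Illustrative assumption: 2**40 unique 4 KiB blocks (4 PiB) and a 256-bit hash.
    bound = collision_upper_bound(2 ** 40, 256)
    print(f"collision probability is at most about {float(bound):.1e}")
```

For those assumed figures the bound is a little under 2^-177, which is why the choice between hashing and bit-for-bit comparison rarely dominates the overall risk picture.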

Another topic has to do with the timing of deduplication. Inline deduplication processes dedupe backup data in real time, as it's received at the front end of the Virtual Tape Library (VTL) or Disk-to-Disk (D2D) device. Post-process methods, on the other hand, remove duplicate data after the backup has completed. Regardless of which method is used, the same amount of work is being done.

The question of whether it makes more sense to do inline or post-process deduplication can best be answered by, "it depends." Regardless of when you do it, deduplication is inherently an expensive thing to do in terms of CPU and I/O resources. Choosing between inline and post-process is essentially choosing between paying for the service upfront or after. With some vendors' technologies, you have no choice. You have to use either inline or post. With others you get the choice, although it's something of a black art to figure out when to best use one versus the other.
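The "pay upfront or pay after" tradeoff can be sketched as follows. Both hypothetical targets call the same `dedupe` helper; the only difference is when it runs. The class names, the in-memory staging list, and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib


def dedupe(blocks, index):
    """Add blocks to index (digest -> block) and return the recipe of digests."""
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        index.setdefault(digest, block)
        recipe.append(digest)
    return recipe


class InlineTarget:
    """Deduplicates during ingest: slower landing, no staging space needed."""

    def __init__(self):
        self.index = {}
        self.recipes = []

    def ingest(self, blocks):
        self.recipes.append(dedupe(blocks, self.index))


class PostProcessTarget:
    """Lands raw data first, then deduplicates in a later pass."""

    def __init__(self):
        self.index = {}
        self.recipes = []
        self.staging = []  # raw backups waiting to be deduplicated

    def ingest(self, blocks):
        self.staging.append(list(blocks))  # fast landing, no hashing yet

    def run_post_process(self):
        while self.staging:
            self.recipes.append(dedupe(self.staging.pop(0), self.index))
```

After `run_post_process()` completes, both targets hold the same index and recipes for the same input. The post-process target needed temporary room for the full, undeduplicated backup, while the inline target did its hashing during the backup window.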

Typically it comes down to weighing the speed of ingest (how fast you get the data into the device) against the speed of rehydration (how fast you get the data back), and striking a balance between the two. The best recommendation is to work with someone who has earned the scar tissue from using both of these technologies.

Achieving Maximum Efficiency
Now that deduplication is so prevalent, the challenge most of our customers face is identifying which one to use and when. This is particularly difficult since each vendor unequivocally states that their solution is better than everyone else's and is the "one true way." In reality, there are no simple black and white answers and each solution's merits must be weighed individually.

To develop the best possible deduplication solution, it's important to first determine the problem you're trying to solve. Conduct an internal analysis, and then approach a partner who has an unbiased approach to solving the issue at hand. The right partner can help you sort through the hype and identify solutions and best practices that will align with your business needs.

The benefits of deduplication are many. Capital expenses are greatly reduced; you need fewer disks, less tape, and less bandwidth to accomplish the same task. If used appropriately, deduplication will also improve your operational efficiencies, which you can then leverage to reduce your operational expenses.

Simply put, deduplication gives you the ability to do more with less. Whether in networking, primary storage, backup or archival data protection, a well-designed deduplication solution can help you mitigate the challenges of big data - and keep your IT landscape lean, fast and efficient.

More Stories By Juan Orlandini

A practice manager for Datalink, Juan Orlandini is a 25+ year veteran of the open systems IT industry. Throughout his career, he has been involved in the design and deployment of many large and advanced storage, data protection, and high availability infrastructures.

Juan evaluates next-generation technologies for Datalink and also works with end users, assisting them with architecting and implementing strategic data center architectures. In his current role, he is developing managed services offerings designed to help companies optimize staff productivity and data center efficiency. He continues to evaluate industry solutions and customer needs, and blogs about them at blog.datalink.com.
