Welcome!

Microservices Expo Authors: Liz McMillan, Elizabeth White, Carmen Gonzalez, Pat Romanski, Sematext Blog

Related Topics: Microservices Expo, @CloudExpo

Microservices Expo: Article

Deduplication: When, Where and How

Deduplication gives you the ability to do more with less

Nearly every enterprise can benefit from deduplication. Business data has been growing exponentially. Routine backups of that data have become too costly or simply ineffective. Deduplication can help by reducing the cost of primary and secondary storage. Essentially, limited resources are made much more effective and efficient.

What most organizations don't realize is how much deduplication technology has matured. Originally, deduplication was used as an alternative to tape for backup and disaster recovery. This user case continues today and has become one of the predominant solutions for data protection. As it has matured, it has begun to evolve from being a point solution at the end of a backup chain (the target) to a player in every step of the backup process: at the client side, at the network side, at the media server side, as well as at the target device. Backup and storage vendors are implementing this technology in all aspects of their solutions.

Storage vendors have also recognized the efficiencies available by deduping data. In addition to implementing space efficiency technologies in their storage arrays, they're offering deduplication as a way to both improve available capacity and optimize data transmission when replicating data.

Versatile Technology
With these advancements it's possible to leverage deduplication to solve a variety of storage problems. In the data protection space, IT departments face increasing pressure to offer faster backups, even faster restores, and to do them with fewer resources than in the past. Data protection solutions that offer deduplication can, at the very least, significantly reduce the cost of protection to disk - often by more than a 20x reduction.

However, and perhaps more important, recovering lost information from these solutions is typically a lot faster than legacy tape solutions. A properly designed data protection solution that leverages deduplication can often either completely eliminate tape, or relegate tape to an archival medium. In addition, many companies using such a solution are able to replicate all of their backup data from one site to another. This eliminates the need for third-party tape handling and greatly improves the recoverability of the enterprise's data.

For enterprises that employ a replication strategy, deduplication can offer significant efficiencies depending on the data being replicated. If the data has a great deal of repetition or commonality, dedupe can offer tremendous boosts in performance. However, if the data is not very repetitious, deduplication will not offer as great an improvement. For most replication types, enterprises can expect a 2x to 4x reduction in bandwidth requirements.

More and more, storage vendors are offering deduplication on primary storage. Primary storage dedupe is a good idea when the data that is being stored has a lot of commonality - in other words, similar data being stored in one location. A good example of this is virtual environments. In such a situation, virtual machines are being stored as big files. Each has a lot in common - the operating system, unused blank space and, in many case, the applications themselves. Disk devices that can do primary storage deduplication would be able to reduce all of this data to a single instance. Regardless of the hypervisor used - VMware, HyperV and so on - there is a huge amount of commonality between each of the virtual machine instances. In fact, it's common to be able to reduce storage requirements in virtual environments by over 80% through deduplication.

Other primary environments, however, don't present a lot of common data, and thus will not benefit from deduplication. What's more, the process of uncovering which blocks of data have been seen before is expensive in both compute resources and I/O bandwidth. Both of those are at a premium in storage array controllers. A knowledgeable designer will typically look at the application type, the data type, and the resources available on the storage array that's doing the dedupe. Once all of these variables are factored together, it's possible to decide if it makes sense to use deduplication on primary storage.

Technique Pros and Cons
While there's a lot to consider when designing a deduplication strategy, a lot of the decisions are fairly nuanced. For example, the two most common techniques for performing deduplication are hashing and delta differencing. Backup appliances use one or the other, or in some instances a hybrid of the two. Which is the preferred technique depends on who you're talking to.

At a high level, hashing and delta differencing are very similar. The net effect of both is that common patterns of data are reduced and you end up with a greatly reduced storage requirement. The difference is in how you determine if there is a common pattern of data. With hashing implementations, the vendors run small blocks of the data through a mathematical algorithm and compute whether they have seen the same data before. This computation theoretically does not offer 100% certainty whether or not a piece of data has been seen before. However, statistically it is almost a certainty - so much so that you'd be more likely to win the mega lottery - dozens of times in a row. The consensus is that this is good enough, and most vendors have used hashing to develop their solutions.

For reasons involving technical implementation, performance tradeoffs, and arguably higher reliability, some vendors have chosen to develop their solutions through delta differencing. With this technology, each small piece of data is actually compared, bit for bit, with everything that has been seen before. This guarantees that the data has or has not already been seen.

Regardless of the implementation used, the odds are more in favor of external failures s power outage, water damage, satellite falling on the data center - than on technologies that determine how data bits are identified as the same. In most deduplication designs it's more important to focus on the features and functionality of the overall solution, rather than this specific level of detail.

Another topic has to do with the timing of deduplication. Inline deduplication processes dedupe backup data in real-time, as it's received at the front end of the Virtual Tape Library (VTL) or Disk-to-Disk (D2D) device. Post-process methods, on the other hand, remove duplicate data after the backup has completed. Regardless of which method is used, the same amount of work is being done.

The question of whether it makes more sense to do inline or post-process deduplication can best be answered by, "it depends." Regardless of when you do it, deduplication is inherently an expensive thing to do in terms of CPU and I/O resources. Choosing between inline and post-process is essentially choosing between paying for the service upfront or after. With some vendors' technologies, you have no choice. You have to use either inline or post. With others you get the choice, although it's something of a black art to figure out when to best use one versus the other.

Typically it comes down to optimizing the speed of ingest (how fast you get the data into the device) with rehydration (how fast you get the data back), and striking a balance between the two. Your best recommendation is to work with someone who has earned the scar tissue from using both these technologies.

Achieving Maximum Efficiency
Now that deduplication is so prevalent, the challenge most of our customers face is identifying which one to use and when. This is particularly difficult since each vendor unequivocally states that their solution is better than everyone else's and is the "one true way." In reality, there are no simple black and white answers and each solution's merits must be weighed individually.

To develop the best possible deduplication solution, it's important to first determine the problem you're trying to solve. Conduct an internal analysis, and then approach a partner who has an unbiased approach to solving the issue at hand. The right partner can help you sort through the hype and identify solutions and best practices that will align with your business needs.

The benefits of deduplication are many. Capital expenses are greatly reduced; you need fewer disks, less tape, and less bandwidth to accomplish the same task. If used appropriately, deduplication will also improve your operational efficiencies, which you can then leverage to reduce your operational expenses.

Simply put, deduplication gives you the ability to do more with less. Whether in networking, primary storage, backup or for data archival protection, a well-designed deduplication solution can help you mitigate the challenges of big data - and keep your IT landscape lean, fast and efficient.

More Stories By Juan Orlandini

A practice manager for Datalink, Juan Orlandini is a 25+ year veteran of the open systems IT industry. Throughout his career, he has been involved in the design and deployment of many large and advanced storage, data protection, and high availability infrastructures.

Juan evaluates next-generation technologies for Datalink and also works with end users, assisting them with architecting and implementing strategic data center architectures. In his current role, he is developing managed services offerings designed to help companies optimize staff productivity and data center efficiency. He continues to evaluate industry solutions, customer needs, and blogs about it at blog.datalink.com

Comments (0)

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


@MicroservicesExpo Stories
Today we can collect lots and lots of performance data. We build beautiful dashboards and even have fancy query languages to access and transform the data. Still performance data is a secret language only a couple of people understand. The more business becomes digital the more stakeholders are interested in this data including how it relates to business. Some of these people have never used a monitoring tool before. They have a question on their mind like “How is my application doing” but no id...
Get deep visibility into the performance of your databases and expert advice for performance optimization and tuning. You can't get application performance without database performance. Give everyone on the team a comprehensive view of how every aspect of the system affects performance across SQL database operations, host server and OS, virtualization resources and storage I/O. Quickly find bottlenecks and troubleshoot complex problems.
In his session at 19th Cloud Expo, Claude Remillard, Principal Program Manager in Developer Division at Microsoft, contrasted how his team used config as code and immutable patterns for continuous delivery of microservices and apps to the cloud. He showed how the immutable patterns helps developers do away with most of the complexity of config as code-enabling scenarios such as rollback, zero downtime upgrades with far greater simplicity. He also demoed building immutable pipelines in the cloud ...
@DevOpsSummit taking place June 6-8, 2017 at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @DevOpsSummit at Cloud Expo New York Call for Papers is now open.
In IT, we sometimes coin terms for things before we know exactly what they are and how they’ll be used. The resulting terms may capture a common set of aspirations and goals – as “cloud” did broadly for on-demand, self-service, and flexible computing. But such a term can also lump together diverse and even competing practices, technologies, and priorities to the point where important distinctions are glossed over and lost.
Without lifecycle traceability and visibility across the tool chain, stakeholders from Planning-to-Ops have limited insight and answers to who, what, when, why and how across the DevOps lifecycle. This impacts the ability to deliver high quality software at the needed velocity to drive positive business outcomes. In his session at @DevOpsSummit 19th Cloud Expo, Eric Robertson, General Manager at CollabNet, showed how customers are able to achieve a level of transparency that enables everyone fro...
Monitoring of Docker environments is challenging. Why? Because each container typically runs a single process, has its own environment, utilizes virtual networks, or has various methods of managing storage. Traditional monitoring solutions take metrics from each server and applications they run. These servers and applications running on them are typically very static, with very long uptimes. Docker deployments are different: a set of containers may run many applications, all sharing the resource...
Join Impiger for their featured webinar: ‘Cloud Computing: A Roadmap to Modern Software Delivery’ on November 10, 2016, at 12:00 pm CST. Very few companies have not experienced some impact to their IT delivery due to the evolution of cloud computing. This webinar is not about deciding whether you should entertain moving some or all of your IT to the cloud, but rather, a detailed look under the hood to help IT professionals understand how cloud adoption has evolved and what trends will impact th...
Information technology is an industry that has always experienced change, and the dramatic change sweeping across the industry today could not be truthfully described as the first time we've seen such widespread change impacting customer investments. However, the rate of the change, and the potential outcomes from today's digital transformation has the distinct potential to separate the industry into two camps: Organizations that see the change coming, embrace it, and successful leverage it; and...
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
"Dice has been around for the last 20 years. We have been helping tech professionals find new jobs and career opportunities," explained Manish Dixit, VP of Product and Engineering at Dice, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Rapid innovation, changing business landscapes, and new IT demands force businesses to make changes quickly. In the eyes of many, containers are at the brink of becoming a pervasive technology in enterprise IT to accelerate application delivery. In this presentation, attendees learned about the: The transformation of IT to a DevOps, microservices, and container-based architecture What are containers and how DevOps practices can operate in a container-based environment A demonstration of how ...
Application transformation and DevOps practices are two sides of the same coin. Enterprises that want to capture value faster, need to deliver value faster – time value of money principle. To do that enterprises need to build cloud-native apps as microservices by empowering teams to build, ship, and run in production. In his session at @DevOpsSummit at 19th Cloud Expo, Neil Gehani, senior product manager at HPE, discussed what every business should plan for how to structure their teams to delive...
Without lifecycle traceability and visibility across the tool chain, stakeholders from Planning-to-Ops have limited insight and answers to who, what, when, why and how across the DevOps lifecycle. This impacts the ability to deliver high quality software at the needed velocity to drive positive business outcomes. In his general session at @DevOpsSummit at 19th Cloud Expo, Phil Hombledal, Solution Architect at CollabNet, discussed how customers are able to achieve a level of transparency that e...
IT leaders face a monumental challenge. They must figure out how to sort through the cacophony of new technologies, buzzwords, and industry hype to find the right digital path forward for their organizations. And they simply cannot afford to fail. Those organizations that are fastest to the right digital path will be the ones that win. The path forward, however, is strewn with the legacy of decisions made long ago — often before any of the current leadership team assumed their roles. While it’s ...
Kubernetes is a new and revolutionary open-sourced system for managing containers across multiple hosts in a cluster. Ansible is a simple IT automation tool for just about any requirement for reproducible environments. In his session at @DevOpsSummit at 18th Cloud Expo, Patrick Galbraith, a principal engineer at HPE, discussed how to build a fully functional Kubernetes cluster on a number of virtual machines or bare-metal hosts. Also included will be a brief demonstration of running a Galera MyS...
As we enter the final week before the 19th International Cloud Expo | @ThingsExpo in Santa Clara, CA, it's time for me to reflect on six big topics that will be important during the show. Hybrid Cloud: This general-purpose term seems to provide a comfort zone for many enterprise IT managers. It sounds reassuring to be able to work with one of the major public-cloud providers like AWS or Microsoft Azure while still maintaining an on-site presence.
In his general session at 19th Cloud Expo, Manish Dixit, VP of Product and Engineering at Dice, discussed how Dice leverages data insights and tools to help both tech professionals and recruiters better understand how skills relate to each other and which skills are in high demand using interactive visualizations and salary indicator tools to maximize earning potential. Manish Dixit is VP of Product and Engineering at Dice. As the leader of the Product, Engineering and Data Sciences team at D...
Between 2005 and 2020, data volumes will grow by a factor of 300 – enough data to stack CDs from the earth to the moon 162 times. This has come to be known as the ‘big data’ phenomenon. Unfortunately, traditional approaches to handling, storing and analyzing data aren’t adequate at this scale: they’re too costly, slow and physically cumbersome to keep up. Fortunately, in response a new breed of technology has emerged that is cheaper, faster and more scalable. Yet, in meeting these new needs they...