Microservices Expo: Article

Deduplication: When, Where and How

Deduplication gives you the ability to do more with less

Nearly every enterprise can benefit from deduplication. Business data has been growing exponentially. Routine backups of that data have become too costly or simply ineffective. Deduplication can help by reducing the cost of primary and secondary storage. Essentially, limited resources are made much more effective and efficient.

What most organizations don't realize is how much deduplication technology has matured. Originally, deduplication was used as an alternative to tape for backup and disaster recovery. This use case continues today and has become one of the predominant solutions for data protection. As it has matured, it has begun to evolve from being a point solution at the end of a backup chain (the target) to a player in every step of the backup process: at the client side, at the network side, at the media server side, as well as at the target device. Backup and storage vendors are implementing this technology in all aspects of their solutions.

Storage vendors have also recognized the efficiencies available by deduping data. In addition to implementing space efficiency technologies in their storage arrays, they're offering deduplication as a way to both improve available capacity and optimize data transmission when replicating data.

Versatile Technology
With these advancements it's possible to leverage deduplication to solve a variety of storage problems. In the data protection space, IT departments face increasing pressure to offer faster backups, even faster restores, and to do them with fewer resources than in the past. Data protection solutions that offer deduplication can, at the very least, significantly reduce the cost of protection to disk - often by a factor of 20 or more.
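
Reduction figures are quoted sometimes as a ratio ("20x") and sometimes as a percentage saved, and converting between the two helps when comparing claims. The following is a minimal sketch of that arithmetic; the function names are illustrative and not taken from any product.

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of capacity saved for a given reduction ratio (e.g. 20 for 20x)."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Reduction ratio for a given fraction saved (e.g. 0.80 for 80%)."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in the range [0, 1)")
    return 1.0 / (1.0 - savings)


print(f"{savings_from_ratio(20):.0%}")     # 95%
print(f"{ratio_from_savings(0.80):.1f}x")  # 5.0x
```

In other words, a 20x reduction means storing about 5% of the original data, while an 80% saving corresponds to a 5x reduction.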

However, and perhaps more important, recovering lost information from these solutions is typically a lot faster than legacy tape solutions. A properly designed data protection solution that leverages deduplication can often either completely eliminate tape, or relegate tape to an archival medium. In addition, many companies using such a solution are able to replicate all of their backup data from one site to another. This eliminates the need for third-party tape handling and greatly improves the recoverability of the enterprise's data.

For enterprises that employ a replication strategy, deduplication can offer significant efficiencies depending on the data being replicated. If the data has a great deal of repetition or commonality, dedupe can offer tremendous boosts in performance. However, if the data is not very repetitious, deduplication will not offer as great an improvement. For most replication types, enterprises can expect a 2x to 4x reduction in bandwidth requirements.
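
To see why repetitive data replicates more cheaply, consider a sketch in which the source site sends only the blocks whose fingerprints the remote site does not already hold. The block size, the data set, and the function names here are assumptions chosen for illustration; real replication products differ in how they chunk and negotiate data.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real products vary


def fingerprints(data: bytes) -> list[bytes]:
    """SHA-256 digest of each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]


def blocks_to_send(data: bytes, remote_has: set[bytes]) -> int:
    """Count the distinct blocks the remote site does not already hold."""
    missing = set(fingerprints(data)) - remote_has
    return len(missing)


# Hypothetical data set: 100 blocks, of which only 30 are distinct.
distinct = [bytes([n]) * BLOCK_SIZE for n in range(30)]
data = b"".join(distinct[i % 30] for i in range(100))

sent = blocks_to_send(data, remote_has=set())
print(f"blocks in data set: 100, blocks sent: {sent}")
print(f"bandwidth reduction: {100 / sent:.1f}x")  # 3.3x
```

With less repetition in the data, more of the 100 blocks would be distinct and the reduction would shrink toward 1x.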

More and more, storage vendors are offering deduplication on primary storage. Primary storage dedupe is a good idea when the data that is being stored has a lot of commonality - in other words, similar data being stored in one location. A good example of this is virtual environments. In such a situation, virtual machines are being stored as big files. Each has a lot in common - the operating system, unused blank space and, in many cases, the applications themselves. Disk devices that can do primary storage deduplication would be able to reduce all of this data to a single instance. Regardless of the hypervisor used - VMware, Hyper-V and so on - there is a huge amount of commonality between each of the virtual machine instances. In fact, it's common to be able to reduce storage requirements in virtual environments by over 80% through deduplication.
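
The virtual machine case can be illustrated with synthetic data. The sketch below builds ten made-up "images" that share an operating system region and zero-filled unused space, then measures how many blocks would remain after deduplication. The proportions (60% shared, 30% blank, 10% unique) and the block size are assumptions picked for the example, not measurements of real environments.

```python
import hashlib
import os

BLOCK = 4096  # assumed block size


def unique_fraction(images: list[bytes]) -> float:
    """Fraction of blocks that must actually be stored after dedupe."""
    seen = set()
    total = 0
    for image in images:
        for i in range(0, len(image), BLOCK):
            seen.add(hashlib.sha256(image[i:i + BLOCK]).digest())
            total += 1
    return len(seen) / total


# Synthetic stand-ins for virtual machine images (not real disk formats):
# a shared "operating system" region, zero-filled unused space, and a
# small region of random data unique to each machine.
os_region = os.urandom(60 * BLOCK)
blank = bytes(30 * BLOCK)
images = [os_region + blank + os.urandom(10 * BLOCK) for _ in range(10)]

stored = unique_fraction(images)
print(f"blocks kept: {stored:.1%}, space saved: {1 - stored:.1%}")
```

Under these assumptions roughly 84% of the space is saved: the shared region is stored once, all blank space collapses to a single block, and only the per-machine data is kept in full.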

Other primary environments, however, don't present a lot of common data, and thus will not benefit from deduplication. What's more, the process of uncovering which blocks of data have been seen before is expensive in both compute resources and I/O bandwidth. Both of those are at a premium in storage array controllers. A knowledgeable designer will typically look at the application type, the data type, and the resources available on the storage array that's doing the dedupe. Once all of these variables are factored together, it's possible to decide if it makes sense to use deduplication on primary storage.

Technique Pros and Cons
While there's a lot to consider when designing a deduplication strategy, a lot of the decisions are fairly nuanced. For example, the two most common techniques for performing deduplication are hashing and delta differencing. Backup appliances use one or the other, or in some instances a hybrid of the two. Which is the preferred technique depends on who you're talking to.

At a high level, hashing and delta differencing are very similar. The net effect of both is that common patterns of data are reduced and you end up with a greatly reduced storage requirement. The difference is in how you determine if there is a common pattern of data. With hashing implementations, the vendors run small blocks of the data through a mathematical algorithm and compute whether they have seen the same data before. This computation theoretically does not offer 100% certainty whether or not a piece of data has been seen before. However, statistically it is almost a certainty - so much so that you'd be more likely to win the mega lottery - dozens of times in a row. The consensus is that this is good enough, and most vendors have used hashing to develop their solutions.
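
A hashing implementation can be sketched in a few lines. This toy store uses fixed-size blocks and SHA-256, both of which are assumptions made for the example; vendors choose their own block sizes, chunking schemes, and hash functions, and the class name is hypothetical.

```python
import hashlib


class HashDedupeStore:
    """Toy block store that keeps one copy of each distinct block."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks: dict[bytes, bytes] = {}  # digest -> block contents

    def write(self, data: bytes) -> list[bytes]:
        """Store data and return the digest list ("recipe") needed to read it back."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).digest()
            # A matching digest is treated as "seen before"; the block
            # contents themselves are not compared.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read(self, recipe: list[bytes]) -> bytes:
        """Rehydrate the original data from its recipe."""
        return b"".join(self.blocks[d] for d in recipe)


store = HashDedupeStore()
backup = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
recipe = store.write(backup)
assert store.read(recipe) == backup
print(len(recipe), "blocks written,", len(store.blocks), "blocks stored")
# 4 blocks written, 2 blocks stored
```

The store trusts the digest alone, which is exactly where the theoretical (and extremely small) uncertainty described above comes from.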

For reasons involving technical implementation, performance tradeoffs, and arguably higher reliability, some vendors have chosen to develop their solutions through delta differencing. With this technology, each small piece of data is actually compared, bit for bit, with everything that has been seen before. This guarantees that the data has or has not already been seen.
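
The guarantee that comes from comparing the actual data can be shown with a small sketch. This is not how any delta differencing product is implemented; it only illustrates the idea that a match is declared after an exact comparison of contents rather than on a fingerprint alone. The class name and the use of CRC-32 to narrow the search are assumptions for the example.

```python
import zlib


class VerifyingStore:
    """Toy store that confirms every match with a full byte comparison."""

    def __init__(self):
        # A cheap checksum only narrows the search; it never decides a match.
        self.buckets: dict[int, list[bytes]] = {}

    def add(self, block: bytes) -> bool:
        """Store block unless identical bytes exist; return True if it was new."""
        candidates = self.buckets.setdefault(zlib.crc32(block), [])
        for existing in candidates:
            if existing == block:  # exact comparison of the contents
                return False
        candidates.append(block)
        return True


store = VerifyingStore()
print(store.add(b"payroll-2016-q1"))  # True: first time seen
print(store.add(b"payroll-2016-q1"))  # False: identical bytes already stored
print(store.add(b"payroll-2016-q2"))  # True: differs, so it is kept
```

The cost of this certainty is the extra reads and comparisons, which is one of the performance tradeoffs a designer has to weigh.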

Regardless of the implementation used, the odds favor external failures - power outage, water damage, a satellite falling on the data center - over failures in the technologies that determine how data bits are identified as the same. In most deduplication designs it's more important to focus on the features and functionality of the overall solution, rather than this specific level of detail.
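
To put a rough number on the hashing risk, the standard birthday bound gives an upper limit on the chance that any two distinct blocks share a digest. The sketch below assumes the hash behaves like a uniformly random function; the system size and the 256-bit digest length are hypothetical values chosen for illustration.

```python
def collision_probability_bound(num_blocks: int, hash_bits: int) -> float:
    """Upper bound on the chance that any two distinct blocks share a digest.

    Uses the birthday bound n * (n - 1) / 2 / 2**bits, which assumes the hash
    behaves like a uniformly random function.
    """
    pairs = num_blocks * (num_blocks - 1) // 2
    return pairs / 2 ** hash_bits


# Hypothetical system: 2**40 distinct 4 KiB blocks (4 PiB of unique data)
# fingerprinted with a 256-bit hash.
p = collision_probability_bound(2 ** 40, 256)
print(f"{p:.1e}")
```

Under these assumptions the bound is on the order of 1e-54, far smaller than the likelihood of the physical failures listed above.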

Another topic has to do with the timing of deduplication. Inline deduplication processes dedupe backup data in real-time, as it's received at the front end of the Virtual Tape Library (VTL) or Disk-to-Disk (D2D) device. Post-process methods, on the other hand, remove duplicate data after the backup has completed. Regardless of which method is used, the same amount of work is being done.
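
The difference in timing can be modeled with a short sketch. Both functions below hash every chunk exactly once and end with the same stored data; what differs is when the work happens and how much raw data sits on disk in the meantime. The function names and the simplified "landing area" are assumptions for illustration, not a description of any particular VTL or D2D device.

```python
import hashlib


def inline_ingest(chunks):
    """Dedupe each chunk as it arrives; only unique chunks ever reach disk."""
    store = {}
    peak_chunks_on_disk = 0
    for chunk in chunks:
        store.setdefault(hashlib.sha256(chunk).digest(), chunk)
        peak_chunks_on_disk = max(peak_chunks_on_disk, len(store))
    return store, peak_chunks_on_disk


def post_process_ingest(chunks):
    """Land every chunk first, then dedupe once the backup has completed."""
    landing_area = list(chunks)      # full backup written as-is
    peak_chunks_on_disk = len(landing_area)
    store = {}
    for chunk in landing_area:       # dedupe pass runs afterwards
        store.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return store, peak_chunks_on_disk


backup = [b"os", b"app", b"os", b"data", b"os", b"app"]
inline_store, inline_peak = inline_ingest(backup)
post_store, post_peak = post_process_ingest(backup)

assert inline_store == post_store  # same end state either way
print("unique chunks:", len(inline_store))  # unique chunks: 3
print("peak chunks on disk:", inline_peak, "inline vs", post_peak, "post-process")
```

In this model the inline path pays the hashing cost during ingest, while the post-process path ingests without that cost but needs room for the full backup until the dedupe pass finishes.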

The question of whether it makes more sense to do inline or post-process deduplication can best be answered by, "it depends." Regardless of when you do it, deduplication is inherently an expensive thing to do in terms of CPU and I/O resources. Choosing between inline and post-process is essentially choosing between paying for the service upfront or after. With some vendors' technologies, you have no choice. You have to use either inline or post. With others you get the choice, although it's something of a black art to figure out when to best use one versus the other.

Typically it comes down to balancing the speed of ingest (how fast you get the data into the device) against the speed of rehydration (how fast you get the data back). Your best bet is to work with someone who has earned the scar tissue from using both these technologies.

Achieving Maximum Efficiency
Now that deduplication is so prevalent, the challenge most of our customers face is identifying which approach to use and when. This is particularly difficult since each vendor unequivocally states that their solution is better than everyone else's and is the "one true way." In reality, there are no simple black and white answers and each solution's merits must be weighed individually.

To develop the best possible deduplication solution, it's important to first determine the problem you're trying to solve. Conduct an internal analysis, and then approach a partner who has an unbiased approach to solving the issue at hand. The right partner can help you sort through the hype and identify solutions and best practices that will align with your business needs.

The benefits of deduplication are many. Capital expenses are greatly reduced; you need fewer disks, less tape, and less bandwidth to accomplish the same task. If used appropriately, deduplication will also improve your operational efficiencies, which you can then leverage to reduce your operational expenses.

Simply put, deduplication gives you the ability to do more with less. Whether in networking, primary storage, backup or for data archival protection, a well-designed deduplication solution can help you mitigate the challenges of big data - and keep your IT landscape lean, fast and efficient.

More Stories By Juan Orlandini

A practice manager for Datalink, Juan Orlandini is a 25+ year veteran of the open systems IT industry. Throughout his career, he has been involved in the design and deployment of many large and advanced storage, data protection, and high availability infrastructures.

Juan evaluates next-generation technologies for Datalink and also works with end users, assisting them with architecting and implementing strategic data center architectures. In his current role, he is developing managed services offerings designed to help companies optimize staff productivity and data center efficiency. He continues to evaluate industry solutions and customer needs, and blogs about them at blog.datalink.com.
