Microservices Expo: Article

Deduplication: When, Where and How

Deduplication gives you the ability to do more with less

Nearly every enterprise can benefit from deduplication. Business data has been growing exponentially. Routine backups of that data have become too costly or simply ineffective. Deduplication can help by reducing the cost of primary and secondary storage. Essentially, limited resources are made much more effective and efficient.

What most organizations don't realize is how much deduplication technology has matured. Originally, deduplication was used as an alternative to tape for backup and disaster recovery. This use case continues today and has become one of the predominant solutions for data protection. As the technology has matured, it has evolved from a point solution at the end of the backup chain (the target) into a player at every step of the backup process: at the client, on the network, at the media server, and at the target device. Backup and storage vendors are implementing this technology in all aspects of their solutions.

Storage vendors have also recognized the efficiencies available by deduping data. In addition to implementing space efficiency technologies in their storage arrays, they're offering deduplication as a way to both improve available capacity and optimize data transmission when replicating data.

Versatile Technology
With these advancements it's possible to leverage deduplication to solve a variety of storage problems. In the data protection space, IT departments face increasing pressure to offer faster backups and even faster restores, and to do so with fewer resources than in the past. Data protection solutions that offer deduplication can, at the very least, significantly reduce the cost of protection to disk - often by a factor of more than 20.

However, and perhaps more important, recovering lost information from these solutions is typically a lot faster than legacy tape solutions. A properly designed data protection solution that leverages deduplication can often either completely eliminate tape, or relegate tape to an archival medium. In addition, many companies using such a solution are able to replicate all of their backup data from one site to another. This eliminates the need for third-party tape handling and greatly improves the recoverability of the enterprise's data.

For enterprises that employ a replication strategy, deduplication can offer significant efficiencies depending on the data being replicated. If the data has a great deal of repetition or commonality, dedupe can offer tremendous boosts in performance. However, if the data is not very repetitious, deduplication will not offer as great an improvement. For most replication types, enterprises can expect a 2x to 4x reduction in bandwidth requirements.
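To make the bandwidth effect concrete, here is a minimal sketch of dedupe-aware replication. It assumes fixed-size blocks and SHA-256 fingerprints (hashing is covered under Technique Pros and Cons). The function name `replicate`, the in-memory `target_store` dictionary, and the sample data are hypothetical stand-ins for illustration, not any vendor's protocol, and the sketch ignores the small cost of exchanging the fingerprints themselves.

```python
import hashlib
from typing import Dict, List, Tuple


def replicate(blocks: List[bytes], target_store: Dict[bytes, bytes]) -> Tuple[int, int]:
    """Copy blocks to the target, skipping any block the target already holds.

    Returns (bytes_sent, bytes_logical). Writing into target_store stands in
    for the network transfer (an assumption made for this illustration).
    """
    bytes_sent = 0
    bytes_logical = 0
    for block in blocks:
        bytes_logical += len(block)
        digest = hashlib.sha256(block).digest()
        if digest not in target_store:
            target_store[digest] = block  # only unseen blocks cross the wire
            bytes_sent += len(block)
    return bytes_sent, bytes_logical


if __name__ == "__main__":
    target: Dict[bytes, bytes] = {}
    day1 = [b"a" * 4096, b"b" * 4096, b"c" * 4096, b"d" * 4096]
    day2 = [b"a" * 4096, b"b" * 4096, b"c" * 4096, b"e" * 4096]
    for label, backup in (("day 1", day1), ("day 2", day2)):
        sent, logical = replicate(backup, target)
        print(f"{label}: sent {sent} of {logical} bytes")
```

In this toy run the second day's data shares three of its four blocks with the first, so only a quarter of it has to be sent. Real-world ratios depend entirely on how repetitive the replicated data is.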

More and more, storage vendors are offering deduplication on primary storage. Primary storage dedupe is a good idea when the data being stored has a lot of commonality - in other words, similar data being stored in one location. A good example of this is virtual environments. In such a situation, virtual machines are stored as big files. Each has a lot in common with the others - the operating system, unused blank space and, in many cases, the applications themselves. Disk devices that can do primary storage deduplication are able to reduce all of this common data to a single instance. Regardless of the hypervisor used - VMware, Hyper-V and so on - there is a huge amount of commonality between the virtual machine instances. In fact, it's common to be able to reduce storage requirements in virtual environments by over 80% through deduplication.
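The figures quoted so far mix two conventions: reduction ratios (20x, 2x to 4x) and percentage savings (80%). The short sketch below converts between them so they can be compared directly. The function names are hypothetical helpers introduced here, not part of any product.

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of capacity saved for a reduction ratio such as 20 (meaning 20x)."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return 1 - 1 / ratio


def ratio_from_savings(savings: float) -> float:
    """Reduction ratio implied by a fractional saving such as 0.80 (meaning 80%)."""
    if not 0 <= savings < 1:
        raise ValueError("savings must be in the range [0, 1)")
    return 1 / (1 - savings)


if __name__ == "__main__":
    for ratio in (2, 4, 20):
        print(f"{ratio}x reduction saves {savings_from_ratio(ratio):.0%}")
    print(f"80% savings is a {ratio_from_savings(0.80):.1f}x reduction")
```

Seen this way, an 80% saving on primary storage corresponds to a 5x reduction, while a 20x reduction on backup data corresponds to a 95% saving. The gain from each extra "x" shrinks quickly: going from 2x to 4x saves a further 25% of the original capacity, while going from 4x to 20x saves only a further 20%.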

Other primary environments, however, don't present a lot of common data, and thus will not benefit from deduplication. What's more, the process of uncovering which blocks of data have been seen before is expensive in both compute resources and I/O bandwidth. Both of those are at a premium in storage array controllers. A knowledgeable designer will typically look at the application type, the data type, and the resources available on the storage array that's doing the dedupe. Once all of these variables are factored together, it's possible to decide if it makes sense to use deduplication on primary storage.

Technique Pros and Cons
While there's a lot to consider when designing a deduplication strategy, a lot of the decisions are fairly nuanced. For example, the two most common techniques for performing deduplication are hashing and delta differencing. Backup appliances use one or the other, or in some instances a hybrid of the two. Which is the preferred technique depends on who you're talking to.

At a high level, hashing and delta differencing are very similar. The net effect of both is that common patterns of data are reduced and you end up with a greatly reduced storage requirement. The difference is in how you determine whether there is a common pattern of data. With hashing implementations, the vendors run small blocks of the data through a mathematical algorithm (a hash function) and compare the result with the results for blocks they have already stored. Because two different blocks could, in theory, produce the same result, a match does not offer 100% certainty that a piece of data has been seen before. However, statistically it is almost a certainty - so much so that you'd be more likely to win the mega lottery dozens of times in a row. The consensus is that this is good enough, and most vendors have used hashing to develop their solutions.
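The hashing approach can be sketched in a few lines. This is a minimal illustration, assuming fixed 4 KiB blocks, SHA-256 as the hash, and an in-memory dictionary as the block store. Those choices, and the names `dedupe_store` and `rehydrate`, are assumptions made here. Real appliances often use variable-size chunking and on-disk indexes.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def dedupe_store(data: bytes, store: Dict[bytes, bytes],
                 block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into blocks, keep one copy of each unique block in store,
    and return the ordered list of hashes (the "recipe") needed to rebuild it."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # a matching hash is treated as "seen before"
        recipe.append(digest)
    return recipe


def rehydrate(recipe: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original data by looking up each hash in the store."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    store: Dict[bytes, bytes] = {}
    # Two toy "virtual machine images" that share the same first two blocks.
    image_a = b"O" * 8192 + b"A" * 4096
    image_b = b"O" * 8192 + b"B" * 4096
    recipe_a = dedupe_store(image_a, store)
    recipe_b = dedupe_store(image_b, store)
    logical_blocks = len(recipe_a) + len(recipe_b)
    print(f"{logical_blocks} logical blocks stored as {len(store)} unique blocks")
    assert rehydrate(recipe_a, store) == image_a
    assert rehydrate(recipe_b, store) == image_b
```

Note that the store never compares block contents. It trusts the hash, which is exactly the tradeoff described above.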

For reasons involving technical implementation, performance tradeoffs, and arguably higher reliability, some vendors have chosen to develop their solutions through delta differencing. With this technology, each small piece of data is actually compared, bit for bit, with everything that has been seen before. This guarantees that the data has or has not already been seen.
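The guarantee described here comes from comparing the actual bytes rather than trusting a fingerprint. The sketch below is not a delta differencing implementation from any vendor. It only illustrates the compare-before-sharing idea, using a cheap checksum (CRC-32, chosen here as an assumption) to narrow the candidates and a full comparison to confirm. The names `store_verified` and `load` are hypothetical.

```python
import zlib
from typing import Callable, Dict, List, Tuple

Ref = Tuple[int, int]  # (bucket key, position within the bucket)


def store_verified(block: bytes, buckets: Dict[int, List[bytes]],
                   fingerprint: Callable[[bytes], int] = zlib.crc32) -> Ref:
    """Store block unless a byte-identical block already exists.

    The fingerprint only selects which stored blocks to compare against;
    a block is shared only after a full byte-for-byte match.
    """
    key = fingerprint(block)
    candidates = buckets.setdefault(key, [])
    for index, existing in enumerate(candidates):
        if existing == block:  # full comparison, so no false matches
            return key, index
    candidates.append(block)
    return key, len(candidates) - 1


def load(ref: Ref, buckets: Dict[int, List[bytes]]) -> bytes:
    """Fetch a stored block by its reference."""
    key, index = ref
    return buckets[key][index]


if __name__ == "__main__":
    buckets: Dict[int, List[bytes]] = {}
    # Force every block into the same bucket to simulate fingerprint collisions.
    always_collide = lambda block: 0
    first = store_verified(b"one", buckets, always_collide)
    second = store_verified(b"two", buckets, always_collide)
    repeat = store_verified(b"one", buckets, always_collide)
    print(first, second, repeat)
    assert load(second, buckets) == b"two"
```

Even when every fingerprint collides, distinct blocks stay distinct. The cost is an extra read and comparison of stored data for each candidate match, which is one of the performance tradeoffs mentioned above.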

Regardless of the implementation used, you are far more likely to lose data to an external failure - a power outage, water damage, a satellite falling on the data center - than to a flaw in the technology that determines whether two pieces of data are the same. In most deduplication designs it's more important to focus on the features and functionality of the overall solution than on this specific level of detail.
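For a rough sense of how small the hashing risk is, the standard birthday-bound approximation can be evaluated directly. The sketch assumes an ideal hash with uniformly distributed output. The function name and the example sizes (1 PiB of unique data in 4 KiB blocks) are assumptions chosen for illustration, not figures from any product.

```python
import math


def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Approximate chance that any two distinct blocks share a hash value.

    Uses the birthday bound p ~= 1 - exp(-n(n-1)/2 / 2**bits), which assumes
    an ideal, uniformly distributed hash function.
    """
    pairs = num_blocks * (num_blocks - 1) / 2
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    blocks = 2 ** 50 // 2 ** 12  # 1 PiB of unique data in 4 KiB blocks
    for bits in (160, 256):
        p = collision_probability(blocks, bits)
        print(f"{bits}-bit hash: collision probability about {p:.1e}")
```

Under these assumptions the probabilities are many orders of magnitude below the failure rates of the hardware and facilities involved, which is consistent with the point made above.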

Another topic has to do with the timing of deduplication. Inline deduplication processes dedupe backup data in real-time, as it's received at the front end of the Virtual Tape Library (VTL) or Disk-to-Disk (D2D) device. Post-process methods, on the other hand, remove duplicate data after the backup has completed. Regardless of which method is used, the same amount of work is being done.

The question of whether it makes more sense to do inline or post-process deduplication can best be answered by, "it depends." Regardless of when you do it, deduplication is inherently an expensive thing to do in terms of CPU and I/O resources. Choosing between inline and post-process is essentially choosing between paying for the service upfront or after. With some vendors' technologies, you have no choice. You have to use either inline or post. With others you get the choice, although it's something of a black art to figure out when to best use one versus the other.
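The "pay upfront or after" tradeoff can be illustrated with a small sketch. It assumes hash-based deduplication over fixed-size blocks. The function names and the use of a Python list as the landing area are hypothetical simplifications. Both paths do the same deduplication work and end with the same stored data; they differ in when the work happens and in how much temporary space is needed.

```python
import hashlib
from typing import Dict, List, Tuple


def _dedupe(blocks: List[bytes], store: Dict[bytes, bytes]) -> List[bytes]:
    """Shared dedupe step: keep unique blocks in store, return the recipe."""
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return recipe


def backup_inline(stream: List[bytes],
                  store: Dict[bytes, bytes]) -> Tuple[List[bytes], int]:
    """Dedupe each block as it arrives. No landing area is needed, but the
    hashing cost is paid during the backup window."""
    return _dedupe(stream, store), 0


def backup_post_process(stream: List[bytes],
                        store: Dict[bytes, bytes]) -> Tuple[List[bytes], int]:
    """Land the full backup first, then dedupe it afterwards.

    Returns the recipe and the peak landing space used, in bytes.
    """
    landing = list(stream)  # step 1: ingest at full size
    peak_bytes = sum(len(block) for block in landing)
    recipe = _dedupe(landing, store)  # step 2: dedupe after the backup completes
    landing.clear()  # step 3: reclaim the landing area
    return recipe, peak_bytes


if __name__ == "__main__":
    stream = [b"a" * 4096] * 3 + [b"b" * 4096]
    inline_store: Dict[bytes, bytes] = {}
    post_store: Dict[bytes, bytes] = {}
    inline_recipe, inline_peak = backup_inline(stream, inline_store)
    post_recipe, post_peak = backup_post_process(stream, post_store)
    print("same result:", inline_store == post_store and inline_recipe == post_recipe)
    print("landing space used:", inline_peak, "vs", post_peak, "bytes")
```

The sketch leaves out timing entirely. In practice the choice hinges on whether the device can hash fast enough to keep up with ingest, and on whether there is room to land a full backup before it is reduced.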

Typically it comes down to weighing the speed of ingest (how fast you get the data into the device) against the speed of rehydration (how fast you get the data back), and striking a balance between the two. Our best recommendation is to work with someone who has earned the scar tissue from using both of these technologies.

Achieving Maximum Efficiency
Now that deduplication is so prevalent, the challenge most of our customers face is identifying which one to use and when. This is particularly difficult since each vendor unequivocally states that their solution is better than everyone else's and is the "one true way." In reality, there are no simple black and white answers and each solution's merits must be weighed individually.

To develop the best possible deduplication solution, it's important to first determine the problem you're trying to solve. Conduct an internal analysis, and then approach a partner who has an unbiased approach to solving the issue at hand. The right partner can help you sort through the hype and identify solutions and best practices that will align with your business needs.

The benefits of deduplication are many. Capital expenses are greatly reduced; you need fewer disks, less tape, and less bandwidth to accomplish the same task. If used appropriately, deduplication will also improve your operational efficiencies, which you can then leverage to reduce your operational expenses.

Simply put, deduplication gives you the ability to do more with less. Whether in networking, primary storage, backup or for data archival protection, a well-designed deduplication solution can help you mitigate the challenges of big data - and keep your IT landscape lean, fast and efficient.

By Juan Orlandini

A practice manager for Datalink, Juan Orlandini is a 25+ year veteran of the open systems IT industry. Throughout his career, he has been involved in the design and deployment of many large and advanced storage, data protection, and high availability infrastructures.

Juan evaluates next-generation technologies for Datalink and also works with end users, assisting them with architecting and implementing strategic data center architectures. In his current role, he is developing managed services offerings designed to help companies optimize staff productivity and data center efficiency. He continues to evaluate industry solutions and customer needs, and blogs about them at blog.datalink.com.
