|
Since June 2005, when Diligent launched ProtecTIER, Diligent’s Enterprise de-duplication platform, we have accumulated a great deal of knowledge about this newly emerging discipline. Specifically, we have become domain experts in the process of evaluating, deploying and operating de-duplication solutions in very large environments. We have applied a methodical, transparent, diligent approach and have learned from our own mistakes, as well as through the validation and refutation of various conceptions and misconceptions about de-duplication. Enterprise storage operates in a mission-critical environment, time and resources are scarce, and there is a lot of hyperbole that must be sorted through. The goal of this document is to help you seek the pertinent information from vendors, as the basis for sound decision-making.

Diligent is an enterprise-focused de-duplication vendor. Having launched its flagship de-duplication solution, ProtecTIER, in mid-2005, Diligent now has more than two years of experience deploying and supporting de-duplication solutions with enterprise accounts. Through this, Diligent has developed an extensive knowledge base about what it takes to properly plan for and deploy a de-duplication solution into a given environment.

Diligent was founded in 2002 by ex-EMC executives Moshe Yanai and Doron Kempel. From the beginning, the company focused on launching a de-duplication platform for the high-end customer environment. The team believed that addressing the de-duplication needs of large data centers would require a unique algorithm. Six mathematicians were tasked with research into a massively scalable de-duplication algorithm that would combine the following attributes: (1) inline de-duplication throughput of 400MB/s per system; (2) storage scalability of up to 1PB per single system; (3) fine-grain, source-agnostic de-duplication; (4) 100% data integrity. By mid-2004, the mathematicians had accomplished their mission. The new algorithm was named HyperFactor.
Broadly defined, de-duplication is a method for finding and eliminating redundant data from the network and/or storage infrastructure. The great benefit of de-duplication technology is that it can dramatically increase the effective capacity of a given storage pool. This can translate directly to hundreds of thousands or millions of dollars saved in the cost of storage. But harnessing this promised value is not always straightforward. There is a wide range of de-duplication offerings on the market today, and these solutions vary greatly in their underlying de-duplication algorithms, system architecture, functional use case, strengths and weaknesses. Because of these differences, and the associated risks, Diligent strongly encourages end users to consider a range of factors beyond the ability to de-duplicate data when they are evaluating solutions. In speaking with our customers over the past several years, we have identified a list of key criteria that an enterprise-class solution must meet. The table on the next page outlines these key criteria and their importance.

The capacity expansion effect of de-duplication is often expressed as a de-duplication ratio. In essence, the de-duplication ratio is the ratio of nominal data (the sum of all User Data backup streams) to the physical storage used. This ratio can grow to 10:1, 20:1, 30:1 or more. A 10:1 ratio means that the system manages 10 times more nominal data than the physical space required to store it. Later in this document, we shed light on some of the confusion and hype surrounding de-duplication ratios, and on the fact that some vendors apply different definitions that, at first sight, may seem dramatically superior but in essence are not.

The realized de-duplication ratio for a given customer depends heavily on three key variables: the Data Retention Period, the Data Change Rate, and the Backup Practice. Each of these factors is defined below:

1. Data Retention Period: The period of time (usually measured in days) for which customers keep their backed-up data on the disk-based de-duplication system. This period typically ranges from 30 to 90 days, but can be much longer. Figure 1 demonstrates the impact of data retention on the factoring ratio.

2. Data Change Rate: The rate at which the data received from the backup application changes from backup to backup. This measurement has most relevance when “like” backup policies are compared. (Data change rates typically range from 1% to more than 25%, but are difficult to observe directly.) The smaller the change rate, the greater the opportunity to eliminate duplicate data, that is, data that has not changed relative to what is already stored on the system.
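
To make the ratio definition and the influence of the first two variables concrete, the following is a minimal sketch in Python. The function `dedup_ratio` applies the definition given above (nominal data divided by physical storage used). The function `estimate_ratio` is an illustrative assumption, not Diligent's sizing method or the HyperFactor algorithm: it supposes equal-sized full backups, one per retention interval, where the first backup is stored in full and every later backup adds only its changed fraction. All names and parameters are hypothetical.

```python
def dedup_ratio(nominal: float, physical: float) -> float:
    """De-duplication ratio: nominal data divided by physical storage used.

    Both arguments must be in the same unit (for example, TB).
    """
    if physical <= 0:
        raise ValueError("physical storage must be positive")
    return nominal / physical


def estimate_ratio(full_backup_size: float,
                   retained_backups: int,
                   change_rate: float) -> float:
    """Estimate the ratio under a deliberately simple model (an assumption).

    Model: every retained backup is a full backup of the same size; the
    first is stored whole, and each later one adds only `change_rate`
    times the full size in new physical data.
    """
    if retained_backups < 1:
        raise ValueError("at least one backup must be retained")
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")

    # Nominal data: the sum of all backup streams kept on the system.
    nominal = full_backup_size * retained_backups
    # Physical data: one full copy plus the changed portion of each later backup.
    physical = full_backup_size * (1 + (retained_backups - 1) * change_rate)
    return dedup_ratio(nominal, physical)


if __name__ == "__main__":
    # Compare a few retention periods and change rates side by side.
    for backups in (30, 60, 90):
        for rate in (0.01, 0.05, 0.25):
            ratio = estimate_ratio(1.0, backups, rate)
            print(f"{backups} backups, {rate:.0%} change: {ratio:.1f}:1")
```

Under this simplified model, a longer retention period or a smaller change rate both raise the estimated ratio, which matches the direction described in the definitions above. Real results also depend on the backup practice and on the de-duplication algorithm itself, neither of which this sketch models.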
|