
Comparing Deduplication Approaches: Technology Considerations for Enterprise Environments

By : SEPATON
INFORMATION
Published : Jun 19, 2008
Length : 8
Type : White Paper
 
Overview :

Deduplication is becoming an essential tool to help data center managers control exponential data growth in the backup environment. The methods used to accomplish deduplication vary widely, as do the levels of capacity optimization they can provide. Some techniques are well suited to small-to-medium sized backup environments, while others are optimized for larger enterprises.

Download this report and understand the various techniques used today to deduplicate data and unique deduplication considerations for enterprise environments.

Related Categories : Data Deduplication, Small Business Networks, Storage, Storage Management
 
The volume of data generated by companies today is growing explosively. More powerful computing technology and the evolution to an information-based economy are causing companies to generate more data than ever before. Backing up all of this data creates a new set of challenges. Companies typically back up the same data many times over its lifecycle. As a result, a single terabyte of new data can consume 50 to 60 times that capacity in backup storage over its lifetime.
In addition, laws such as the Health Insurance Portability and Accountability Act (HIPAA) and the Sarbanes-Oxley Act require some types of data to be stored for many years. They also require companies to be able to retrieve that data quickly and completely upon request. To deal with this overwhelming data growth and the related storage requirements, many companies are evaluating data deduplication technology. Data deduplication technology is software that compares data in new backup streams to data that has already been stored in order to identify and remove duplicates. For example, if only 5% of the data in a current backup stream has changed since the previous backup, the deduplication technology will store only that 5%. A record is kept of the duplicate data so that the files can be reassembled for data restores.
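The effect of the 5% example can be worked through with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes 50 full backups of one terabyte, that each backup changes 5% of the data relative to the previous one, and that none of the changed data repeats earlier content.

```python
def raw_capacity_tb(full_tb: float, backups: int) -> float:
    """Capacity used when every full backup is stored in its entirety."""
    return full_tb * backups


def dedup_capacity_tb(full_tb: float, backups: int, change_rate: float) -> float:
    """Capacity used when only changed data is stored after the first backup.

    Assumes the first backup is stored whole and each later backup adds
    only `change_rate` of the full size (a simplifying assumption).
    """
    return full_tb * (1 + (backups - 1) * change_rate)


# Assumed figures: 1 TB of data, 50 full backups, 5% change per backup.
raw_tb = raw_capacity_tb(1.0, 50)
dedup_tb = dedup_capacity_tb(1.0, 50, 0.05)
reduction = raw_tb / dedup_tb

print(f"without deduplication: {raw_tb:.2f} TB")
print(f"with deduplication:    {dedup_tb:.2f} TB")
print(f"capacity reduction:    {reduction:.1f}:1")
```

Under these assumptions the 50 backups occupy about 3.45 TB instead of 50 TB. Real ratios depend on the change rate, the retention period, and how well the deduplication method finds duplicates.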
Virtual tape libraries (VTLs) provide a level of performance and reliability that traditional physical tape systems cannot match. VTLs enable companies to back up data many times faster than tape, restore data quickly, and eliminate a variety of time-consuming manual tasks. However, without data deduplication, the cost of disk is higher than that of tape, forcing companies to use disk space carefully by keeping online retention times short and moving data to tape archive as quickly as possible. With data deduplication, this is no longer necessary. When used with hardware compression on a VTL, deduplication can deliver as much as 50:1 capacity reduction, making disk-based secondary storage and longer online data retention times cost-effective for the enterprise.
The methods used to accomplish deduplication vary widely, as do the levels of capacity optimization they can provide. Some techniques are well suited to small-to-medium sized backup environments, while others are optimal for enterprise-class environments. This article describes the techniques being used today to deduplicate data on VTLs and summarizes the backup environments and data protection objectives for which each technology is best suited.
The fundamental function of all deduplication is to compare the data in a backup set to the data that has already been stored, in order to identify duplicates and avoid storing them again. Performing this comparison at too granular a level (comparing every bit of backup data to every bit of previously stored data) would yield excellent deduplication results, but would be too time-consuming and processing-intensive to be feasible. Comparing data at too coarse a level would be faster, but would miss a significant amount of duplicate data. There are two general ways that deduplication technologies solve this dilemma: hash-based comparison, and the ContentAware comparison used in the SEPATON DeltaStor® deduplication software on an S2100®-ES2 virtual tape library (VTL).
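The granularity trade-off can be illustrated with a small experiment. The sketch below is not any vendor's method: it uses a hypothetical helper, duplicate_bytes, that splits two backups into fixed-size chunks and counts how many bytes of the newer backup sit in chunks that already appear in the older one. The 4,096-byte sample data and the single changed byte are assumptions chosen for illustration.

```python
def duplicate_bytes(previous: bytes, current: bytes, chunk_size: int) -> int:
    """Count bytes of `current` found in fixed-size chunks already in `previous`."""
    # Every chunk of the previous backup, compared by content.
    seen = {
        previous[i:i + chunk_size]
        for i in range(0, len(previous), chunk_size)
    }
    found = 0
    for i in range(0, len(current), chunk_size):
        chunk = current[i:i + chunk_size]
        if chunk in seen:
            found += len(chunk)
    return found


# Assumed sample data: a 4,096-byte backup, then the same data with one byte changed.
previous = bytes(range(256)) * 16
changed = bytearray(previous)
changed[1000] ^= 0xFF
current = bytes(changed)

for size in (4096, 512, 16):
    found = duplicate_bytes(previous, current, size)
    chunk_count = len(current) // size
    print(f"chunk size {size:>4}: {found} of {len(current)} bytes "
          f"recognized as duplicate, {chunk_count} chunks compared")
```

With one 4,096-byte chunk, the single changed byte makes the whole backup look new. With 16-byte chunks, all but 16 bytes are recognized as duplicates, at the cost of 256 comparisons instead of one. Finer comparison finds more duplicates but does more work.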
The hash-based approach runs incoming data through an algorithm that assigns a number (called a hash) to every chunk of data, computed so that different chunks are extremely unlikely to share the same value. It then compares the new hashes to those that have already been stored in a lookup table. If a new hash does not match any entry, the system stores the corresponding chunk of data and adds the new hash to the lookup table. If a new hash does match one in the lookup table, the system does not store the corresponding chunk again; it records where the chunk belongs so that the data can be reconstituted for restores. To restore data, it reassembles the stored chunks into full files. Over time, backups are broken into more and more chunks of data that are scattered across the disk. As a result, restoring files is processing-intensive and time-consuming.
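The hash-based steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular product: the class name HashDedupStore, the fixed-size chunking, the choice of SHA-256, and the in-memory dictionaries are all hypothetical choices made for the example.

```python
import hashlib


class HashDedupStore:
    """Toy hash-based deduplication store (illustrative assumptions only)."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}        # lookup table: hash -> stored chunk
        self.recipes: dict[str, list[str]] = {}   # backup name -> ordered chunk hashes

    def backup(self, name: str, data: bytes) -> int:
        """Store a backup and return the number of bytes newly written."""
        recipe = []
        new_bytes = 0
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # No match in the lookup table: store the chunk and its hash.
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            # Match or not, record the hash so the backup can be rebuilt.
            recipe.append(digest)
        self.recipes[name] = recipe
        return new_bytes

    def restore(self, name: str) -> bytes:
        """Reassemble a backup from its stored chunks, in order."""
        return b"".join(self.chunks[digest] for digest in self.recipes[name])


# Assumed sample data: two small backups that differ in one 4-byte chunk.
store = HashDedupStore(chunk_size=4)
monday = b"AAAABBBBCCCCDDDD"
tuesday = b"AAAABBBBXXXXDDDD"

stored_monday = store.backup("monday", monday)
stored_tuesday = store.backup("tuesday", tuesday)

print(f"monday wrote {stored_monday} bytes, tuesday wrote {stored_tuesday} bytes")
print(f"restore matches original: {store.restore('tuesday') == tuesday}")
```

In this sketch the second backup writes only the one chunk that changed. The restore step has to look up every chunk in the recipe individually, which hints at why restores from a heavily fragmented chunk store can become slow.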