
Evaluating Data Deduplication Solutions: Looking Beyond the Hype

NetApp Data Protection
By : NetApp Data Protection
Published : Sep 30, 2007
Length : 11
Type : White Paper
 
Overview :

Data deduplication is without a doubt one of the hottest technologies in storage right now. However, with many different deduplication approaches from various vendors - each hyping their own unique benefits - you can get caught up deciding on the factors that are most important for your organization. This white paper goes beyond the hype and highlights the important design aspects of data deduplication to help you make informed decisions.

Download this white paper to learn about the different methodologies used in designing deduplication solutions and what you should know about each, including:

  • Hashing
  • Indexing
  • Inline and postprocessing
  • Source and destination
  • Space savings efficiency

 


 

Data Deduplication

 

Deduplication is without a doubt one of this year’s hottest topics in data storage. The rationale behind deduplication is simple: Eliminate your duplicate data and reduce the capacity needed during backups and other data copy activities. Unfortunately, the many different deduplication approaches from various vendors, each with much hype about its unique benefits, can leave users bewildered. As users consider the variety of deduplication offerings, they often overlook the basic design differences that matter most to them.

This paper looks beyond the hype and focuses on the important design aspects of deduplication, giving evaluators the information they need to make informed decisions when examining deduplication solutions.

What Is Deduplication?

Deduplication is the process of “unduplicating” data. The term deduplication was coined by database administrators many years ago as a way of describing the process of removing duplicate database records after two databases have been merged.

In the context of disk storage, deduplication refers to any algorithm that searches for duplicate data objects, such as blocks, chunks, or files, and discards these duplicates. When a duplicate object is detected, its reference pointers are modified so that the object can still be located and retrieved, but it “shares” its physical location with other identical objects. This data sharing is the foundation of all types of data deduplication.

How Does Deduplication Work?

Regardless of operating system, application, or file system type, all data objects are written to a storage system using a data reference pointer, without which the data could not be referenced or retrieved. In traditional (non-deduplicated) file systems, data objects are stored without regard to any similarity with other objects in the same file system. In Figure 1, five identical objects are stored in a file system, each with a separate data pointer. Although all five data objects are identical, each is stored as a separate instance and each consumes physical disk space.

In a deduplicated file system, two new and important concepts are introduced:

  • A catalog of all data objects is maintained. This catalog contains a record of all data objects using a “hash” that identifies the unique contents of each object. Hashing is discussed in detail in “Deduplication Design Considerations,” later in this paper.
  • The file system is capable of allowing many data pointers to reference the same physical data object.


Cataloging data objects, comparing the objects, and redirecting reference pointers form the basis of the deduplication algorithm. As shown in Figure 2, referencing several identical objects with a single master object allows the space that is normally occupied by the duplicate objects to be given back to the storage system.
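The catalog and the shared data pointers can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation: the class name DedupStore, its fields, and the choice of SHA-1 for the catalog hash are all assumptions made for the example, and each object name is assumed to be written only once.

```python
import hashlib


class DedupStore:
    """Toy deduplicated store (hypothetical names, for illustration only)."""

    def __init__(self):
        self.blocks = {}      # physical location -> stored bytes
        self.catalog = {}     # hash of contents -> physical location
        self.pointers = {}    # object name -> physical location
        self.refcount = {}    # physical location -> number of pointers sharing it
        self.next_location = 0

    def write(self, name, data):
        # Assumes each name is written once; overwrites are not handled.
        fingerprint = hashlib.sha1(data).hexdigest()
        location = self.catalog.get(fingerprint)
        if location is not None and self.blocks[location] == data:
            # Duplicate: redirect the pointer to the existing master object.
            self.refcount[location] += 1
        else:
            # New content: consume physical space and record it in the catalog.
            location = self.next_location
            self.next_location += 1
            self.blocks[location] = data
            self.refcount[location] = 1
            self.catalog.setdefault(fingerprint, location)
        self.pointers[name] = location

    def read(self, name):
        return self.blocks[self.pointers[name]]

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
payload = b"x" * 4096
for i in range(5):                    # five identical objects, as in Figure 1
    store.write(f"object-{i}", payload)
store.write("other", b"y" * 4096)     # one object with different contents

print(len(store.pointers), "objects,", len(store.blocks), "physical blocks")
print(store.physical_bytes(), "bytes consumed")
```

The five identical objects share one physical block, so the store holds six pointers but only two blocks. The byte-for-byte check on a catalog hit is a design choice for this sketch; it guards against two different objects that happen to produce the same hash.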

Deduplication Design Considerations

Because all deduplication vendors must maintain some form of catalog and support some form of block referencing, the variety of implementations is surprising (and each differs subtly enough from the others to be patented). The following sections explain the methods that vendors use when designing deduplication solutions.

Hashing

Data deduplication begins with a comparison of two data objects. It would be impractical (and very arduous) to scan an entire data volume for duplicate objects each time a new object is written to that volume. For that reason, deduplication vendors create small hash values for each new object, and store these values in a catalog.

A hash value, also called a digital fingerprint (and sometimes, loosely, a digital signature), is a small number that is generated from a longer string of data. A hash value is substantially smaller than the data object itself, and is generated by a mathematical formula in such a way that it is unlikely (although not impossible) for two nonidentical data objects to produce the same hash value.
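As a small illustration of how much smaller a hash value is than the object it describes, the sketch below fingerprints 4 KB blocks with SHA-1 from Python's standard hashlib module. The 4 KB block size and the helper name fingerprint are assumptions chosen for the example, not details from any particular product.

```python
import hashlib


def fingerprint(data: bytes) -> bytes:
    """Return the 20-byte SHA-1 digest of a data object."""
    return hashlib.sha1(data).digest()


block_a = bytes(4096)                 # 4 KB of zero bytes
block_b = bytes(4096)                 # identical contents, separate object
block_c = bytes(4095) + b"\x01"       # differs from block_a in the last byte

fp_a = fingerprint(block_a)
fp_b = fingerprint(block_b)
fp_c = fingerprint(block_c)

print(len(block_a), "byte object ->", len(fp_a), "byte hash value")
print("identical objects share a hash value:", fp_a == fp_b)
print("a one-byte change alters the hash value:", fp_a != fp_c)
```

Comparing two 20-byte values is far cheaper than comparing two 4 KB blocks, which is why the catalog stores hash values rather than the objects themselves.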

A hash value can be as simple as a parity calculation or as elaborate as a cryptographic hash such as SHA-1 or MD5. In any case, once the hash values have been created, they can be easily compared and deduplication candidates can be identified.
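The trade-off between a simple hash and an elaborate one can be seen in a short Python sketch. Here a one-byte XOR parity stands in for the "simple" end of the range; the function names xor_parity and find_duplicates are hypothetical. Because such a weak hash often gives the same value for different objects, a matching hash only marks a deduplication candidate, and this sketch confirms each candidate with a byte-for-byte comparison.

```python
from collections import defaultdict


def xor_parity(data: bytes) -> int:
    """Very weak hash: XOR of all bytes, giving a value from 0 to 255."""
    parity = 0
    for byte in data:
        parity ^= byte
    return parity


def find_duplicates(objects):
    """Return (duplicate_name, original_name) pairs for identical objects."""
    by_hash = defaultdict(list)       # hash value -> names of distinct objects
    duplicates = []
    for name, data in objects.items():
        for earlier in by_hash[xor_parity(data)]:
            # Same hash value: a candidate only. Verify the actual bytes.
            if objects[earlier] == data:
                duplicates.append((name, earlier))
                break
        else:
            # No identical object seen yet; remember this one as distinct.
            by_hash[xor_parity(data)].append(name)
    return duplicates


objects = {"a": b"ab", "b": b"ba", "c": b"ab"}

print(xor_parity(b"ab") == xor_parity(b"ba"))   # different data, same parity
print(find_duplicates(objects))
```

In this example "b" collides with "a" under parity but is correctly kept as a separate object, while "c" is reported as a duplicate of "a". A stronger hash makes such false candidates much rarer, at the cost of more computation per object.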

 
