Data Deduplication
Deduplication is without a doubt one of this year’s hottest topics in data storage. The rationale behind deduplication is simple: Eliminate your duplicate data and reduce the capacity needed during backups and other data copy activities. Unfortunately, the many different deduplication approaches from various vendors, with much hype about their unique benefits, can leave users bewildered. As users consider the variety of deduplication offerings, they often overlook the basic design differences that matter most in their own environments.
This paper looks beyond the hype and focuses on the important design aspects of deduplication, giving evaluators the information they need to make informed decisions when examining deduplication solutions.
What Is Deduplication?
Deduplication is the process of “unduplicating” data. The term deduplication was coined by database administrators many years ago as a way of describing the process of removing duplicate database records after two databases have been merged.
In the context of disk storage, deduplication refers to any algorithm that searches for duplicate data objects, such as blocks, chunks, or files, and discards these duplicates. When a duplicate object is detected, its reference pointers are modified so that the object can still be located and retrieved, but it “shares” its physical location with other identical objects. This data sharing is the foundation of all types of data deduplication.
How Does Deduplication Work?
Regardless of operating system, application, or file system type, all data objects are written to a storage system using a data reference pointer, without which the data could not be referenced or retrieved. In traditional (non-deduplicated) file systems, data objects are stored without regard to any similarity with other objects in the same file system. In Figure 1, five identical objects are stored in a file system, each with a separate data pointer. Although all five data objects are identical, each is stored as a separate instance and each consumes physical disk space.
In a deduplicated file system, two new and important concepts are introduced:
A catalog of all data objects is maintained. This catalog contains a record of all data objects using a “hash” that identifies the unique contents of each object. Hashing is discussed in detail in “Deduplication Design Considerations,” later in this paper.
The file system is capable of allowing many data pointers to reference the same physical data object.
Cataloging data objects, comparing the objects, and redirecting reference pointers form the basis of the deduplication algorithm. As shown in Figure 2, referencing several identical objects with a single master object allows the space that is normally occupied by the duplicate objects to be given back to the storage system.
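The catalog and shared pointers described above can be illustrated with a minimal in-memory sketch. It is not any vendor's implementation: the class name DedupStore, its method names, and the choice of SHA-256 as the catalog hash are assumptions made for illustration only. The sketch also trusts the hash without a byte-for-byte check and assumes each object name is written once.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one physical copy per unique content.

    Hypothetical illustration only; names and hash choice are assumptions.
    """

    def __init__(self):
        self.catalog = {}    # content hash -> block id of the master object
        self.blocks = {}     # block id -> bytes physically stored
        self.pointers = {}   # object name -> block id (the reference pointer)
        self.refcount = {}   # block id -> number of pointers sharing the block
        self._next_id = 0

    def write(self, name, data):
        if name in self.pointers:
            raise ValueError(f"{name!r} already written (sketch is write-once)")
        digest = hashlib.sha256(data).hexdigest()
        block_id = self.catalog.get(digest)
        if block_id is None:
            # First time this content is seen: store it as the master object.
            block_id = self._next_id
            self._next_id += 1
            self.blocks[block_id] = data
            self.catalog[digest] = block_id
            self.refcount[block_id] = 0
        # Duplicate or not, the new object is just a pointer to a block.
        self.pointers[name] = block_id
        self.refcount[block_id] += 1

    def read(self, name):
        return self.blocks[self.pointers[name]]

    def logical_bytes(self):
        """Space the objects would occupy without deduplication."""
        return sum(len(self.blocks[b]) for b in self.pointers.values())

    def physical_bytes(self):
        """Space actually consumed by unique blocks."""
        return sum(len(b) for b in self.blocks.values())


# Five identical objects, as in the example discussed in the text.
store = DedupStore()
payload = b"x" * 4096
for i in range(5):
    store.write(f"object-{i}", payload)

print(store.logical_bytes(), store.physical_bytes())
```

In this sketch the five objects remain individually readable, yet only one 4,096-byte block is held; the other four are pointers to it.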
Deduplication Design Considerations
Although every deduplication vendor must maintain some form of catalog and must support some form of block referencing, there is a surprising variety of implementations (and they all have subtle differences that allow them all to be patented). The following sections explain the methods that vendors use when designing deduplication.
Hashing
Data deduplication begins with a comparison of two data objects. It would be impractical (and very arduous) to scan an entire data volume for duplicate objects each time a new object is written to that volume. For that reason, deduplication vendors create small hash values for each new object, and store these values in a catalog.
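To make the cost difference concrete, the following sketch contrasts a full scan of a volume with a lookup in a hash catalog. The function names, the list-based "volume," and the use of SHA-1 are assumptions chosen for illustration; real products organize their catalogs in their own ways.

```python
import hashlib


def find_by_scan(volume, new_obj):
    """Compare the new object against every stored object (no catalog)."""
    comparisons = 0
    for idx, stored in enumerate(volume):
        comparisons += 1
        if stored == new_obj:
            return idx, comparisons
    return None, comparisons


def build_catalog(volume):
    """Map each object's hash to the index of its first occurrence."""
    catalog = {}
    for idx, stored in enumerate(volume):
        catalog.setdefault(hashlib.sha1(stored).digest(), idx)
    return catalog


def find_by_catalog(catalog, new_obj):
    """Hash the new object once and look the value up in the catalog."""
    return catalog.get(hashlib.sha1(new_obj).digest())


# A hypothetical volume of 1,000 distinct objects.
volume = [f"block-{i}".encode() * 100 for i in range(1000)]
catalog = build_catalog(volume)

incoming = volume[999]  # a new write that duplicates the last stored object
scan_hit, comparisons = find_by_scan(volume, incoming)
catalog_hit = find_by_catalog(catalog, incoming)

print(scan_hit, comparisons, catalog_hit)
```

The scan needs one full-object comparison per stored object in the worst case, whereas the catalog needs one hash computation and one lookup, regardless of how many objects the volume holds.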
A hash value, also called a digital fingerprint (and sometimes, loosely, a digital signature), is a small number that is generated from a longer string of data. A hash value is substantially smaller than the data object itself, and is generated by a mathematical formula in such a way that it is unlikely (although not impossible) for two nonidentical data objects to produce the same hash value.
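A hash value can be as simple as a parity calculation or as elaborate as a SHA-1 or MD5 cryptographic hash. The weaker the hash, the more likely it is that two different objects share a value, so a match identifies a candidate rather than a confirmed duplicate. In any case, once the hash values have been created, they can be easily compared and deduplication candidates can be identified.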
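The difference between a simple and an elaborate hash can be seen in a small sketch. Here "parity" is interpreted as a one-byte XOR of all bytes in the object, which is only one possible reading of the term; the function names and sample objects are hypothetical. The final step confirms candidates with a byte-for-byte comparison, a common safeguard that this sketch assumes rather than attributes to any particular product.

```python
import hashlib
from collections import defaultdict


def parity_hash(data):
    """One-byte XOR of all bytes: cheap to compute, but collides easily."""
    value = 0
    for byte in data:
        value ^= byte
    return value


def sha1_hash(data):
    """160-bit SHA-1 digest: larger and far less likely to collide."""
    return hashlib.sha1(data).hexdigest()


def duplicate_candidates(objects, hash_fn):
    """Group object names by hash value; groups of two or more are candidates."""
    groups = defaultdict(list)
    for name, data in objects.items():
        groups[hash_fn(data)].append(name)
    return [names for names in groups.values() if len(names) > 1]


def confirm(objects, candidates):
    """Keep only the candidates whose contents really are identical."""
    confirmed = []
    for names in candidates:
        by_content = defaultdict(list)
        for name in names:
            by_content[objects[name]].append(name)
        confirmed.extend(g for g in by_content.values() if len(g) > 1)
    return confirmed


# Hypothetical objects: "a" and "c" are identical; "b" has the same bytes
# in a different order, which the XOR parity cannot distinguish.
objects = {"a": b"ab", "b": b"ba", "c": b"ab", "d": b"zz"}

parity_candidates = duplicate_candidates(objects, parity_hash)
sha1_candidates = duplicate_candidates(objects, sha1_hash)

print(parity_candidates)                    # includes the false candidate "b"
print(sha1_candidates)                      # only the true duplicates
print(confirm(objects, parity_candidates))  # false candidate removed
```

The parity hash groups "b" with the true duplicates because XOR ignores byte order, while SHA-1 separates it. Verifying the contents removes the false candidate at the cost of reading the objects themselves.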