|
Everyone knows that the capacity of storage systems is going up at a breathtaking pace. In the last 10 years, NetApp has gone from shipping storage systems with tens of gigabytes to hundreds of terabytes, an astonishing 10,000-fold increase. Most businesses, however, find that their appetite for storage has grown even faster and, in addition to the costs of disk or tape to store all this data, data center space and power are increasingly expensive. Using storage as efficiently as possible is therefore a critical objective.

NetApp has long been an industry leader in efficient storage utilization, from its unique incremental-only Snapshot™ technology, which requires minimal disk space to store hundreds of Snapshot copies, to FlexVol® technology, which enables system administrators to expand and contract volumes on the fly. In May, NetApp announced a new deduplication technology that can significantly increase the amount of data stored in a set amount of disk space: Advanced Single Instance Storage (A-SIS) deduplication. This technology is available (at no charge!) for NetApp NearStore® R200 and NearStore on FAS systems.

Deduplication improves efficiency by finding identical blocks of data and replacing them with references to a single shared block. The same block of data can belong to several different files or LUNs, or it can appear repeatedly within the same file. A-SIS deduplication is an integral part of the NetApp WAFL file system, which manages all storage on NetApp FAS systems. As a result, deduplication works “behind the scenes,” regardless of what applications you run or how you access the data, and its overhead is low.

How much space can you save? It depends on the data set and the amount of duplication it contains. Here are a few examples of the savings that NetApp customers have seen:
- A global oil and gas company achieved a 35% space savings for its home directory storage.
- An investment management company reduced backup copies of its VMware images by 90%.
- A test and measurement manufacturer realized a 98% space savings on daily database backups.
How A-SIS Deduplication Works

At its heart, A-SIS deduplication relies on the time-honored computer science technique of reference counting. Previously, WAFL kept track only of whether a block was free or in use. With A-SIS deduplication, it also keeps track of how many uses there are. In the current implementation, a single WAFL block can be referenced up to 256 times in different files or within the same file. Files don’t “know” that they are sharing their data; bookkeeping within WAFL takes care of the details invisibly.

How does WAFL decide that two blocks can be shared? The answer is that for each block, it computes a “fingerprint,” which is a hash of the block’s data. Two blocks that have the same fingerprint are candidates for sharing.

When A-SIS deduplication is enabled on a volume, it computes a database of fingerprints for all of the in-use blocks in the volume (a process known as “gathering”). Once this initial setup is finished, the volume is ready for deduplication.

To avoid slowing down ordinary file operations, the search for duplicates is done as a separate batch process. As the file system gets updated during normal use, WAFL creates a log describing the changes to its data blocks. This log accumulates until one of the following occurs:
- The administrator issues a sis start command
- The next time specified in the sis config schedule arrives
- The changes recorded in the log exceed a predetermined threshold

Any of these events will trigger the deduplication process. Once the deduplication process is started, A-SIS sorts the log using the fingerprints of the changed blocks as a key, and then merges the sorted list with the fingerprint database file. Whenever the same fingerprint appears in both lists, there are possibly identical blocks that can be collapsed into one. In this case, WAFL can discard one of the blocks and replace it with a reference to the other block. Since the file system is changing all the time, this step can of course be taken only if both blocks are really still in use and contain the same data.
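
The batch process described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not WAFL's actual implementation: the BlockStore class, its fields, and the use of SHA-256 as the fingerprint are all hypothetical choices made for illustration (the article does not say which hash A-SIS uses), and overwriting or deleting files is not modeled. What it does show is the sequence from the text: log fingerprints on write, sort the log, merge it with the sorted fingerprint database, verify candidates byte for byte, and then share blocks through reference counts capped at 256.

```python
import hashlib
import heapq

MAX_REFS = 256  # per-block sharing limit quoted in the article


def fingerprint(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified fingerprint hash.
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """Hypothetical toy volume: files are lists of block ids."""

    def __init__(self):
        self.blocks = {}      # block id -> data
        self.refcount = {}    # block id -> number of references
        self.files = {}       # file name -> list of block ids
        self.fp_db = []       # sorted (fingerprint, block id) pairs
        self.change_log = []  # unsorted pairs for blocks written since last run
        self._next_id = 0

    def write(self, name, chunks):
        """Store a new file; every chunk gets its own block and a log entry."""
        ids = []
        for data in chunks:
            bid = self._next_id
            self._next_id += 1
            self.blocks[bid] = data
            self.refcount[bid] = 1
            self.change_log.append((fingerprint(data), bid))
            ids.append(bid)
        self.files[name] = ids

    def dedupe(self):
        """Sort the log, merge it with the database, share verified duplicates.

        Returns the number of blocks freed.
        """
        freed = 0
        new_db = []
        keeper = None  # (fingerprint, block id) of the block currently kept
        for fp, bid in heapq.merge(self.fp_db, sorted(self.change_log)):
            if self.refcount.get(bid, 0) == 0:
                continue  # guard: block is no longer in use
            shareable = (
                keeper is not None
                and keeper[0] == fp                               # same fingerprint
                and self.blocks[keeper[1]] == self.blocks[bid]    # same data
                and self.refcount[keeper[1]] + self.refcount[bid] <= MAX_REFS
            )
            if shareable:
                kid = keeper[1]
                # Point every reference to the duplicate at the kept block.
                for ids in self.files.values():
                    ids[:] = [kid if i == bid else i for i in ids]
                self.refcount[kid] += self.refcount.pop(bid)
                del self.blocks[bid]
                freed += 1
            else:
                keeper = (fp, bid)
                new_db.append(keeper)
        self.fp_db = new_db   # still sorted, because merge order is preserved
        self.change_log = []
        return freed


store = BlockStore()
store.write("a", [b"x" * 4096, b"y" * 4096])
store.write("b", [b"x" * 4096, b"x" * 4096])
print(store.dedupe())  # prints 2: three identical blocks collapse into one
```

The byte-for-byte comparison before sharing mirrors the article's point that a matching fingerprint only marks blocks as candidates; the sketch also starts a new kept block when the 256-reference limit would be exceeded, which is one plausible way to honor that limit rather than a documented behavior.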
|