Saturday, May 31, 2008

Deduplication anyone?

Deduplication is one of the hot topics in the storage world. With tons of vendors offering de-dup products and biggies like Netapp offering integrated de-dup solutions with their NAS products, the competition is fierce. But how does deduplication help when the consumer is actually trying to keep redundant data around in order to facilitate disaster recovery?
Effectively, de-dup does the opposite of what RAID, replication, and snapshots do, but it is not as contradictory as it sounds: de-dup simply takes a different approach in order to save disk space on a file system. The granularity of de-dup can be a file or a file system block. Done at the file level, it deduplicates less data, since two files are rarely entirely identical. Blocks, on the other hand, are identical far more often, so block-level de-dup definitely saves more space. We will discuss block-based de-dup here.
- De-dup calculates an identity signature for each block on the file system and stores it in a database. Blocks containing the same data generate the same signature and can therefore be detected as duplicates of an existing block. A cryptographic hash algorithm like MD5 or SHA-1 can be used to generate this signature.
- How these signatures are stored, i.e. the layout of the signature database, is highly platform dependent. The main requirement of this database is to return the list of blocks that generate the same signature (and hence hold the same data), much like a hash bucket that stores all elements hashing to the same value (see the first sketch after this list).
- Another important requirement is that de-dup should work while the file system is online; taking the file system offline is not an option. Hence, when a write lands on a block whose signature is already in the database, the signature needs to be regenerated so that the database keeps up with the latest data. This requires a trap in the IO path, but it should have minimal impact on IO performance.
- Keep in mind that de-dup only deduplicates data blocks, not metadata. Metadata is duplicated on purpose and should not be touched.
- The very first time de-dup is started, it generates signatures for all the data blocks in the file system. Once this pass is finished, all the information is in the database, and traversing it gives us the lists of blocks bearing the same data.
- For each such list, only one copy needs to be kept; the other blocks are freed, and the metadata that referenced the freed blocks is updated to point to the surviving copy.
- One side effect of deduplication is that the next time a write comes in on some block, we need to know whether that block is sharing its data with something else. If it is, we need to do a copy-on-write: allocate a new block, write the data to the new block, and update the metadata to point to it. As a result, such writes may have to bear an extra read penalty (see the second sketch after this list).
- For file systems that already have copy-on-write in the IO path, like WAFL and ZFS, this is not a problem. Other file systems have to bear this penalty.
- Keeping the database in core is another problem. It either needs to be implemented as a cache, or it will occupy a lot of memory. This is very implementation specific.
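
To make the signature pass and the hash-bucket database a bit more concrete, here is a minimal Python sketch under some simplifying assumptions: the "device" is just a file read in fixed 4 KB blocks, SHA-1 is the signature, and the whole signature-to-blocks map lives in an in-memory dictionary rather than a real on-disk database. The block size, file names, and function names are made up for illustration; this is not any particular product's implementation.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed file system block size


def scan_blocks(device_path):
    """Initial de-dup pass: compute a signature for every data block and
    bucket block numbers by signature (same signature => same data,
    modulo hash collisions)."""
    signature_db = defaultdict(list)  # signature -> list of block numbers
    with open(device_path, 'rb') as dev:
        block_no = 0
        while True:
            block = dev.read(BLOCK_SIZE)
            if not block:
                break
            sig = hashlib.sha1(block).hexdigest()  # the identity signature
            signature_db[sig].append(block_no)
            block_no += 1
    return signature_db


def find_duplicates(signature_db):
    """Traverse the database and return the lists of blocks bearing the
    same data: one block per list is kept, the rest can be freed."""
    return [blocks for blocks in signature_db.values() if len(blocks) > 1]


if __name__ == '__main__':
    # Tiny demo: a scratch image with three identical blocks and one unique one.
    with open('demo.img', 'wb') as f:
        f.write(b'A' * BLOCK_SIZE * 3 + b'B' * BLOCK_SIZE)

    db = scan_blocks('demo.img')
    for dup_set in find_duplicates(db):
        keep, free = dup_set[0], dup_set[1:]
        print('keep block %d, free blocks %s' % (keep, free))
```

Note that this sketch happily hashes every block it sees; a real implementation would skip metadata blocks, as pointed out above, and would not keep the whole map in memory.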
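Similarly, here is a toy, purely in-memory model of what the write-path handling could look like once blocks are shared. Again, this is only an illustrative sketch, not any real file system's code: an incoming write never overwrites a shared block in place; it either points at an existing identical block or allocates a fresh one, copy-on-write style, and the signature bookkeeping is updated as part of the write.

```python
import hashlib


class ToyDedupFS:
    """In-memory toy model of a deduplicated block store, illustrating
    the copy-on-write trap in the write path."""

    def __init__(self):
        self.blocks = {}        # block number -> data bytes
        self.refs = {}          # block number -> reference count
        self.pointers = {}      # (file_id, block index) -> block number
        self.by_sig = {}        # signature -> block number
        self.next_block = 0

    def _signature(self, data):
        return hashlib.sha1(data).hexdigest()

    def write(self, file_id, index, data):
        sig = self._signature(data)

        # If an identical block already exists, just point at it.
        if sig in self.by_sig:
            new_block = self.by_sig[sig]
            self.refs[new_block] += 1
        else:
            new_block = self.next_block
            self.next_block += 1
            self.blocks[new_block] = data
            self.refs[new_block] = 1
            self.by_sig[sig] = new_block

        # Copy-on-write: never overwrite the old block in place; just drop
        # our reference to it, freeing it only if nobody else shares it.
        old_block = self.pointers.get((file_id, index))
        if old_block is not None:
            self.refs[old_block] -= 1
            if self.refs[old_block] == 0:
                old_sig = self._signature(self.blocks.pop(old_block))
                del self.refs[old_block]
                if self.by_sig.get(old_sig) == old_block:
                    del self.by_sig[old_sig]

        self.pointers[(file_id, index)] = new_block

    def read(self, file_id, index):
        return self.blocks[self.pointers[(file_id, index)]]


if __name__ == '__main__':
    fs = ToyDedupFS()
    fs.write('a.txt', 0, b'hello' * 100)
    fs.write('b.txt', 0, b'hello' * 100)   # deduplicated against a.txt
    fs.write('b.txt', 0, b'world' * 100)   # COW: a.txt is untouched
    print(fs.read('a.txt', 0)[:5], fs.read('b.txt', 0)[:5])
```

The point of the toy is only to show where the extra work lands: the reference counting and signature updates happen on the write path, which is exactly why the trap has to be cheap.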
Any more thoughts?
Update: Curtis Preston explains this in a very simple manner. Have a look at this - http://www.backupcentral.com/content/view/175/47/

4 comments:

Atul said...

Hmm, the technology is interesting. But I am not sure if this will be a driving factor for buying storage. Storage is getting cheaper day by day. Take the example of the Sun Fire X4500 server, aka Thumper. It packs 24 TB of storage in a 2U chassis and costs USD 24,000, i.e. approximately $1/GB, which is dirt cheap. Combine it with Solaris 10 and you have a ZFS/CIFS/iSCSI server for no extra cost (beyond the hardware). I bet Netapp would be charging tons of money for a de-dup storage stack.

Anand said...

The benefit is that it reduces the space utilized in your production data center and makes room for more. Of course storage is getting cheaper by the day. But if you are maintaining a farm of servers, getting such a benefit is icing on the cake. Netapp claims reductions of up to 20% of the file system size. And sure, Netapp would be charging a lot of money. Think of ZFS: it could also implement deduplication and claim space savings, since COW is inherent. Just as taking a snapshot does not affect ZFS performance, neither would de-dup.

Anonymous said...

In spite of the usual argument of "we shouldn't spend time thinking about this because disks are cheap and getting cheaper all the time" (the same argument that led Microsoft to ignore the performance problems with Vista), I think there are plenty of occasions where this will result in DRAMATIC disk space savings that outweigh any concern that the gains won't justify the cost of implementing it versus the cost of just buying more disks.

For instance, imagine taking VMware Virtual Consolidated Backups of VMs using the fullvm method. There's no way to natively generate differentials, so each backup is a full copy of the entire VM including disks (which generally only change a few percent per day).

In this case, backing up a reasonably standard 10GB server image without dedupe would take around 10GB per backup. With dedupe, the first backup would take 10GB, and subsequent backups would take between 50MB and 100MB (representing the small percentage of blocks that have changed).

With 6-hourly backups on a medium-sized cluster of, say, 25 VMs of this average size, dedupe could save you close to a terabyte a day (which at the $1/GB figure quoted by Atul is a grand a day, or a million bucks every three years...)

Anand said...

The blog world is full of people claiming de-dup disk space wins, and in VMware environments the wins are particularly high. The figures you have quoted are another example of tremendous space savings. These days Netapp is aggressively pushing de-dup wherever possible. If you take a look at the new product from the startup "Greenbytes", they have implemented de-dup for ZFS, which is another addition to the ZFS portfolio. IMHO, a native implementation of dedup would be beneficial for the lifetime of your data center.