Saturday, May 31, 2008

Deduplication anyone?

Deduplication is one of the hot topics in the storage world. With tons of vendors offering dedup products, and biggies like NetApp shipping integrated dedup with their NAS products, the competition is fierce. But how does deduplication help when the customer is actually trying to keep redundant data around to facilitate disaster recovery?
Effectively, dedup does the opposite of what RAID, replication and snapshots do, but it is not as contradictory as it sounds. Dedup simply takes a whole different approach to saving disk space on a file system. The granularity of dedup can be a file or a file system block. If done at file level, it deduplicates less data, since two files are rarely identical in their entirety. Blocks, on the other hand, are identical far more often, so block-level dedup definitely saves more space. We will discuss block-based dedup here.
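To make the granularity point concrete, here is a tiny Python sketch (the 4 KB block size and the in-memory "files" are just assumptions for illustration): two files that differ in a single byte get different file-level signatures, but almost all of their block-level signatures still match.

import hashlib

BLOCK_SIZE = 4096  # assumed block size, purely for illustration

def file_signature(data):
    # One signature for the whole file.
    return hashlib.md5(data).hexdigest()

def block_signatures(data):
    # One signature per fixed-size block.
    return [hashlib.md5(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

# Two 1 MB "files" that differ only in their last byte.
file_a = b"x" * (1024 * 1024)
file_b = file_a[:-1] + b"y"

print(file_signature(file_a) == file_signature(file_b))  # False: file-level dedup finds nothing
matches = sum(a == b for a, b in zip(block_signatures(file_a), block_signatures(file_b)))
print(matches)  # 255 of the 256 blocks are still identical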
- Dedup calculates a kind of identity signature for each block on the file system and stores it in a database. Blocks containing the same data generate the same signature and can therefore be detected as duplicates of an existing block. A cryptographic hash algorithm like MD5 or SHA-1 can be used to generate this signature.
- How these signatures are stored, i.e. the layout of the signature database, is highly platform dependent. The main requirement on this database is that it can return the list of blocks that generate the same signature (and hence hold the same data), much like a hash bucket storing all elements that hash to the same value. The first sketch after this list shows one way it can look in memory.
- Another important requirement is that dedup should work while the file system is online; taking the file system offline is not an option. So if a write comes in on a block whose signature is already stored in the database, the signature needs to be regenerated to keep up with the latest data. This definitely needs a trap in the IO path, but it should have minimal impact on IO performance.
- Keep in mind that dedup only deduplicates data blocks, not metadata. Metadata is duplicated on purpose and should not be touched.
- The very first time dedup is started, it generates signatures for all the data blocks in the file system. Once this pass is finished, all the information is in the database, and traversing it gives us the groups of blocks bearing the same data.
- For each such group, only one copy is kept; the other blocks are freed, and the metadata that referenced the freed blocks is updated to point to the surviving copy.
- One side effect of deduplication is that the next time a write comes in on a block, we need to know whether that block is being shared. If it is, we need to do a copy-on-write: allocate a new block, write the data to the new block, and update the metadata to point to it. This way, writes may have to bear an extra read penalty. The second sketch after this list shows the idea.
- For file systems that already do copy-on-write in the IO path, like WAFL and ZFS, this is not a problem. Other file systems would have to bear this penalty.
- Keeping the database in core is another problem. It either needs to be implemented as a cache, or it will occupy a lot of memory. This is very implementation specific.
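To make the signature database and the two passes above concrete, here is a rough Python sketch. None of this is how any particular product does it; the ToyVolume class, its read_block/repoint_metadata/free_block helpers and the in-memory dict standing in for the signature database are all assumptions made up for illustration.

import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size

class ToyVolume:
    # Stand-in for a file system: a flat list of data blocks plus a
    # block -> referencing-owners map that plays the role of metadata.
    def __init__(self, blocks):
        self.blocks = list(blocks)                           # block number -> data
        self.owners = {n: [n] for n in range(len(blocks))}   # block number -> references
        self.freed = set()

    def read_block(self, n):
        return self.blocks[n]

    def repoint_metadata(self, dup, keeper):
        # All references to the duplicate now point at the surviving copy.
        self.owners[keeper].extend(self.owners.pop(dup))

    def free_block(self, n):
        self.freed.add(n)

def signature(data):
    # Identity signature for a block; MD5 or SHA-1 works, as mentioned above.
    return hashlib.sha1(data).hexdigest()

def build_signature_db(vol):
    # First pass: hash every data block. The dict works like a hash bucket:
    # signature -> list of block numbers holding the same data.
    db = defaultdict(list)
    for n in range(len(vol.blocks)):
        db[signature(vol.read_block(n))].append(n)
    return db

def collapse_duplicates(vol, db):
    # Second pass: keep one block per signature, free the rest and
    # repoint their metadata at the surviving copy.
    for sig, blocks in db.items():
        keeper, dups = blocks[0], blocks[1:]
        for dup in dups:
            vol.repoint_metadata(dup, keeper)
            vol.free_block(dup)
        db[sig] = [keeper]

# Usage: three blocks, two of which hold the same data.
vol = ToyVolume([b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE])
db = build_signature_db(vol)
collapse_duplicates(vol, db)
print(vol.freed)       # {2} -- block 2 was a duplicate of block 0
print(vol.owners[0])   # [0, 2] -- both references now point at block 0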
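And here is a second, separate sketch of the write-path side: copy-on-write when a shared block is written, and a signature update when a privately owned block is overwritten in place. The per-block reference count and the ToyFS layout are again just assumptions for illustration; a real file system would consult its own metadata to decide whether a block is shared.

import hashlib

class ToyFS:
    # Assumed layout: logical (file, block no) pointers onto physical blocks,
    # with a reference count per physical block telling us whether it is shared.
    def __init__(self):
        self.phys = {}       # physical block id -> data
        self.refcount = {}   # physical block id -> number of logical references
        self.logical = {}    # (file, logical block no) -> physical block id
        self.sig_db = {}     # signature -> physical block id
        self.next_id = 0

    def _signature(self, data):
        return hashlib.sha1(data).hexdigest()

    def _alloc(self, data):
        n, self.next_id = self.next_id, self.next_id + 1
        self.phys[n] = data
        self.refcount[n] = 1
        self.sig_db[self._signature(data)] = n
        return n

    def write(self, file, block_no, data):
        key = (file, block_no)
        target = self.logical.get(key)
        if target is not None and self.refcount[target] > 1:
            # The block is shared after dedup: copy-on-write.
            self.refcount[target] -= 1
            target = None
        if target is None:
            # Allocate a fresh block and point this file's metadata at it.
            self.logical[key] = self._alloc(data)
        else:
            # Sole owner: overwrite in place, but regenerate the signature
            # so the database keeps up with the latest data.
            self.sig_db.pop(self._signature(self.phys[target]), None)
            self.phys[target] = data
            self.sig_db[self._signature(data)] = target

    def share(self, src_key, dst_key):
        # Simulate the state after a dedup pass: two logical blocks
        # pointing at the same physical block.
        n = self.logical[src_key]
        self.logical[dst_key] = n
        self.refcount[n] += 1

# Usage: two files share a block; writing through one triggers copy-on-write.
fs = ToyFS()
fs.write("a.txt", 0, b"hello")
fs.share(("a.txt", 0), ("b.txt", 0))
fs.write("b.txt", 0, b"world")
print(fs.logical[("a.txt", 0)] != fs.logical[("b.txt", 0)])  # True: the blocks diverged
print(fs.phys[fs.logical[("a.txt", 0)]])                     # b'hello' is untouched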
Any more thoughts?
Update: Curtis Preston explains this in a very simple manner. Have a look at this - http://www.backupcentral.com/content/view/175/47/