Saturday, December 22, 2007

How about a cluster file system???

I was just speculating about what exactly I would need if I had to write a cluster file system. First, let us throw some light on the requirements: the file system needs to provide reliable storage, support many simultaneous reads and writes, deliver good performance, and, most importantly, be fault tolerant.
A cluster-friendly file system basically needs a fine-grained and efficient Distributed Lock Manager (DLM), support for protocols that offer byte-range locking, namely NFSv4 and CIFS, and moreover a cluster protocol to coordinate operations across nodes. Usually this cluster protocol runs over high-speed interconnects like InfiniBand, Fibre Channel, or Gigabit Ethernet (still slower) so as to enhance the throughput.
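Just to make the byte-range locking idea concrete, here is a tiny sketch (my own made-up Python, not any real DLM's API) of a range lock request and the conflict rule a lock manager would apply:

    from dataclasses import dataclass

    SHARED, EXCLUSIVE = "shared", "exclusive"

    @dataclass
    class RangeLock:
        path: str       # file being locked
        offset: int     # start of the locked byte range
        length: int     # number of bytes covered
        mode: str       # SHARED (readers) or EXCLUSIVE (writers)

    def conflicts(a: RangeLock, b: RangeLock) -> bool:
        """Two locks conflict if they are on the same file, their byte
        ranges overlap, and at least one of them is exclusive."""
        if a.path != b.path:
            return False
        overlap = a.offset < b.offset + b.length and b.offset < a.offset + a.length
        return overlap and EXCLUSIVE in (a.mode, b.mode)

The nice property is that two writers touching disjoint ranges of the same file never contend, which is exactly what a database-style workload on the cluster needs.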
A cluster is built for high performance computing, so if the I/O throughput is not good, it serves no purpose to build one. The way I see it, a cluster is a bunch of machines working together under common software to gain higher throughput, presenting itself as one entity as a whole. This fits a collection of commodity computers perfectly; I am not talking about custom-built rack-mounted clusters.
Basically, I/O throughput is enhanced by striping the data across all the nodes in the cluster. This way we can do parallel reads/writes and read-aheads. Even though this increases the amount of metadata that needs to be maintained per file, it improves throughput at a much larger scale. Most of the existing cluster file systems, including the Google File System and Lustre, use the same technique.
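As a toy illustration of striping (my own sketch; GFS and Lustre have far more elaborate layouts), here is how a logical file offset could map to a node and a local offset with round-robin placement. The 64 KB stripe size and 4 nodes are just assumed numbers:

    STRIPE_SIZE = 64 * 1024   # bytes per stripe unit (assumed)
    NUM_NODES = 4             # nodes the file is striped across (assumed)

    def locate(offset: int) -> tuple[int, int]:
        """Map a logical file offset to (node index, offset on that node)."""
        stripe = offset // STRIPE_SIZE              # which stripe unit
        node = stripe % NUM_NODES                   # round-robin placement
        local = (stripe // NUM_NODES) * STRIPE_SIZE + offset % STRIPE_SIZE
        return node, local

Consecutive stripe units land on different nodes, so one large sequential read can be served by all four nodes in parallel.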
A good, lightweight Distributed Lock Manager would help minimize how long locks are held on files. Major file operations (create, delete, rename) lock the parent directory, so fine-grained locking helps keep contention at the lowest possible level.
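Here is roughly what I mean by finer granularity, as a single-machine stand-in for a real distributed lock manager: key the lock on the individual directory entry instead of the whole directory, so two creates in the same directory don't serialize (the names here are hypothetical):

    import threading
    from collections import defaultdict

    class EntryLocks:
        """Locks keyed per (directory, name) instead of per directory,
        so operations on different entries don't contend."""
        def __init__(self):
            self._guard = threading.Lock()
            self._locks = defaultdict(threading.Lock)

        def lock(self, directory: str, name: str) -> threading.Lock:
            with self._guard:              # protect the lock table itself
                return self._locks[(directory, name)]

    locks = EntryLocks()
    with locks.lock("/movies", "frame_0042.dat"):
        pass  # create/rename/delete just this one entry

A real DLM would layer lease expiry and node-failure recovery on top of this, but the granularity argument is the same.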
But making the cluster fault tolerant is one heck of a task. Say you are watching a movie off the cluster: if the next frame is unavailable, the player halts waiting for it. The catch is to deliver the data within the limits of the application's timeout. If the backend is built on RAID, the application has to wait until the RAID rebuild is complete, and that time is far too long; the application will time out for sure. So how do we solve this problem? There is no definite answer here. The only thing we can do is take a top-down approach and build a framework that tolerates data loss. These strategies include RAID (to regenerate data after a loss), CRCs to detect corruption, etc. Still not foolproof :(
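For the CRC part, a minimal sketch of keeping a checksum next to each block and verifying it on read; zlib.crc32 is just a convenient stand-in for whatever checksum a real file system would pick:

    import zlib

    def write_block(data: bytes) -> tuple[bytes, int]:
        """Store the block together with its CRC32 checksum."""
        return data, zlib.crc32(data)

    def read_block(data: bytes, stored_crc: int) -> bytes:
        """Verify the checksum on read; a mismatch means silent corruption,
        and the block must be regenerated from parity or a replica."""
        if zlib.crc32(data) != stored_crc:
            raise IOError("CRC mismatch: block is corrupt")
        return data

    block, crc = write_block(b"frame 42 pixel data")
    assert read_block(block, crc) == b"frame 42 pixel data"

Note that the CRC only detects the loss; getting the data back within the application's timeout is still the hard part.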
Managing data losses and outages is a tricky question with no complete answer. If the cluster is serving a data-intensive application with a hard uptime requirement, keeping a full copy of the data would probably serve the purpose. This is a hell of a lot less space efficient, but it saves you time for sure. The other option is going the RAID way: RAID is built to be space efficient, but RAID rebuilds are really slow in practice.
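To see the trade-off in miniature: mirroring doubles the raw storage, while XOR parity (the idea behind RAID-5) costs only one extra block per stripe, but recovering a lost block means reading every surviving block. A toy reconstruction:

    from functools import reduce

    def parity(blocks: list[bytes]) -> bytes:
        """XOR equal-sized blocks together; over the data blocks this
        yields the parity block."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
        """Recover a single lost block: XOR the parity with every survivor."""
        return parity(surviving + [parity_block])

    stripe = [b"AAAA", b"BBBB", b"CCCC"]
    p = parity(stripe)                 # parity stored alongside the stripe
    lost = stripe.pop(1)               # one block is lost
    assert rebuild(stripe, p) == lost  # XOR brings it back

One parity block tolerates only a single failure per stripe, and the rebuild touches every surviving disk, so the space you save with RAID comes back to bite you as rebuild time.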