Google File System
The Google File System (GFS) is a scalable distributed file system designed and developed by Google for distributed data intensive applications. GFS was born out of the need to meet the rapidly growing data processing needs of Google. The design of the GFS shared many of the same goals (e.g. concurrency, scalability, availability and reliability) as previous distributed file systems, but differed from earlier file systems to meet the demands of application workloads and technological environment at Google. Almost a decade later, most of Google’s applications rely on GFS to store and process data. Although Google has not published the GFS code, the design of GFS is discussed in detail, in a paper (titled “The Google File System”) published by Google engineers. To explore more about the design of GFS, one needs to read the original paper present at http://labs.google.com/papers/gfs.html.
- Google uses thousands of storage machines built from cheap commodity hardware (typical Linux machines) and any of these machines could fail and never recover from failures. Hence the file system had to incorporate monitoring, error detection and recovery mechanisms.
- Google’s Web applications generate and consume files of sizes varying from few hundred megabytes to several terabytes. Files with small size were almost non-existent. For e.g. web crawlers employed by Google’s Search Engine, continuously scan internet and store information related to millions of web pages. Hence they needed a file system which could handle huge blocks of data.
- Most of operations on the files involved either large streaming reads (very few random reads) or large sequential appends to the end of file. Random writes within the file were almost non-existent. In large streaming reads, clients typically read 1MB or more of data. Overall they needed a file system optimized for reading and writing huge chunks of data in streaming mode.
- The files were often used as producer-consumer queues, with multiple producers writing into the same file. Hence the file system had to provide APIs to support concurrent appends to files, with minimum synchronization overhead.
- During early days, Google had only search engine, for which huge amount of data was processed in the background (e.g. generating inverted indexes for web pages). At that time, they did not have any user facing web applications like Gmail or Youtube which are sensitive to latency (i.e. require low latency). Hence they needed a file system tailored for batch oriented operations (higher throughput) than for latency oriented operations (lower latency)
- A master server for maintaining the file system metadata. There would be one active master per cluster.
- Chunk servers for storing the actual chunks of data. Each file is divided into several chunks and each chunk is of size 64 MB. And each chunk is replicated at least 3 times. For e.g. let’s assume that a file “a.txt” has chunks c1, c2 and c3. Each of these chunks will have at least 2 more replicas, e.g. c1’ and c1’’, c2’ and c2’’, c3’ and c3’’. These chunks are usually placed on different machines in order to ensure availability in case of machine failures. Users can override this default replication factor of 3 and specify their own replication factor for each of the files.
- GFS client, which will be used by applications for reading, writing or deleting data. GFS client provides APIs like create, delete, open, close, read and write. Apart from these standard APIs, GFS provides snapshot API for creating replicas and record append API for concurrent appends to a same file.
Master maintains the file system metadata. For e.g. file namespaces, mapping between file names to chunk locations. Chunk servers send regular heartbeat messages to the master indicating their health and changes in the chunk status (if any). For e.g. a chunk could get corrupted (this is determined using a checksum) or could have outdated data (outdated chunk is determined using chunk version number). Whenever a chunk server dies, all the chunks present on that chunk server need to be re-replicated. Master comes to know about the death of a chunk server if it does not receive a heartbeat message within a configured interval. Master places chunks in such a way that the data is distributed evenly across all the machines within a cluster.
Certain metadata, e.g. file namespaces and file to chunk mapping, is kept in persistent state on the master’s disk. In case of a crash, master recovers by reading the metadata stored on the disk. This data is also replicated to shadow masters at regular intervals of time. If master machine itself crashes and it is not possible to restart the master, then one of the shadow masters takes over.GFS implements lazy garbage collection mechanism for removing the deleted data. Deleted files are not removed immediately. They are garbage collected at a later point of time. This helps in undoing accidental deletes, which could be costly considering the size of data.
Leasing mechanism is used to maintain data consistently across all the chunks. The GFS client has to obtain a lease on a chunk to do any data mutation on that chunk. Till the lease on that chunk expires, other clients cannot access that chunk for any data mutation. Any mutation to a chunk, is replicated to all the chunk replicas and the mutations are applied in a consistent order to all the replicas. For e.g. if data blocks A, B and C are written to primary chunk c1, then secondary chunks c1’ and c1’’ also get the data in the same order, i.e. A, B and C. This ensures data consistency on all the chunks.
Application code is linked with GFS client library. For any operation, client first contacts the master for getting the chunk location and lease on that chunk (in case of mutations). Once the chunk location is obtained, the client directly contacts the chunk servers to read, write or delete the data (by bypassing the master).Google’s publications on GFS and MapReduce (a programming model for distributed data processing) have inspired an open source project named Hadoop (http://wiki.apache.org/hadoop/HDFS?action=show&redirect=DFS). If you want to explore Hadoop, check: http://hadoop.apache.org/.
Exponential growth of internet and proportionate growth in data has exposed some of the drawbacks of GFS. This has prompted Google to rethink on some of the initial design decisions. Some of the drawbacks of earlier system are:- It was designed mainly for batch centric applications, i.e. the applications which need to process huge amount of data in batch mode and are not sensitive to latency. With Google Search Engine becoming immensely popular, Google added other applications like Gmail, Youtube etc, which are sensitive to latency. Hence if these applications were to use GFS, certain adjustments had to be made to the file system.
- To simplify the design, GFS was implemented with a single master node, which maintains the file system metadata for entire cluster. By initial estimates, GFS was expected to handle few million files with sizes up to few terabytes. But the demands for data grew from terabytes to petabytes. This increased the size of metadata maintained by the single master. This in turn increased the processing time at master node and limited the number of client requests that a master can handle within a specified period of time.
Over the years, some of these drawbacks have been managed by tweaking the file system or tweaking the applications which used this file system. Engineers at Google have been working on a new distributed master system (as opposed to single master design) to solve some of the problems of GFS. If you are interested in knowing how the file system has evolved over the years, you can check this recently published ACM link: http://queue.acm.org/detail.cfm?id=1594206.

