Google File System Paper -
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
What is lazy space allocation and how is it going to solve the internal fragmentation problem?
A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file ... We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times.
What is staggering of application start times, and how does it keep chunkservers from becoming hot spots?
A chunk consists of 64 KB blocks, and each block has a 32-bit checksum. Chunks are stored on Linux file systems and are replicated across multiple chunkservers; a user may change the number of replicas from the default of three to any desired value.
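The per-block checksumming described above can be sketched as follows. The paper does not say which 32-bit checksum function GFS uses, so CRC32 here is an assumption for illustration:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as described in the paper

def block_checksums(data: bytes) -> list[int]:
    """Split chunk data into 64 KB blocks and compute a 32-bit
    checksum (CRC32, assumed for illustration) for each block."""
    return [
        zlib.crc32(data[i:i + BLOCK_SIZE])
        for i in range(0, len(data), BLOCK_SIZE)
    ]

# A 200 KB payload spans four blocks (three full, one partial).
checksums = block_checksums(b"x" * (200 * 1024))
print(len(checksums))  # 4
```

Keeping one small checksum per 64 KB block lets a chunkserver verify only the blocks touched by a read, rather than re-checksumming the whole 64 MB chunk.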
All chunk metadata is stored in the master's memory to reduce latency and increase throughput. Larger chunks mean less metadata, and less metadata means faster metadata load times.
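A quick back-of-envelope calculation shows why the large chunk size matters for in-memory metadata. For a 1 TB file, a 64 MB chunk size needs only thousands of chunk records, whereas a typical 4 KB filesystem block size would need hundreds of millions (sizes here are for comparison, not from the paper):

```python
FILE_SIZE = 1 << 40       # a hypothetical 1 TB file
CHUNK_64MB = 64 << 20     # GFS chunk size
BLOCK_4KB = 4 << 10       # typical filesystem block size, for contrast

chunks_64mb = FILE_SIZE // CHUNK_64MB  # records the master must hold
blocks_4kb = FILE_SIZE // BLOCK_4KB

print(chunks_64mb)  # 16384
print(blocks_4kb)   # 268435456
```

At roughly 16 thousand records versus 268 million, the 64 MB chunk size cuts per-file metadata by a factor of 16,384, which is what makes keeping everything in the master's memory feasible.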
Files stored in GFS are very large, even to PB, there is no such big disk to store it. Instead of mutable size, fixed-size chunks are easy for indexing and querying. Actually, the size of each chunk is not small, around 64MB, also a big size, in this way, it can reduce the number of metadata needed by GFS.
– Advantages:
  » Reduces interaction w/ master.
  » Reduces metadata stored on master.
– Disadvantages:
  » Small files may become hotspots.

Master: a single node maintains all of the metadata, such as the namespace, ACLs, the mapping from files to chunks, and the current locations of chunks.
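The master's metadata described above can be sketched with two in-memory maps: the namespace (file path to ordered chunk handles) and the chunk table (handle to replica locations). This is a hypothetical, minimal sketch; all names and structures here are illustrative, not GFS's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    chunk_handle: int
    replicas: list  # chunkserver addresses holding this chunk (illustrative)

@dataclass
class MasterMetadata:
    # namespace: file path -> ordered list of chunk handles
    namespace: dict = field(default_factory=dict)
    # chunk handle -> replica locations for that chunk
    chunks: dict = field(default_factory=dict)

    def lookup(self, path: str, chunk_index: int) -> ChunkInfo:
        """Resolve (file, chunk index) to the chunk's replica locations,
        as a client would ask the master to do."""
        handle = self.namespace[path][chunk_index]
        return self.chunks[handle]

meta = MasterMetadata()
meta.namespace["/logs/web.log"] = [42]
meta.chunks[42] = ChunkInfo(42, ["cs1:7077", "cs2:7077", "cs3:7077"])
print(meta.lookup("/logs/web.log", 0).replicas[0])  # cs1:7077
```

Because a client caches the handle and replica list after one lookup, it can read the whole 64 MB chunk directly from chunkservers without talking to the master again, which is the "reduces interaction w/ master" advantage above.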
Lazy space allocation means the filesystem doesn't actually allocate disk space for a file until data is written to it. Such files are commonly referred to as sparse files. For example, if only the first 2 MB of a 64 MB chunk file is used, only 2 MB is actually consumed on disk.
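The sparse-file behavior is easy to observe on Linux: a file can be extended to 64 MB with `truncate`, yet only the written bytes occupy disk blocks. A minimal sketch (the 2 MB/64 MB figures mirror the example above; results assume a filesystem with sparse-file support, such as ext4 or xfs):

```python
import os
import tempfile

CHUNK_SIZE = 64 * 1024 * 1024  # a 64 MB chunk file

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(CHUNK_SIZE)             # logical size becomes 64 MB (a hole)
    f.write(b"x" * (2 * 1024 * 1024))  # but only 2 MB of real data
    path = f.name

st = os.stat(path)
print(st.st_size)          # 67108864 -- the logical (apparent) size
print(st.st_blocks * 512)  # actual bytes allocated on disk, roughly 2 MB
os.unlink(path)
```

The gap between `st_size` and `st_blocks * 512` is exactly the internal fragmentation that lazy allocation avoids: the unused tail of the chunk costs nothing until it is written.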
Staggering application start times just means not starting everything at once. If every application needs to read a few configuration files stored in GFS upon startup, and they all start at the same time, the chunkservers holding those files will be overloaded. Spreading out the start times alleviates this.
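One common way to spread out start times is random jitter before each launch. The paper does not describe the batch-queue system's actual policy, so the function below is purely an illustrative sketch:

```python
import random

def staggered_start_delay(max_jitter_s: float = 30.0) -> float:
    """Pick a random delay (hypothetical policy) so applications don't
    all hit GFS for their configuration files at the same instant.
    A real launcher would sleep for this long before starting the app."""
    return random.uniform(0, max_jitter_s)

# 100 applications scheduled to start: their reads now spread over ~30 s
# instead of arriving simultaneously at the same few chunkservers.
delays = [staggered_start_delay() for _ in range(100)]
print(min(delays) >= 0 and max(delays) <= 30.0)  # True
```

Combined with the higher replication factor the paper mentions, this spreads the same total load over both more time and more chunkservers.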