 

Why is RAID not recommended for Hadoop HDFS setups?

Tags: hadoop, hdfs

Various websites (like Hortonworks) recommend not configuring RAID for HDFS setups, mainly for two reasons:

  1. Speed: throughput is limited to the slowest disk in the array (JBOD performs better).
  2. Reliability: HDFS already provides redundancy through block replication.

It is recommended to use RAID on the NameNode.

But what about implementing RAID on each DataNode's storage disks?

Asked Jan 16 '15 by aditya ambre


People also ask

Should we use RAID with Hadoop?

HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for datanode storage (although RAID is recommended for the namenode's disks to protect against corruption of its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.

Which HDFS is recommended for RAID?

Since the namenode is a single point of failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes.

What is RAID in Hadoop?

RAID (redundant array of independent disks) is a way of storing the same data in different places (thus, redundantly) on multiple hard disks. The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.

Which type of RAID does not provide redundancy to improve reliability?

RAID 0 is the most affordable and easiest-to-set-up striped disk configuration, but it includes no redundancy, fault tolerance, or parity. Hence, a problem on any of the disks in the array can result in complete data loss.


1 Answer

RAID is used for two purposes. Depending on the RAID configuration you can get:

  1. Better performance: a read of one file can be spread over multiple disks, or different disks can be used transparently to read multiple files from the same file system.
  2. Fault tolerance: data is replicated or stored with parity bits across multiple disks; if a disk fails, its contents can be restored from a replica or recomputed from the parity bits (see the sketch after this list).
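
To make the parity idea concrete, here is a minimal, hypothetical sketch (plain Java, not Hadoop code) of how single-disk recovery works in parity-based RAID levels: the parity block is the XOR of the data blocks, so any one lost block can be recomputed from the survivors.

    public class ParityDemo {
        // XOR two equally sized blocks byte by byte.
        static byte[] xor(byte[] a, byte[] b) {
            byte[] out = new byte[a.length];
            for (int i = 0; i < a.length; i++) {
                out[i] = (byte) (a[i] ^ b[i]);
            }
            return out;
        }

        public static void main(String[] args) {
            byte[] disk1 = {1, 2, 3, 4};
            byte[] disk2 = {5, 6, 7, 8};
            byte[] parity = xor(disk1, disk2); // stored on the parity disk

            // Simulate losing disk1: rebuild it from disk2 and the parity block.
            byte[] recovered = xor(parity, disk2);
            System.out.println(java.util.Arrays.equals(recovered, disk1)); // true
        }
    }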

HDFS has similar mechanisms built in software. It splits files into chunks (so-called file blocks) which are replicated across multiple datanodes and stored on their local filesystems. Usually, datanodes have multiple disks which are mounted individually (JBOD), and a datanode distributes its file blocks across all of its disks / local filesystems (a sample configuration follows the list below).

This ensures:

  1. Fault-tolerance: If a disk or node goes down, other replicas are available on different data nodes and disks.
  2. High sequential read/write performance: By splitting a file into multiple chunks and storing them on different nodes (and different disks), a file can be read in parallel by accessing multiple disks (on different nodes) concurrently. Each disk can read data at its full bandwidth, and its read operations do not interfere with other disks. If the cluster is well utilized, all disks will be spinning at full speed, delivering maximum sequential read performance.
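
As an illustration of the JBOD layout mentioned above, here is a minimal, hypothetical hdfs-site.xml sketch (the /mnt/disk* mount paths are made up for this example; the properties go inside the <configuration> element):

    <property>
      <!-- Each JBOD disk is mounted and listed separately; the datanode
           spreads new file blocks across all listed directories. -->
      <name>dfs.datanode.data.dir</name>
      <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data</value>
    </property>
    <property>
      <!-- Redundancy is handled by HDFS itself via block replication. -->
      <name>dfs.replication</name>
      <value>3</value>
    </property>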

Since HDFS takes care of fault tolerance and "striped" reading itself, there is no need to use RAID underneath it. Using RAID would only be more expensive, offer less usable storage, and, depending on the concrete RAID configuration, also be slower.
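
One way to observe this block distribution is Hadoop's public FileSystem API; a minimal sketch (a reachable cluster and the hypothetical file /data/example.txt are assumed) that prints which datanodes hold replicas of each block:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // /data/example.txt is a hypothetical file for this sketch.
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));

            // One BlockLocation per file block, listing the datanodes
            // that hold a replica of that block.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }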

Since the namenode is a single point of failure in HDFS, it requires a more reliable hardware setup. Therefore, the use of RAID is recommended on namenodes.
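
In addition to (or alongside) RAID, HDFS can also mirror the namenode's metadata in software by writing it to several directories; a minimal hdfs-site.xml sketch with hypothetical paths:

    <property>
      <!-- The namenode writes its metadata redundantly to every directory
           listed here, e.g. a local disk plus a remote NFS mount. -->
      <name>dfs.namenode.name.dir</name>
      <value>/mnt/disk1/hdfs/name,/mnt/nfs/hdfs/name</value>
    </property>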

Answered Sep 30 '22 by Fabian Hueske