GlusterFS or Ceph as backend for Hadoop

Has anyone tried using GlusterFS or Ceph as the storage backend for Hadoop? I don't mean just using a plugin to stitch things together. Is the performance better than HDFS itself, and is it suitable for production use?

Also, is it really a good idea to merge object storage and Hadoop HDFS storage into a single storage system, or is it better to keep them separate?

asked Dec 02 '15 11:12

Shengjie

2 Answers

I have used GlusterFS before. It has some nice features, but in the end I chose HDFS as the distributed file system for Hadoop.

The nice thing about GlusterFS is that it doesn't require master nodes. Every node in the cluster is equal, so there is no single point of failure. Another thing I find interesting is the glusterfs-client module (http://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume): when you want to store a file in GlusterFS, you don't need to work with the GlusterFS APIs. You just copy the file to the volume mounted through glusterfs-client and the job is done.
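To illustrate the "just copy the file" point, here is a minimal Python sketch. It assumes the volume is already mounted somewhere on the local machine; the function name `store_file` and the mount path in the usage note are made up for illustration and are not part of GlusterFS. Since the mount behaves like an ordinary directory, plain file operations are enough.

```python
import shutil
from pathlib import Path


def store_file(local_file, mount_point):
    """Copy local_file into a directory where a GlusterFS volume is mounted.

    mount_point is an assumption: whatever path the volume was mounted on.
    """
    mount_point = Path(mount_point)
    if not mount_point.is_dir():
        # A missing directory usually means the volume is not mounted.
        raise FileNotFoundError(f"{mount_point} is not a directory (is the volume mounted?)")
    # An ordinary file copy; no GlusterFS-specific API is involved.
    return Path(shutil.copy2(local_file, mount_point))
```

For example, `store_file("report.csv", "/mnt/glusterfs")` would place the file on the volume, where `/mnt/glusterfs` is only a placeholder for your own mount point.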

But I found GlusterFS hard to integrate with the Hadoop ecosystem (Spark, MapReduce, etc.), whereas HDFS is supported by almost every component in it. I think GlusterFS is a good fit for building a file storage cluster that is independent of Hadoop.

answered Sep 30 '22 00:09

Manh Hoang Ha

I tried Ceph as a "drop-in" HDFS replacement in Hadoop 2.7 and, after solving many integration issues, found it two to three times slower than HDFS with the default replication factor in the TeraSort benchmark. I don't know the reason for this. Other people tried a different approach with similar results:

http://www.snia.org/sites/default/files/SDC15_presentations/cloud_files/YuanZhou_big_data_analytics_on_object_store_r3.pdf
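If you want to repeat this kind of comparison on your own clusters, a rough way is to time the same job against each backend and look at the ratio. The sketch below is only an assumption about how one might do that, not the harness used for the numbers above; `time_command` and `slowdown` are hypothetical helper names.

```python
import subprocess
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the job exits with a non-zero status
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def slowdown(baseline_seconds, candidate_seconds):
    """Return how many times slower the candidate run was than the baseline."""
    return candidate_seconds / baseline_seconds
```

You would call `time_command` once with your TeraSort command line for the HDFS-backed cluster and once for the Ceph-backed one, then pass both timings to `slowdown`. Wall-clock time of a single run is noisy, so several runs per backend give a fairer picture.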

Is it a good idea to combine object and HDFS storage? I think the question is not framed correctly. Both HDFS (via Ozone and FUSE) and Ceph can be used as object storage and as regular POSIX filesystems. Ceph has an edge because it offers block storage as well, while for HDFS this is still under discussion: https://issues.apache.org/jira/browse/HDFS-11118. If the question is "can I expose my storage as a POSIX filesystem, object store and block store at the same time?", then the answer is: if the design satisfies your requirements for scalability and high availability, it could actually be a great idea.

answered Sep 30 '22 00:09

Dmitry Buzolin


Tags: hadoop, glusterfs, ceph
