Has anyone tried using GlusterFS or Ceph as the backend for Hadoop? I am not talking about just using a plugin to glue things together. Is the performance better than HDFS itself, and is it OK for production use?
Also, is it really a good idea to merge object storage and Hadoop HDFS storage into a single storage system, or is it better to keep them separate?
Critical Review: Red Hat Gluster Storage is a good NAS solution, suits DevOps-type environments, offers high capacity for backup and archive, and performs well for analytics and virtualization workloads.
It is used by several big companies and institutions (Facebook, Yahoo, LinkedIn, etc.). Ceph is a fairly young file system designed to provide great scalability, performance, and very good high-availability features.
Overview: Ceph is scalable object storage with block and file capabilities; Gluster is scalable file storage with object capabilities. The differences, of course, are more nuanced than this, based on the way each system handles the data it stores.
I have used GlusterFS before; it has some nice features, but in the end I chose HDFS as the distributed file system for Hadoop.
The nice thing about GlusterFS is that it doesn't require dedicated master nodes. Every node in the cluster is equal, so there is no single point of failure in GlusterFS. Another thing I find interesting about GlusterFS is its glusterfs-client module (http://www.jamescoyle.net/how-to/439-mount-a-glusterfs-volume): when you want to store a file on GlusterFS, you don't need to go through GlusterFS APIs; you just copy the file into the volume mounted by glusterfs-client and the job is done. It's that simple.
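For illustration, storing a file then reduces to an ordinary copy onto the mounted path. Here is a minimal Python sketch, assuming the volume is already mounted; the mount point /mnt/glusterfs and the input file are placeholders, not from a real setup:

```python
import shutil
from pathlib import Path

# Assumed mount point where the GlusterFS volume was mounted via glusterfs-client,
# e.g. `mount -t glusterfs server:/volume /mnt/glusterfs` (path is hypothetical).
GLUSTER_MOUNT = Path("/mnt/glusterfs")

def store_file(local_path: str) -> Path:
    """Copy a local file onto the GlusterFS volume using plain filesystem calls.

    No GlusterFS-specific API is involved: because the volume is mounted as a
    regular POSIX filesystem, an ordinary copy is all that is needed.
    """
    destination = GLUSTER_MOUNT / Path(local_path).name
    shutil.copy2(local_path, destination)
    return destination

if __name__ == "__main__":
    print(store_file("/tmp/example.csv"))  # hypothetical input file
```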
But I find GlusterFS hard to integrate with Hadoop ecosystem components such as Spark, MapReduce, etc., whereas HDFS is supported by almost every component in the Hadoop ecosystem. I think GlusterFS is better suited to building a file-storage cluster independent of Hadoop.
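To make the contrast concrete, here is a rough PySpark sketch; the NameNode host and port, the GlusterFS mount point, and the file paths are placeholders, not from a real cluster. Engines like Spark read hdfs:// URIs natively, while a GlusterFS volume without the Hadoop plugin is only reachable as a locally mounted path that must exist on every executor:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-comparison").getOrCreate()

# Reading from HDFS: every major Hadoop-ecosystem engine understands hdfs:// URIs
# natively (host and port below are placeholders for your NameNode).
hdfs_df = spark.read.text("hdfs://namenode:8020/data/events.log")

# Reading from a GlusterFS volume without the Hadoop plugin means falling back to
# a locally mounted path, which only works if every executor has the same mount.
gluster_df = spark.read.text("file:///mnt/glusterfs/data/events.log")

print(hdfs_df.count(), gluster_df.count())
spark.stop()
```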
I have tried Ceph as a "drop-in" HDFS replacement in Hadoop 2.7, and after solving many integration issues I found it two to three times slower than HDFS (with the default replication factor) in the terasort benchmark. I don't know the reason for this. Other folks tried a different approach with similar results:
http://www.snia.org/sites/default/files/SDC15_presentations/cloud_files/YuanZhou_big_data_analytics_on_object_store_r3.pdf
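For reference, the comparison essentially comes down to timing the stock teragen/terasort examples against each filesystem. Below is a rough Python sketch of such a driver, assuming the standard Hadoop examples jar; the jar path, row count, and output paths are placeholders, not my exact setup:

```python
import subprocess
import time

# Placeholders: adjust the examples jar path and row count for your cluster.
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar"
ROWS = 10_000_000  # teragen rows (100 bytes each, roughly 1 GB of input)

def run(args):
    """Run a Hadoop job and return its wall-clock time in seconds."""
    start = time.time()
    subprocess.run(args, check=True)
    return time.time() - start

# Generate the input data, then sort it; the same commands apply whether the
# default filesystem is HDFS or a Ceph-backed replacement (set in core-site.xml).
gen_secs = run(["hadoop", "jar", EXAMPLES_JAR, "teragen", str(ROWS), "/bench/tera-in"])
sort_secs = run(["hadoop", "jar", EXAMPLES_JAR, "terasort", "/bench/tera-in", "/bench/tera-out"])

print(f"teragen: {gen_secs:.1f}s, terasort: {sort_secs:.1f}s")
```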
Is it a good idea to combine object and HDFS storage? I think the question is not quite right. Both HDFS (via Ozone and FUSE) and Ceph can be used as object storage and as regular POSIX filesystems. Ceph has an edge in also offering block storage, while for HDFS this is still under discussion: https://issues.apache.org/jira/browse/HDFS-11118. If the question is "can I expose my storage as a POSIX filesystem, an object store, and a block store at the same time?", then the answer is: if such a design satisfies your requirements for scalability and high availability, it could actually be a great idea.
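As a sketch of what "the same cluster behind several interfaces" can look like with Ceph: the endpoint, credentials, bucket, and mount point below are all placeholders, and note that RGW objects and CephFS files live in separate namespaces even though they share the underlying cluster.

```python
import boto3

# Object interface: Ceph's RADOS Gateway speaks the S3 API, so plain boto3 works.
# Endpoint, credentials, and bucket are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.put_object(Bucket="analytics", Key="reports/daily.csv", Body=b"id,value\n1,42\n")

# POSIX interface: the same cluster exposed as CephFS (kernel or FUSE mount) is
# just ordinary file I/O. The mount point is again a placeholder, and this file
# is separate from the RGW object above, since the two interfaces use different
# pools and namespaces.
with open("/mnt/cephfs/analytics/reports/daily.csv", "rb") as fh:
    print(fh.read())
```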